LakshSingla opened a new issue, #14871:
URL: https://github.com/apache/druid/issues/14871

   ## Unifying MSQ's exception system and the new `DruidException` 
   
   ### Motivation
   Currently, MSQ uses the `MSQFault` classes to represent a fault that has 
occurred while executing an MSQ task. 
https://github.com/apache/druid/pull/14004 recently introduced the 
`DruidException` class, which is an attempt to re-structuring the user-visible 
errors. It makes sense that we should either unify the MSQ's exceptions with 
`DruidExceptions` or provide an analog in MSQ for the latter since these faults 
are delivered to the end user. This feature intends to lay out a proposal for 
unifying MSQ's exceptions with `DruidExceptions`.
   
   ### Current issue with the MSQ's exceptions
   Currently, many of the MSQ exceptions mention the error appropriately for 
someone who has knowledge of MSQ or has been using it for a while, however, 
they fail to provide an alternative to the user-facing the error. Consequently, 
if someone has seen and resolved an MSQ exception earlier, it becomes easy for 
them to debug the cause, while for a new user, it becomes difficult to figure 
out an alternative. There have been some PRs that improve the error messaging 
around existing faults (https://github.com/apache/druid/pull/13794, 
https://github.com/apache/druid/pull/14475, 
https://github.com/apache/druid/pull/14511). This change would ensure that 
MSQ's faults and their messaging are constrained to the same conventions as we 
want the `DruidException` to be.
   
   The aim of the change would be to align the exceptions generated by MSQ to 
DruidExceptions, while also allowing MSQ to use these faults to perform its 
functions (the main ones being worker retry and generating error reports).
   
    ### MSQ's Exception System
   
   MSQ's exception system consists of 2 main classes:
   1. `MSQException` - This is a wrapper around the `MSQFaults`. It doesn't 
contain any logic of its own and works to extract the exception message from 
the `MSQFault`. There is however a hidden dependency of the worker retry on the 
error message - We deduce the fault type from the message in 
`MSQFaultUtils#getErrorCodeFromMessage`. Therefore any change should ensure 
that this function still performs correctly.
   2. `MSQFault` & `BaseMSQFault` - Parent interface and class from all the 
individual faults are derived. These contain the relevant information to return 
the error code and the error message. 
   
   
   ### Unifying
   Unifying the two separate error systems, in my opinion, should involve 
unifying the two interfaces
   1. `ErrorResponse` with `MSQFault` - Both of these classes can be used to 
represent the information across the wire, and both act as a precursor to the 
final exception message that gets thrown.
   2. `DruidException` with `MSQException` - These represent the final 
exceptions that are thrown to the end user. Both of them can't be serded, and 
both are a representation of the high-level exception message that gets 
propagated to the user.
   
   #### 1. Making `MSQException` a subclass of the `DruidException`
   The end goal should be that we can construct a `DruidException` from 
`MSQException`, which would ensure that MSQ's errors conform to the norms of 
`DruidException`. Subclassing would ensure that we always have the necessary 
info `errorCode`, `targetPersona`, and `category` while constructing the 
`MSQException` (which are necessary fields in the `DruidExceptions`. Also, 
having distinct classes would ensure that 
    a) User-facing impact is minimum 
    b) We don't have to tweak MSQ logic relying on fetching the `MSQFault` from 
the exception. (eg: `MSQErrorReport#getFaultFromException`)
   
   #### 2. Creating a way to construct `ErrorResponse` from the `MSQFault`
   We would need to expose Persona and Category from the fault, which can be 
done as minor implementation detail below. We would need to maintain 
compatibility while upgrading/downgrading between workers and the controller
   1. Extend the MSQFault to expose Persona (`Persona getPersona()` and 
Category (`Category getCategory()`)
   2. Extend the MSQFault to expose a map containing Persona and Category 
   
   **_Open question: Would the MSQ's error code be the same as 
`ErrorResponse#errorCode`?_**
   
   #### 3. Cleaning up the `MSQFault` before returning it to the report
   Before adding the fault to the report, we should clear the extra information 
(category, persona) present in the `MSQFault` that helped us create the 
`MSQException`. This isn't exposed to the end user while showing 
`DruidException` therefore MSQ should remove it as well. 
   
   #### 4. Updating the error message of various classes - There has been some 
work done already to make the most commonly encountered error messages more 
informative. The initiative would involve converting the rest of them as well 
(or probably in follow-up patches) 
   
   ### Summary of above
   ```java
   
   @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "errorCode")
   public interface MSQFault
   {
   ....
     DruidException.Persona getPersona();
   
     DruidException.Category getCategory();
   
     String getErrorCode();
   }
   
   public class MSQException extends DruidException
   {
   ....
     public MSQException(
         @Nullable final Throwable cause,
         final MSQFault fault
     )
     {
       super(cause, fault.getErrorCode(), fault.getPersona(), 
fault.getCategory(), MSQFaultUtils.generateMessageWithErrorCode(fault));
       this.fault = fault;
     }
   
   
   }
   
   ```
   
   ### Dev impact
   Any new fault added to MSQ will conform to the same restrictions on the 
`DruidExceptions`, and hopefully will lead to better and more actionable error 
messages.
   
   ### User-facing impact
   The user-facing impact would be kept minimal assuming that someone doesn't 
pattern match the fault's exception message. We should ensure that the error 
code that we return doesn't change. From my understanding, we won’t be using 
the error codes set by the `Category` anywhere in the MSQ layer therefore we 
won’t be changing this.
   
   ### Migration
   While upgrading or downgrading a cluster between versions that have and 
don't have this change, we should ensure that we are maintaining compatibility 
between the different tasks. Since we won't be changing any on the wire 
transfer format, there shouldn't be any issue. 
   
   #### 1. The controller is on a newer version, while the worker is on an 
older version
   The controller would receive an error report (wrapping an MSQ fault) without 
the required fields. The controller should make sure to create an 
`MSQException` with default parameters of the new fields in that case
   
   #### 2. The controller is on an older version, while the worker is on a 
newer version
   The controller would receive an error report (wrapping an MSQ fault) with 
extra fields. While serializing it back to the MSQFault, the controller would 
lose those fields and continue to work on the fault as if nothing has changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to