kfaraz opened a new issue #11165:
URL: https://github.com/apache/druid/issues/11165


   ## Motivation
   
   Task and query failures in Druid are often difficult to analyze due to 
missing, incomplete or vague error messages.
   
   A unified error reporting mechanism would improve the experience of a Druid 
user through:
   - Easier debugging and root cause analysis (RCA) without having to look at server logs
   - Richer error messages detailing what went wrong and possible actions for 
mitigation
   - Homogeneous error reporting across different Druid services, modules and 
extensions
   - Specifying the severity of errors and other potential side effects
   - Hiding implementation and other sensitive details from the end user
   
   ## Overview of Changes
   
   ### New Classes
   
   - `ErrorTypeProvider`: Multi-bound interface to be implemented by core Druid as well as by any extension that needs to register error types
     - `String getModuleName()`: Namespace denoting name of the extension (or 
`"druid"` in case of core Druid). Must be unique across extensions.
     - `List<ErrorType> getErrorTypes()`: List of error types for the extension
   
   - `ErrorType`: Denotes a specific category of an error
     - `int code`: Integer code denoting a specific type of error within the 
namespace. Must be unique within the module.
     - `String messageFormat`: Contains placeholders that can be replaced to 
get the full error message 
     - additional details e.g. severity
   
   - `ErrorTypeParams`: Denotes the occurrence of an error. Contains params to 
identify and format the actual `ErrorType` 
     - `String moduleName`
     - `int code`
     - `List<String> messageArgs`: arguments for the placeholders in `messageFormat`. Their total length is limited (the current limit on `TaskStatus.errorMsg` is 100 characters)
   
   - `DruidTypedException`: exception that corresponds to an error type
     - `ErrorTypeParams errorTypeParams`
     - `Throwable cause`: optional
   
   - `ErrorMessageFormatter`: (singleton) class that maintains an in-memory 
mapping from `(moduleName, code)` pair to `ErrorType`
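
   The two value classes above could look roughly like the following. This is a minimal sketch and not the proposed implementation: the `of` factories mirror the calls used in the snippets of this issue, while the constant `MAX_TOTAL_ARGS_LENGTH`, its value and the truncation policy are assumptions made for illustration.

   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.List;

   /** A category of error: a code that is unique within its module, plus a message template. */
   final class ErrorType {
       private final int code;
       private final String messageFormat;

       private ErrorType(int code, String messageFormat) {
           this.code = code;
           this.messageFormat = messageFormat;
       }

       static ErrorType of(int code, String messageFormat) {
           return new ErrorType(code, messageFormat);
       }

       int getCode() { return code; }
       String getMessageFormat() { return messageFormat; }
   }

   /** One occurrence of an error: identifies the ErrorType and carries the template arguments. */
   final class ErrorTypeParams {
       // Assumed cap on the combined length of all arguments (hypothetical name and value).
       static final int MAX_TOTAL_ARGS_LENGTH = 100;

       private final String moduleName;
       private final int code;
       private final List<String> messageArgs;

       private ErrorTypeParams(String moduleName, int code, List<String> messageArgs) {
           this.moduleName = moduleName;
           this.code = code;
           this.messageArgs = Collections.unmodifiableList(messageArgs);
       }

       static ErrorTypeParams of(String moduleName, int code, String... args) {
           List<String> capped = new ArrayList<>();
           int remaining = MAX_TOTAL_ARGS_LENGTH;
           for (String arg : args) {
               // Truncate each argument to the budget that is left; later ones may become empty.
               String kept = arg.length() <= remaining ? arg : arg.substring(0, remaining);
               capped.add(kept);
               remaining -= kept.length();
           }
           return new ErrorTypeParams(moduleName, code, capped);
       }

       String getModuleName() { return moduleName; }
       int getCode() { return code; }
       List<String> getMessageArgs() { return messageArgs; }
   }
   ```

   Truncating instead of dropping arguments keeps the number of arguments equal to the number of placeholders, so the message can still be formatted later.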
   
   ### Flow
   
   - Core Druid and extensions register their respective error types on the Overlord at startup (extensions that are not loaded on the Overlord are addressed in a later section)
   - An in-memory mapping is maintained from `(moduleName, code)` pair to the 
respective `ErrorType`
   - The persisted `TaskStatus` of any failed task contains an 
`ErrorTypeParams` rather than the full error message
   - When the status of a Task is requested, the `ErrorTypeParams` of the 
`TaskStatus` are used by the `ErrorMessageFormatter` to construct the full 
error message, which is then sent back in the API response
   
   ## Code Snippets
   
   ### Throwing an Exception
   
   e.g., for an extension `kafka-emitter`:
   ```java
   final String topicName = ...;
   try {
        // ...
        // Execution happens here
        // ...
   } catch (InvalidTopicException topicEx) {
        throw new DruidTypedException(
                ErrorTypeParams.of(
                        KafkaEmitterErrorTypes.MODULE_NAME, // "kafka-emitter"
                        KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
                        // message arguments
                        topicName),
                topicEx
        );
   }
   ```
   
   ### Registering Error Types
   
   Binding the `ErrorTypeProvider`:
   ```java
   @Override
   public void configure(Binder binder) {
        Multibinder.newSetBinder(binder, ErrorTypeProvider.class)
                .addBinding()
                .to(KafkaEmitterErrorTypeProvider.class);
   }
   ```
   
   Listing the error types:
   ```java
   public class KafkaEmitterErrorTypeProvider implements ErrorTypeProvider {
   
        @Override
        public String getModuleName() {
                return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter";
        }
   
        @Override
        public List<ErrorType> getErrorTypes() {
                // return the list of all error types for this extension
                return Arrays.asList(
                        ...
                        // example error type for invalid topic
                        ErrorType.of(
                                KafkaEmitterErrorTypes.INVALID_TOPIC,
                                "The given topic name [%s] is invalid. Please provide a valid topic name.")
                        ...
                );
        }
   
   }
   ```
   
   Mapping error codes to types:
   ```java
   public class ErrorMessageFormatter {
   
        // moduleName -> (error code -> ErrorType)
        private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = ...;
   
        @Inject
        public ErrorMessageFormatter(Set<ErrorTypeProvider> errorTypeProviders) {
                for (ErrorTypeProvider provider : errorTypeProviders) {
                        final String moduleName = provider.getModuleName();
   
                        // Add all error types of this module to a per-module map
                        final Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
                        for (ErrorType errorType : provider.getErrorTypes()) {
                                errorTypeMap.put(errorType.getCode(), errorType);
                        }
   
                        // Ensure that there are no module name clashes
                        if (moduleToErrorTypes.putIfAbsent(moduleName, errorTypeMap) != null) {
                                throw new IllegalStateException(
                                        "Duplicate error module name [" + moduleName + "]");
                        }
                }
        }
   }
   ```
   
   ### Building the full error message (to serve UI requests)
   
   ```java
   public class ErrorMessageFormatter {
   
        ...
   
        public String getErrorMessage(ErrorTypeParams errorParams) {
                ErrorType errorType = moduleToErrorTypes
                        .get(errorParams.getModuleName())
                        .get(errorParams.getCode());
   
                return String.format(
                        errorType.getMessageFormat(),
                        errorParams.getMessageArgs().toArray());
        }
   
        ...
   
   }
   ```
   
   ## Design Concerns
   Pros of using error codes:
   - Persisting a task or query status takes less memory and disk space, since only the module name, error code and message arguments are stored rather than the full error message.
   - Less verbose logs
   
   ### Task Failures
   Ingestion and compaction tasks are managed by the Overlord. Thus, the 
Overlord needs to be aware of the error types to be able to serve task statuses 
over REST APIs.
   
   ### Query Failures
   Queries (SQL and native) are submitted over HTTP connections and the 
response can contain the detailed error message in case of failures. Thus the 
Broker need not be aware of the list of error types as there is no persistence 
of query status (and hence no requirement of persisting error codes and 
formatting the error messages when requested).
   
   ### Extensions that are not loaded on Overlord
   Several Druid extensions are not loaded on the Overlord and run only on the Middle Managers/Peons. The Overlord is therefore not aware of the error types that these extensions can throw.
   
   The approach here can be similar to that in Query Failures above. When reporting a failure to the Overlord, the Middle Manager can send back both the `ErrorType` object (which denotes the category of the error) and the `ErrorTypeParams` (which denotes a specific error event). The Overlord can then persist the received `ErrorTypeParams` in its task status while also adding an entry to its error type mappings.
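
   The Overlord-side handling described above can be sketched as follows. This is only an illustration under stated assumptions: `ErrorTypeRegistry`, `registerIfAbsent` and the fallback text for unknown types are hypothetical names, and `ErrorType` is reduced to a minimal stand-in so that the example is self-contained.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   /** Minimal stand-in for the proposal's ErrorType (code + message template). */
   final class ErrorType {
       final int code;
       final String messageFormat;

       ErrorType(int code, String messageFormat) {
           this.code = code;
           this.messageFormat = messageFormat;
       }
   }

   /** Hypothetical Overlord-side registry that also learns error types reported by Middle Managers. */
   final class ErrorTypeRegistry {
       private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes = new ConcurrentHashMap<>();

       /**
        * Adds the error type if the (moduleName, code) pair is unknown.
        * Returns the type that ends up registered, which is the existing one on a repeat report.
        */
       ErrorType registerIfAbsent(String moduleName, ErrorType reported) {
           Map<Integer, ErrorType> byCode =
               moduleToErrorTypes.computeIfAbsent(moduleName, m -> new ConcurrentHashMap<>());
           ErrorType existing = byCode.putIfAbsent(reported.code, reported);
           return existing != null ? existing : reported;
       }

       /** Formats a message, falling back to a generic text when the type is still unknown. */
       String format(String moduleName, int code, Object... args) {
           Map<Integer, ErrorType> byCode = moduleToErrorTypes.get(moduleName);
           ErrorType type = byCode == null ? null : byCode.get(code);
           if (type == null) {
               return "Unknown error [" + moduleName + ":" + code + "]";
           }
           return String.format(type.messageFormat, args);
       }
   }
   ```

   Keeping the first registration and ignoring later reports of the same pair is one possible policy; the proposal does not say how conflicting reports should be resolved.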
   
   ### Storing the mappings from Error Code to Error Type
   In the design discussed above, the error types are maintained in memory on the Overlord. If extensions register a large number of error codes for rare scenarios, the mapping would occupy memory that could be put to better use.
   
   An alternative approach could be to persist the error types in the metadata store and access them through a small in-memory cache.
   
   Pros:
   - Only the frequently occurring error types would be present in the warmed 
up cache.
   - Central repo for all error types that can be accessed by both Overlord and 
Coordinator
   
   Cons:
   - De-duplication of module names and integer error codes would be more 
expensive
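
   A small cache in front of the metadata store could be sketched as below. The class `CachedErrorTypeLookup`, the `"module:code"` key format and the `Function` that stands in for the metadata store query are all assumptions made for illustration; the eviction policy shown is plain LRU built on `LinkedHashMap`.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.function.Function;

   /** Hypothetical LRU cache in front of a slower error-type store (e.g. the metadata store). */
   final class CachedErrorTypeLookup {
       // Stand-in for a metadata store query: key "module:code" -> message format, or null.
       private final Function<String, String> store;
       private final Map<String, String> cache;

       CachedErrorTypeLookup(final int maxEntries, Function<String, String> store) {
           this.store = store;
           // accessOrder = true keeps entries ordered from least to most recently used.
           this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
               @Override
               protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                   return size() > maxEntries;
               }
           };
       }

       /** Returns the message format for the pair, loading it from the store on a cache miss. */
       synchronized String getMessageFormat(String moduleName, int code) {
           String key = moduleName + ":" + code;
           String format = cache.get(key);
           if (format == null) {
               format = store.apply(key);
               if (format != null) {
                   cache.put(key, format);
               }
           }
           return format;
       }
   }
   ```

   With this shape, only error types that are actually looked up occupy memory, at the cost of a store round trip on every cache miss.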
   

