kfaraz opened a new issue #11165:
URL: https://github.com/apache/druid/issues/11165
## Motivation
Task and query failures in Druid are often difficult to analyze due to
missing, incomplete or vague error messages.
A unified error reporting mechanism would improve the experience of a Druid
user through:
- Easier debugging and RCA (without looking at server logs)
- Richer error messages detailing what went wrong and possible actions for
mitigation
- Homogeneous error reporting across different Druid services, modules and
extensions
- Specifying the severity of errors and other potential side effects
- Hiding implementation and other sensitive details from the end user
## Overview of Changes
### New Classes
- `ErrorTypeProvider`: Multi-bound interface to be implemented by core Druid
  as well as any extension that needs to register error types
  - `String getModuleName()`: Namespace denoting the name of the extension (or
    `"druid"` in case of core Druid). Must be unique across extensions.
  - `List<ErrorType> getErrorTypes()`: List of error types for the extension
- `ErrorType`: Denotes a specific category of error
  - `int code`: Integer code denoting a specific type of error within the
    namespace. Must be unique within the module.
  - `String messageFormat`: Contains placeholders that can be replaced to
    get the full error message
  - Additional details, e.g. severity
- `ErrorTypeParams`: Denotes the occurrence of an error. Contains params to
  identify and format the actual `ErrorType`
  - `String moduleName`
  - `int code`
  - `List<String> messageArgs`: total length of args is limited (current
    limit on `TaskStatus.errorMsg` is 100)
- `DruidTypedException`: Exception that corresponds to an error type
  - `ErrorTypeParams errorTypeParams`
  - `Throwable cause`: optional
- `ErrorMessageFormatter`: (singleton) class that maintains an in-memory
  mapping from `(moduleName, code)` pair to `ErrorType`
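
The classes listed above could be sketched roughly as follows. This is a minimal sketch, not the proposed implementation: the class, field and method names come from the list, while the constructors, the immutability, and the choice of `RuntimeException` as the base class are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: names follow the list above, everything else is an assumption.
interface ErrorTypeProvider {
  String getModuleName();          // unique across extensions, "druid" for core

  List<ErrorType> getErrorTypes(); // all error types of this module
}

final class ErrorType {
  private final int code;
  private final String messageFormat;

  private ErrorType(int code, String messageFormat) {
    this.code = code;
    this.messageFormat = messageFormat;
  }

  static ErrorType of(int code, String messageFormat) {
    return new ErrorType(code, messageFormat);
  }

  int getCode() {
    return code;
  }

  String getMessageFormat() {
    return messageFormat;
  }
}

final class ErrorTypeParams {
  private final String moduleName;
  private final int code;
  private final List<String> messageArgs;

  private ErrorTypeParams(String moduleName, int code, List<String> messageArgs) {
    this.moduleName = moduleName;
    this.code = code;
    this.messageArgs = messageArgs;
  }

  static ErrorTypeParams of(String moduleName, int code, String... messageArgs) {
    // Copy the varargs array so later changes by the caller are not visible
    return new ErrorTypeParams(
        moduleName,
        code,
        Collections.unmodifiableList(Arrays.asList(messageArgs.clone())));
  }

  String getModuleName() {
    return moduleName;
  }

  int getCode() {
    return code;
  }

  List<String> getMessageArgs() {
    return messageArgs;
  }
}

// Assumed to be unchecked; the proposal does not name a base class.
class DruidTypedException extends RuntimeException {
  private final ErrorTypeParams errorTypeParams;

  DruidTypedException(ErrorTypeParams errorTypeParams, Throwable cause) {
    super(
        "Error [" + errorTypeParams.getModuleName() + ":" + errorTypeParams.getCode() + "]",
        cause);
    this.errorTypeParams = errorTypeParams;
  }

  ErrorTypeParams getErrorTypeParams() {
    return errorTypeParams;
  }
}
```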
### Flow
- Core Druid and extensions register their respective error types on startup
on the Overlord (extensions that are not loaded on the Overlord are addressed
later)
- An in-memory mapping is maintained from `(moduleName, code)` pair to the
respective `ErrorType`
- The persisted `TaskStatus` of any failed task contains an
`ErrorTypeParams` rather than the full error message
- When the status of a Task is requested, the `ErrorTypeParams` of the
`TaskStatus` are used by the `ErrorMessageFormatter` to construct the full
error message, which is then sent back in the API response
## Code Snippets
### Throwing an Exception
e.g., for an extension `kafka-emitter`:
```java
final String topicName = ...;
try {
  // ...
  // Execution happens here
  // ...
} catch (InvalidTopicException topicEx) {
  throw new DruidTypedException(
      ErrorTypeParams.of(
          KafkaEmitterErrorTypes.MODULE_NAME,   // "kafka-emitter"
          KafkaEmitterErrorTypes.INVALID_TOPIC, // integer error code
          // message arguments
          topicName),
      topicEx
  );
}
```
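
The snippet above refers to a `KafkaEmitterErrorTypes` constants holder that is not shown in this proposal. A minimal sketch of what it might look like, assuming plain static constants; the value `1` for `INVALID_TOPIC` is hypothetical, as the proposal only says it is an integer code unique within the module:

```java
// Hypothetical constants holder for the kafka-emitter extension.
final class KafkaEmitterErrorTypes {
  // Namespace of the extension, must be unique across extensions
  static final String MODULE_NAME = "kafka-emitter";

  // Assumed value; any integer unique within this module would do
  static final int INVALID_TOPIC = 1;

  private KafkaEmitterErrorTypes() {
    // constants only, not meant to be instantiated
  }
}
```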
### Registering Error Types
Binding the `ErrorTypeProvider`:
```java
@Override
public void configure(Binder binder) {
  Multibinder.newSetBinder(binder, ErrorTypeProvider.class)
             .addBinding()
             .to(KafkaEmitterErrorTypeProvider.class);
}
```
Listing the error types:
```java
public class KafkaEmitterErrorTypeProvider implements ErrorTypeProvider {
  @Override
  public String getModuleName() {
    return KafkaEmitterErrorTypes.MODULE_NAME; // "kafka-emitter"
  }

  @Override
  public List<ErrorType> getErrorTypes() {
    // Return the list of all error types for this extension
    return Arrays.asList(
        ...
        // Example error type for invalid topic
        ErrorType.of(
            KafkaEmitterErrorTypes.INVALID_TOPIC,
            "The given topic name [%s] is invalid. Please provide a valid topic name.")
        ...
    );
  }
}
```
Mapping error codes to types:
```java
public class ErrorMessageFormatter {
  private final Map<String, Map<Integer, ErrorType>> moduleToErrorTypes =
      ...;

  @Inject
  public ErrorMessageFormatter(Set<ErrorTypeProvider> errorTypeProviders) {
    for (ErrorTypeProvider provider : errorTypeProviders) {
      // Ensure that there are no module name clashes
      final String moduleName = provider.getModuleName();
      if (moduleToErrorTypes.containsKey(moduleName)) {
        throw new IllegalStateException("Duplicate module name: " + moduleName);
      }

      // Add all error types of this module to the map
      Map<Integer, ErrorType> errorTypeMap = new ConcurrentHashMap<>();
      for (ErrorType errorType : provider.getErrorTypes()) {
        errorTypeMap.put(errorType.getCode(), errorType);
      }
      moduleToErrorTypes.put(moduleName, errorTypeMap);
    }
  }
}
```
### Building the full error message (to serve UI requests)
```java
public class ErrorMessageFormatter {
  ...
  public String getErrorMessage(ErrorTypeParams errorParams) {
    // Look up the error type registered for this (moduleName, code) pair
    ErrorType errorType = moduleToErrorTypes
        .get(errorParams.getModuleName())
        .get(errorParams.getCode());
    return String.format(
        errorType.getMessageFormat(),
        errorParams.getMessageArgs().toArray());
  }
  ...
}
```
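
The lookup above assumes that every `(moduleName, code)` pair is registered and that the number of arguments matches the placeholders. A minimal sketch of a more defensive variant is given below. The class name `SafeErrorMessageLookup`, its `register` method and the fallback message texts are all hypothetical, and plain strings stand in for `ErrorType` to keep the example self-contained:

```java
import java.util.HashMap;
import java.util.IllegalFormatException;
import java.util.List;
import java.util.Map;

// Hypothetical helper: falls back to a generic message instead of failing.
final class SafeErrorMessageLookup {
  // moduleName -> (code -> message format)
  private final Map<String, Map<Integer, String>> formats = new HashMap<>();

  void register(String moduleName, int code, String messageFormat) {
    formats.computeIfAbsent(moduleName, k -> new HashMap<>()).put(code, messageFormat);
  }

  String getErrorMessage(String moduleName, int code, List<String> messageArgs) {
    Map<Integer, String> byCode = formats.get(moduleName);
    String format = byCode == null ? null : byCode.get(code);
    if (format == null) {
      // Unknown module or code: return the raw params rather than throwing
      return "Unknown error [" + moduleName + ":" + code + "] with args " + messageArgs;
    }
    try {
      return String.format(format, messageArgs.toArray());
    } catch (IllegalFormatException e) {
      // Placeholders and args do not match: return the unformatted pieces
      return format + " " + messageArgs;
    }
  }
}
```

Returning a degraded message keeps the task status API usable even when an error type is missing, which is one possible design choice; failing fast would be another.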
## Design Concerns
Pros of using error codes:
- Any operation that persists a task or query status would have a
smaller memory/disk footprint.
- Less verbose logs
### Task Failures
Ingestion and compaction tasks are managed by the Overlord. Thus, the
Overlord needs to be aware of the error types to be able to serve task statuses
over REST APIs.
### Query Failures
Queries (SQL and native) are submitted over HTTP connections and the
response can contain the detailed error message in case of failures. Thus the
Broker need not be aware of the list of error types as there is no persistence
of query status (and hence no requirement of persisting error codes and
formatting the error messages when requested).
### Extensions that are not loaded on Overlord
There are several extensions in Druid which are not loaded on the Overlord
and run only on the Middle Managers/Peons. As these are not loaded on the
Overlord, it is not aware of the error types that these extensions can throw.
The approach here can be similar to that in Query Failures above. While
communicating with the Overlord, the Middle Manager can send back both the
`ErrorType` object (which denotes the category of the error) and the
`ErrorTypeParams` (which denotes a specific error event). The Overlord can then
persist the received `ErrorTypeParams` in its task status while also adding an
entry to its error type mappings.
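
The Overlord-side step described above, adding an entry for an error type it first learns about from a Middle Manager, could be sketched as follows. The class `RemoteErrorTypeRegistry` and its methods are hypothetical, and message formats are stored as plain strings to keep the example self-contained:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Overlord-side registry for error types reported by workers.
final class RemoteErrorTypeRegistry {
  // moduleName -> (code -> message format)
  private final Map<String, Map<Integer, String>> moduleToFormats = new ConcurrentHashMap<>();

  /**
   * Registers the error type received along with a task failure.
   * Returns true if the type was new, false if it was already known
   * (the existing entry is kept in that case).
   */
  boolean registerIfAbsent(String moduleName, int code, String messageFormat) {
    return moduleToFormats
        .computeIfAbsent(moduleName, k -> new ConcurrentHashMap<>())
        .putIfAbsent(code, messageFormat) == null;
  }

  /** Returns the message format, or null if the type is not known. */
  String getMessageFormat(String moduleName, int code) {
    Map<Integer, String> byCode = moduleToFormats.get(moduleName);
    return byCode == null ? null : byCode.get(code);
  }
}
```

Keeping the first registered entry is an assumption; the proposal does not say how conflicting definitions from different workers should be resolved.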
### Storing the mappings from Error Code to Error Type
In the design discussed above, the error types are maintained in-memory (in
the Overlord). If extensions register too many error codes for rare scenarios,
the mapping would occupy memory that could be put to better use.
An alternative approach could be to persist the error types in the metadata
store, accessed via a small in-memory cache.
Pros:
- Only the frequently occurring error types would be present in the warmed
up cache.
- Central repo for all error types that can be accessed by both Overlord and
Coordinator
Cons:
- De-duplication of module names and integer error codes would be more
expensive
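
The alternative approach above, a metadata store fronted by a small in-memory cache, could be sketched with a bounded LRU cache. This is an illustration under stated assumptions: `ErrorTypeCache` is a hypothetical name, the metadata store is modeled as a plain `Function` from a `"moduleName:code"` key to a message format, and eviction uses the standard `LinkedHashMap` access-order mechanism:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded LRU cache in front of a metadata store lookup.
final class ErrorTypeCache {
  private final Function<String, String> metadataStoreLookup;
  private final Map<String, String> cache;

  ErrorTypeCache(int maxEntries, Function<String, String> metadataStoreLookup) {
    this.metadataStoreLookup = metadataStoreLookup;
    // accessOrder = true makes iteration order least-recently-used first
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        // Evict the least recently used entry once the bound is exceeded
        return size() > maxEntries;
      }
    };
  }

  synchronized String getMessageFormat(String moduleName, int code) {
    String key = moduleName + ":" + code;
    String format = cache.get(key);
    if (format == null) {
      // Cache miss: fall back to the (assumed) metadata store
      format = metadataStoreLookup.apply(key);
      if (format != null) {
        cache.put(key, format);
      }
    }
    return format;
  }

  synchronized int size() {
    return cache.size();
  }
}
```

With such a cache, only recently requested error types stay in memory, at the cost of a metadata store read on each miss.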