paul-rogers opened a new pull request, #13884:
URL: https://github.com/apache/druid/pull/13884

   Proposed is a revision to Druid’s error reporting structure. An [earlier 
PR](https://github.com/apache/druid/pull/13798) and 
[issue](https://github.com/apache/druid/issues/13123) introduced the concept of 
better error reporting. Discussion around that work refined and expanded the 
set of requirements: 
   
   1. Better error messages for Druid users.
   2. Build on patterns established by MSQ.
   3. Require that users think of the persona who can act on each error message.
   4. Unique error code for each error condition.
   5. Deployments can customize error messages.
   6. Easier to search for specific log messages in a log management system.
   
   The key new concepts in this PR are explained below.
   
   This PR is another draft of the concept to solicit further comment.
   
   #### `MSQFault`
   
   This PR models the `DruidException` class on the `MSQFault` class where 
feasible. [A previous 
note](https://github.com/apache/druid/pull/13798#issuecomment-1435377632) 
explained that `MSQFault` is tightly coupled to the MSQ error report framework 
and is thus not easily reusable. The present version of `DruidException` 
borrows the idea of creating subclasses for each kind of error, and of holding 
error “parameters” as distinct values, rather than only as fields in the 
message text. This PR also borrows the idea of creating groups of error 
subclass to represent error types.
   
   #### Personas
   
   Writing good error messages is hard. To encourage developers to write better 
messages, we require the developer to think about who can act on the message: 
the target persona. All errors are sent to the user, but the user may not be 
the person who can fix the problem. For example, if the network is down, user 
operations will fail, but the user is generally not the persona who can fix 
network issues.
   
   To help, every exception must have a target persona identified. Since 
classes of errors all tend to have the same target persona, the persona is 
generally set on a base class, which reduces the overhead of specifying the 
person each time we raise an exception.
   
   The persona isn’t actually used in this PR: we just require that it be set 
as a mental exercise.
   
   #### Error Translation
   
   Druid is used in many different ways. Developers tend to use Druid to create 
and debug new features. New users tend to use Druid on a single machine and 
play the role of user, app developer and system administrator. In these cases, 
the end user wants the full text of each error as they play a number of roles.
   
   Once Druid lands in production, these roles are spread out across many 
different people. Druid is often embedded in part of a larger application. In 
those cases, the application may have its own terminology. Deployments may also 
chose to redact certain details from user messages. We call this “error 
translation.” Even if the text remains in English, the wording is changed to be 
consistent with the overall application.
   
   To support these use cases, this PR introduces the idea of separating the 
message “template” from the error “parameters”. Druid provides a default 
template. Deployments can provide a file with customized templates. A number of 
features are required to realize this idea.
   
   #### Unique Error Codes
   
   To aid translation, we propose that all errors have a unique text error 
code. Errors from the `/sql` endpoint already have an error code, but the code 
is not unique or each kind of error. The existing code is also a bit ad-hoc and 
unstructured. This draft suggests a three-part pattern:
   
   * General area, such as “SQL” or “MSQ”
   * Sub-area, such as “Parse” or “Validation”
   * Detail code, such as “InvalidToken” or “InvalidConstant”.
   
   The result is a unique code such as `SQL-Parse-InvalidToken` or 
`SQL-Validation-InvalidConstant`.
   
   The errors map to an exception class hierarchy: we define a class for SQL 
Parse errors and another for SQL Validation errors. Each point in the code 
which raises an exception identifies the detail code. When the “same” error is 
defined in multiple places, a function can ensure that the various instances 
all use the same code. We could eventually create a centralized map of error 
codes if needed. Doing so seems overly complex at this time.
   
   #### Error Parameters
   
   Most exceptions include information. Which token is invalid? Which constant 
is invalid?
   
   `MSQFault` creates a subclass for each distinct error “type”, defined as a 
unique set of error parameters. The earlier PR used an error “context” to hold 
(a subset of) exception parameters. This version combines the two ideas. We 
hold error parameters as a map of key/value pairs. The reason for a map, rather 
than fields, will become clear shortly.
   
   We combine error message templates with error parameters using the Apache 
Commons 
[`StringSubstitutor`](https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html)
 class. For example:
   
   ```text
   Template: Line [${line}], Position [${column}]: Column [${name}] does not 
exist in any datasource in the query
   Values:   {“line”: 3, “Column”: 4, “name”: “foo”}
   Result:   Line 3, Position 4: Column [foo] does not exist in any datasource 
in the query
   ```
   
   Suppose an application prefers to use the words “field” and “table”. Suppose 
also the application generates the SQL, so that error position in the SQL are 
not useful to the user. This application can provide its own template:
   
   ```text
   Template: Field “${name}” not found in any table. Ensure the field is valid.
   Result:   Field “foo” not found in any table. Ensure the field is valid.
   ```
   
   Because `StringSubstitutor` needs a map of values, it turns out to be 
simpler to store error parameters as a map than as individual fields. If the 
parameters were fields, each error class would need code to build a map from 
those fields.
   
   Example of raising an error with parameters:
   
   ```java
     throw new SqlValidationError(
        "ColumnNotFound"
           "Line [${line}], Position [${column}]: Column [${name}] does not 
exist in any datasource in the query"
           )
         .withValue("line", lineNo)
         .withValue("column", col)
         .withValue("name", columnName);
   ```
   
   The `SqlValidationError` class provides the top and second-level error code: 
`SQL-Validation`. The parameters in the template match the names of the values 
provided. As explained below, order of the values is important.
   
   #### Translation Files
   
   To implement error translation, an application simply provides a custom 
"error catalog" in `conf/druid/errors.properties`. The file is a normal Java 
properties file that uses the error code as the key:
   
   
   ```text
   SQL-Validation-ColumnNotFound=Field “${name}” not found in any table. Ensure 
the field is valid.
   ```
   
   If the file is not found, or cannot be read, or does not contain the key for 
a given error, then the Druid default format is used instead.
   
   This is the simplest possible solution to get started: we can refine as we 
gain feedback.
   
   #### Log Message Format
   
   Given the above mechanism, we now have three distinct forms of error text:
   
   * The default Druid formatting
   * The translated message, which may redact some information
   * The logged form of the error
   
   The logged message should provide all information to the administrator or 
other "privileged" user who must support the system. The log message must 
contain these details even if they are not returned to the user. Error logging 
is controlled by the `getMessage()` method. Thus, this method must return a 
form of the error that does not do translation.
   
   As we improve error messages, we will add helpful text about how to resolve 
issues. That text need not appear in the log.
   
   When logs are fed into a log search system (such as Splunk or 
ElasticSearch), administrators want a simple way to search for a given error. 
Thus, the log message should include the error code.
   
   Given this, the log format of an error, given by the `getMessage()` method, 
is of the form:
   
   ```text
   <code>: <key>=[<value>]...
   ```
   
   For example:
   
   ```text
   SQL-Validation-ColumnNotFound: line=[3], column=[4], name=[foo]
   ```
   
   To ensure the log format is stable, the `values` map in each exception is a 
`LinkedHashMap`: values appear in the log in the order in which they are added 
to the map.
   
   #### `getMessage()` and Tests
   
   Since `getMessage()` must produce the log form of a message, we use 
`toString()` to produce the default Druid format for the message. This format 
is primarily for debugging and testing.
   
   Messages are returned to the user via the REST response. In the proposed 
code, we do not serialize `DruidException` directly. Instead, we perform a 
conversion step from `DruidException` to the form needed for each REST call. 
The `/sql` endpoint uses one format, other RPC messages use other formats. (The 
multiple error formats is more of a "bug" than "feature", but we'll fix that 
another time.)
   
   The mechanism to produce the `/sql` REST response also handles the 
translation. That mechanism requires Guice injection of the properties with the 
name of the translation file.
   
   Many Druid tests use the standard mechanism to verify error text: ask a 
JUnit or Hamcrest matcher to check the return value from `getMessage()`. These 
tests must change to use the log format. Or, they must change to use custom 
code to check the value of `toString()` to check the default Druid error text.
   
   Once we settle on a design, a future commit will standardize SQL test error 
checking.
   
   #### Third Party Errors
   
   Druid uses Calcite to parse SQL. Druid has no control over Calcite messages. 
In this case, we translate the Calcite exception to a `DruidException` with one 
parameter, `message`, which holds the original Calcite text. Such text cannot 
be translated using the mechanism proposed here, though the entire message can 
be redacted.
   
   In such cases, we could expand the error translator to include a function. 
That function can perform regular expression matches to "peek inside" the third 
party error, and reformat the message that way. The result is not elegant, but 
does solve the problem.
   
   #### Release note
   
   This PR is a draft and is a proposal. If this design turns out to be the 
final one, we'll update this section to include:
   
   * The text of some error messages will change.
   * The error code for `/sql` messages will change. The new IDs are unique.
   * Description of how to customize error messages using an error message 
catalog.
   
   <hr>
   
   This PR has:
   
   - [X] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [X] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [X] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [X] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to