[GitHub] [druid] paul-rogers opened a new pull request, #13884: Draft 2 of the Proposed Revised Error Mechanism

via GitHub Sat, 04 Mar 2023 18:22:00 -0800


paul-rogers opened a new pull request, #13884:
URL: https://github.com/apache/druid/pull/13884

Proposed is a revision to Druid’s error reporting structure. An [earlier
PR](https://github.com/apache/druid/pull/13798) and
[issue](https://github.com/apache/druid/issues/13123) introduced the concept of
better error reporting. Discussion around that work refined and expanded the
set of requirements:

1. Better error messages for Druid users.
2. Build on patterns established by MSQ.
3. Require that users think of the persona who can act on each error message.
4. Unique error code for each error condition.
5. Deployments can customize error messages.
6. Easier to search for specific log messages in a log management system.

The key new concepts in this PR are explained below.

This PR is another draft of the concept to solicit further comment.

#### `MSQFault`

This PR models the `DruidException` class on the `MSQFault` class where
feasible. [A previous
note](https://github.com/apache/druid/pull/13798#issuecomment-1435377632)
explained that `MSQFault` is tightly coupled to the MSQ error report framework
and is thus not easily reusable. The present version of `DruidException`
borrows the idea of creating subclasses for each kind of error, and of holding
error “parameters” as distinct values, rather than only as fields in the
message text. This PR also borrows the idea of creating groups of error
subclass to represent error types.

#### Personas

Writing good error messages is hard. To encourage developers to write better
messages, we require the developer to think about who can act on the message:
the target persona. All errors are sent to the user, but the user may not be
the person who can fix the problem. For example, if the network is down, user
operations will fail, but the user is generally not the persona who can fix
network issues.

To help, every exception must have a target persona identified. Since
classes of errors all tend to have the same target persona, the persona is
generally set on a base class, which reduces the overhead of specifying the
person each time we raise an exception.

The persona isn’t actually used in this PR: we just require that it be set
as a mental exercise.

#### Error Translation

Druid is used in many different ways. Developers tend to use Druid to create
and debug new features. New users tend to use Druid on a single machine and
play the role of user, app developer and system administrator. In these cases,
the end user wants the full text of each error as they play a number of roles.

Once Druid lands in production, these roles are spread out across many
different people. Druid is often embedded in part of a larger application. In
those cases, the application may have its own terminology. Deployments may also
chose to redact certain details from user messages. We call this “error
translation.” Even if the text remains in English, the wording is changed to be
consistent with the overall application.

To support these use cases, this PR introduces the idea of separating the
message “template” from the error “parameters”. Druid provides a default
template. Deployments can provide a file with customized templates. A number of
features are required to realize this idea.

#### Unique Error Codes

To aid translation, we propose that all errors have a unique text error
code. Errors from the `/sql` endpoint already have an error code, but the code
is not unique or each kind of error. The existing code is also a bit ad-hoc and
unstructured. This draft suggests a three-part pattern:

* General area, such as “SQL” or “MSQ”
* Sub-area, such as “Parse” or “Validation”
* Detail code, such as “InvalidToken” or “InvalidConstant”.

The result is a unique code such as `SQL-Parse-InvalidToken` or
`SQL-Validation-InvalidConstant`.

The errors map to an exception class hierarchy: we define a class for SQL
Parse errors and another for SQL Validation errors. Each point in the code
which raises an exception identifies the detail code. When the “same” error is
defined in multiple places, a function can ensure that the various instances
all use the same code. We could eventually create a centralized map of error
codes if needed. Doing so seems overly complex at this time.

#### Error Parameters

Most exceptions include information. Which token is invalid? Which constant
is invalid?

`MSQFault` creates a subclass for each distinct error “type”, defined as a
unique set of error parameters. The earlier PR used an error “context” to hold
(a subset of) exception parameters. This version combines the two ideas. We
hold error parameters as a map of key/value pairs. The reason for a map, rather
than fields, will become clear shortly.

We combine error message templates with error parameters using the Apache
Commons
[`StringSubstitutor`](https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/StringSubstitutor.html)
class. For example:

```text
Template: Line [${line}], Position [${column}]: Column [${name}] does not
exist in any datasource in the query
Values: {“line”: 3, “Column”: 4, “name”: “foo”}
Result: Line 3, Position 4: Column [foo] does not exist in any datasource
in the query
```

Suppose an application prefers to use the words “field” and “table”. Suppose
also the application generates the SQL, so that error position in the SQL are
not useful to the user. This application can provide its own template:

```text
Template: Field “${name}” not found in any table. Ensure the field is valid.
Result: Field “foo” not found in any table. Ensure the field is valid.
```

Because `StringSubstitutor` needs a map of values, it turns out to be
simpler to store error parameters as a map than as individual fields. If the
parameters were fields, each error class would need code to build a map from
those fields.

Example of raising an error with parameters:

```java
throw new SqlValidationError(
"ColumnNotFound"
"Line [${line}], Position [${column}]: Column [${name}] does not
exist in any datasource in the query"
)
.withValue("line", lineNo)
.withValue("column", col)
.withValue("name", columnName);
```

The `SqlValidationError` class provides the top and second-level error code:
`SQL-Validation`. The parameters in the template match the names of the values
provided. As explained below, order of the values is important.

#### Translation Files

To implement error translation, an application simply provides a custom
"error catalog" in `conf/druid/errors.properties`. The file is a normal Java
properties file that uses the error code as the key:

```text
SQL-Validation-ColumnNotFound=Field “${name}” not found in any table. Ensure
the field is valid.
```

If the file is not found, or cannot be read, or does not contain the key for
a given error, then the Druid default format is used instead.

This is the simplest possible solution to get started: we can refine as we
gain feedback.

#### Log Message Format

Given the above mechanism, we now have three distinct forms of error text:

* The default Druid formatting
* The translated message, which may redact some information
* The logged form of the error

The logged message should provide all information to the administrator or
other "privileged" user who must support the system. The log message must
contain these details even if they are not returned to the user. Error logging
is controlled by the `getMessage()` method. Thus, this method must return a
form of the error that does not do translation.

As we improve error messages, we will add helpful text about how to resolve
issues. That text need not appear in the log.

When logs are fed into a log search system (such as Splunk or
ElasticSearch), administrators want a simple way to search for a given error.
Thus, the log message should include the error code.

Given this, the log format of an error, given by the `getMessage()` method,
is of the form:

```text
<code>: <key>=[<value>]...
```

For example:

```text
SQL-Validation-ColumnNotFound: line=[3], column=[4], name=[foo]
```

To ensure the log format is stable, the `values` map in each exception is a
`LinkedHashMap`: values appear in the log in the order in which they are added
to the map.

#### `getMessage()` and Tests

Since `getMessage()` must produce the log form of a message, we use
`toString()` to produce the default Druid format for the message. This format
is primarily for debugging and testing.

Messages are returned to the user via the REST response. In the proposed
code, we do not serialize `DruidException` directly. Instead, we perform a
conversion step from `DruidException` to the form needed for each REST call.
The `/sql` endpoint uses one format, other RPC messages use other formats. (The
multiple error formats is more of a "bug" than "feature", but we'll fix that
another time.)

The mechanism to produce the `/sql` REST response also handles the
translation. That mechanism requires Guice injection of the properties with the
name of the translation file.

Many Druid tests use the standard mechanism to verify error text: ask a
JUnit or Hamcrest matcher to check the return value from `getMessage()`. These
tests must change to use the log format. Or, they must change to use custom
code to check the value of `toString()` to check the default Druid error text.

Once we settle on a design, a future commit will standardize SQL test error
checking.

#### Third Party Errors

Druid uses Calcite to parse SQL. Druid has no control over Calcite messages.
In this case, we translate the Calcite exception to a `DruidException` with one
parameter, `message`, which holds the original Calcite text. Such text cannot
be translated using the mechanism proposed here, though the entire message can
be redacted.

In such cases, we could expand the error translator to include a function.
That function can perform regular expression matches to "peek inside" the third
party error, and reformat the message that way. The result is not elegant, but
does solve the problem.

#### Release note

This PR is a draft and is a proposal. If this design turns out to be the
final one, we'll update this section to include:

* The text of some error messages will change.
* The error code for `/sql` messages will change. The new IDs are unique.
* Description of how to customize error messages using an error message
catalog.

<hr>

This PR has:

- [X] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [X] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [X] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers opened a new pull request, #13884: Draft 2 of the Proposed Revised Error Mechanism

Reply via email to