[GitHub] [druid] paul-rogers commented on a diff in pull request #13798: Introduce a DruidException for better user error messages

via GitHub Fri, 17 Feb 2023 17:57:04 -0800


paul-rogers commented on code in PR #13798:
URL: https://github.com/apache/druid/pull/13798#discussion_r1110438887



##########
processing/src/main/java/org/apache/druid/error/DruidException.java:
##########
@@ -0,0 +1,359 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.error;
+
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.logger.Logger;
+
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.Map;
+
+/**
+ * Represents an error condition exposed to the user and/or operator of Druid.
+ * Not needed for purely internal exceptions thrown and caught within Druid 
itself.
+ * There are categories of error that determine the general form of corrective
+ * action, and also determine HTTP (or other API) status codes.
+ * <p>
+ * Druid exceptions can contain context. Use the context for details, such as
+ * file names, query context variables, symbols, etc. This allows the error
+ * message itself to be simple. Context allows consumers to filter out various
+ * bits of information that a site does not wish to expose to the user, while
+ * still logging the full details. Typical usage:
+ * <pre><code>
+ * if (something_is_wrong) {
+ *   throw DruidException.user("File not found")
+ *       .context("File name", theFile.getName())
+ *       .context("Directory", theFile.getParent())
+ *       .build();
+ * }
+ * </code></pre>
+ * <p>
+ * Exceptions are immutable. In many cases, an error is thrown low in the code,
+ * bit context is known at a higher level. In this case, the higher code should
+ * catch the exception, convert back to a builder, add context, and throw the
+ * new exception. The original call stack is maintained. Example:
+ * <pre><code>
+ * catch (DruidExceptin e) {
+ *   throw e.toBuilder().
+ *       .context("File name", theFile.getName())
+ *       .context("Directory", theFile.getParent())
+ *       .build();
+ * }
+ * </code></pre>
+ */
+public class DruidException extends RuntimeException
+{
+  public enum ErrorType
+  {
+    /**
+     * General case of an error due to something the user asked to do in an 
REST
+     * request. Translates to an HTTP status 400 (BAD_REQUET) for a REST call
+     * (or the equivalent for other APIs.)
+     */
+    USER,
+
+    /**
+     * Special case of a user error where a resource is not found and we wish
+     * to return a 404 (NOT_FOUND) HTTP status (or the equivalent for other
+     * APIs.)
+     */
+    NOT_FOUND,
+
+    /**
+     * Error due to a problem beyond the user's control, such as an assertion
+     * failed, unsupported operation, etc. These indicate problems with the 
software
+     * where the fix is either a workaround or a bug fix. Such error should 
only
+     * be raised for "should never occur" type situations.
+     */
+    SYSTEM,
+
+    /**
+     * Error for a resource limit: memory, CPU, slots or so on. The workaround 
is
+     * generally to try later, get more resources, reduce load or otherwise 
resolve
+     * the resource pressure issue.
+     */
+    RESOURCE,
+
+    /**
+     * Similar to RESOURCE, except indicates a timeout, perhaps due to load, 
due
+     * to an external system being unavailable, etc.
+     */
+    TIMEOUT,
+
+    /**
+     * Error in configuration. Indicates that the administrator made a mistake 
during
+     * configuration or setup. The solution is for the administrator (not the 
end user)
+     * to resolve the issue.
+     */
+    CONFIG,
+
+    /**
+     * Indicates a network error of some kind: intra-Druid, client-to-Druid,
+     * Druid-to-external system, etc. Generally the end user cannot fix these 
errors:
+     * it requires a DevOps person to resolve.
+     */
+    NETWORK
+  };

Review Comment:
   As it turns out, the topics above are _supposed_ to represent personas. The 
`USER` category is something that (we believe) the user caused by virtue of 
something just requested. Most likely the user has to take action.
   
   A `CONFIG` error is targeted at the sys admin: they configured something 
wrong and they have to change a config (or align an external system with the 
config) to fix the error.
   
   A `SYSTEM` error is targeted at us: something is wrong in the code. The user 
will get the error, but all they can do is turn around, forward the message to 
us and ask us to fix whatever it is that went wrong.
   
   A `NETWORK` error _could_ be the user (they gave us a bogus HTTP input 
source URL), could be an admin (they didn't open a port required by Druid) or a 
generic error (someone with a backhoe just cut the fiber optic link.) If we can 
figure out which if these cases is occurring, we could better target the error. 
Do you know of a way to know this? Are our network errors clearly delimited 
split out in the code based on who should take action?
   
   For others it gets even muddier. A `RESOURCE` error can be seen as a user 
error ("hey! Don't query so many rows"), a load error ("there just isn't enough 
load to go around: go away and come back later."), a config issue ("1 GB is a 
bit low for this workload"), etc. Any suggestions for how we'd know, in the 
code, to whom to attribute the cause of the lack of resources?
   
   Given all this, the categories were chosen as a compromise between the 
_probable_ person who can fix things, and what we _actually_ know in the code.
   
   Part of the reason for this PR is to see if we can improve the categories. 
Persona is one: `USER`, `SECURITY_ADMIN`, `CLUSTER_ADMIN`, 
`YOUR_NETWORK_PROVIDER`, `THE_BOSS_WHO_APPROVES_RESOURCES`, etc. I'm being 
silly, but what are the right personas? How would a person in the code know 
which persona is the target?
   
   Any third solution that captures both ideas (persona and fits what we know 
in the code)?



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -219,12 +226,14 @@ public Aggregation toDruidAggregation(
         );
         break;
       default:
-        throw new IAE(
-            "aggregation[%s], Invalid number of arguments[%,d] to [%s] 
operator",
-            aggregatorName,
-            args.size(),
-            aggregatorType.name()
-        );
+        throw DruidException.user(
+              "aggregation [%s], Invalid number of arguments %,d to [%s] 
operator",
+              aggregatorName,
+              args.size(),
+              aggregatorType.name()
+             )
+            .context(DruidException.ERROR_CODE, 
QueryException.PLAN_VALIDATION_FAILED_ERROR_CODE)

Review Comment:
   This is a hard one! Our error codes, at present, are somewhat random. We 
have multiple codes for some items, no codes for others. (E.g. 
`QUERY_UNSUPPORTED_ERROR_CODE` `SQL_QUERY_UNSUPPORTED_ERROR_CODE`. But nothing 
at all for "not found".). The current system didn't seem like something that 
can build upon.
   
   That said, it probably does make sense to add additional methods to 
`DruidException` for the obvious cases: `validationError()`, `unsupported()`, 
etc. for the planner errors.
   
   Also, after playing around with the code, it seems to make more sense for 
the error code to be a top-level exception attribute, not a context property.



##########
server/src/main/java/org/apache/druid/server/security/PreResponseAuthorizationCheckFilter.java:
##########
@@ -90,7 +89,7 @@ public void doFilter(ServletRequest servletRequest, 
ServletResponse servletRespo
       // (e.g. OverlordServletProxy), so this is not implemented for now.
       handleAuthorizationCheckError(
           StringUtils.format(
-              "Request did not have an authorization check performed, original 
response status[%s].",
+              "Request did not have an authorization check performed, original 
response status: %s",

Review Comment:
   See comment in the main discussion. This is a convention in _Druid_. It is 
not a convention elsewhere. We're mainly focused on end users here. Will your 
typical end user expect values in brackets? Or, should we use this format just 
for internal messages and log messages? Maybe omit for `USER` messages, include 
for all others. Something else?



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -219,12 +226,14 @@ public Aggregation toDruidAggregation(
         );
         break;
       default:
-        throw new IAE(
-            "aggregation[%s], Invalid number of arguments[%,d] to [%s] 
operator",
-            aggregatorName,
-            args.size(),
-            aggregatorType.name()
-        );
+        throw DruidException.user(
+              "aggregation [%s], Invalid number of arguments %,d to [%s] 
operator",

Review Comment:
   Part of the goal of this project is to translate messages in "developerese" 
into English. Rewritten to:
   
   `"%s expects 1 or 2 arguments but found %d"`
   
   The notion is that the SQL user knows the operator which they wrote in the 
SQL. That operator takes 1 or 2 arguments. The user provided more, which is not 
valid. The internal aggregation name is not known to the SQL user, so the 
revised code omits that name.
   
   Example:
   
   > LATEST expects 1 or 2 arguments but found 3
   
   To touch on the raging brackets debate, would the above be clearer to an SQL 
user if it was formatted as:
   
   > [LATEST] expects 1 or 2 arguments but found [3]
   



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/EarliestLatestAnySqlAggregator.java:
##########
@@ -90,7 +90,9 @@ AggregatorFactory createAggregatorFactory(String name, String 
fieldName, String
           case COMPLEX:
             return new StringFirstAggregatorFactory(name, fieldName, 
timeColumn, maxStringBytes);
           default:
-            throw new UnsupportedSQLQueryException("EARLIEST aggregator is not 
supported for '%s' type", type);
+            throw DruidException.user("EARLIEST aggregator is not supported 
for '%s' type", type)

Review Comment:
   See discussion in main conversation. Once we decide on a style, I'll update 
all the revised error messages.



##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteNestedDataQueryTest.java:
##########
@@ -2692,9 +2692,12 @@ public void testGroupByInvalidPath()
         + "SUM(cnt) "
         + "FROM druid.nested GROUP BY 1",
         (expected) -> {
-          expected.expect(UnsupportedSQLQueryException.class);
+          expected.expect(DruidException.class);
           expected.expectMessage(
-              "Cannot use [JSON_VALUE_VARCHAR]: [Bad format, '.array.[1]' is 
not a valid JSONPath path: must start with '$']");
+              "Cannot use JSON_VALUE_VARCHAR\n" +
+              "Error Code: Unsupported operation\n" +
+              "Cause: Bad format, '.array.[1]' is not a valid JSONPath path: 
must start with '$'"

Review Comment:
   Reverted the change. We can focus our efforts where we have poor errors. 
This one seems fine as is.



##########
sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidPlanner.java:
##########
@@ -164,7 +167,10 @@ private SqlStatementHandler createHandler(final SqlNode 
node) throws ValidationE
     if (query.isA(SqlKind.QUERY)) {
       return new QueryHandler.SelectHandler(handlerContext, query, explain);
     }
-    throw new ValidationException(StringUtils.format("Cannot execute [%s].", 
node.getKind()));
+    throw DruidException.user("Unsupported SQL statement")
+        .context("Statement kind", node.getKind())
+        .context(DruidException.ERROR_CODE, 
QueryException.PLAN_VALIDATION_FAILED_ERROR_CODE)
+        .build();

Review Comment:
   This pass included some experiments to see what worked and what didn't. I 
agree that, in this case, using the context didn't work. Revised to:
   
   > Unsupported SQL statement %s



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on a diff in pull request #13798: Introduce a DruidException for better user error messages

Reply via email to