paul-rogers commented on code in PR #13815:
URL: https://github.com/apache/druid/pull/13815#discussion_r1110173527
##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -191,7 +191,8 @@ public boolean hasNext()
      {
        final long thisTimeoutNanos = timeoutAtNanos - System.nanoTime();
        if (hasTimeout && thisTimeoutNanos < 0) {
-         throw new QueryTimeoutException("Sequence iterator timed out");
+         throw new QueryTimeoutException("Query did not complete within configured timeout period. " +
+                                         "You can increase query timeout or tune the performance of query");
Review Comment:
Suggestion:
> Query did not complete within the timeout period set by druid.some.config.
The action suggested here is OK, but it does not tell the user how to change
the config. Also, there could be other causes: network issues, machine is
overloaded, too many threads, and on and on.
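For concreteness, here is a rough sketch of a message that tells the user which knob to turn. Treat the setting name as an assumption on my part (the `timeout` query context parameter is what I would expect to reference, but verify before adopting this wording):

```java
// Sketch only: confirm the exact setting name before using this wording.
throw new QueryTimeoutException(
    "Query did not complete within the configured timeout. "
    + "Consider raising the 'timeout' query context parameter, "
    + "or check for other causes such as an overloaded or unreachable server."
);
```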
##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -206,7 +207,8 @@ public boolean hasNext()
          }
        }
        if (currentBatch == null) {
-         throw new QueryTimeoutException("Sequence iterator timed out waiting for data");
+         throw new QueryTimeoutException("Query did not complete within configured timeout period. " +
Review Comment:
Suggestion: if the error appears more than once, declare a constant and reference the constant. That also makes it easier for tests to track the message as it evolves.
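A rough fragment of what that might look like on `ParallelMergeCombiningSequence` (the constant name is a placeholder; the wording is whatever we settle on above):

```java
// Declared once; both hasNext() throw sites and any tests that assert on
// the message reference this constant instead of repeating the literal.
private static final String QUERY_TIMEOUT_MESSAGE =
    "Query did not complete within the configured timeout period. "
    + "You can increase the query timeout or tune the performance of the query.";
```

Both checks would then do `throw new QueryTimeoutException(QUERY_TIMEOUT_MESSAGE);`, and tests can compare against the same constant.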
##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
|`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
|`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
|`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
Review Comment:
"This metric gets reset" --> "This metric is reset"
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                 .execute();
-       retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+       retVal = numRows == 1
+                ? DataStoreMetadataUpdateResult.SUCCESS
+                : new DataStoreMetadataUpdateResult(
+                    true,
+                    true,
+                    "Failed to insert metadata for datasource [%s]",
Review Comment:
This error is not very actionable for newbies: it is helpful only to
developers already familiar with the area. If we can't insert metadata, then
we've probably got a fatal or data loss error, don't we?
Perhaps:
> Fatal error: Supervisor %s failed to update metadata for datasource %s in
table %s. Newly ingested segments are lost.
Or some such.
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1908,4 +1916,71 @@ public String getSequenceName()
}
}
+ public static class DataStoreMetadataUpdateResult
+ {
+ private final boolean failed;
+ private final boolean canRetry;
+ @Nullable private final String errorMsg;
Review Comment:
This class is a good idea, but we're jumping to error messages too soon. This class should hold the exception, not the error message. The exception class conveys information, and the exception has a stack trace. With only a message, developers have no information about what specific bit of code went wrong when debugging. Convert the exception to a string at the point the string is needed.
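A hedged sketch of the shape I mean; the names and exact constructor are illustrative, not the PR's actual API:

```java
import javax.annotation.Nullable;

// Illustrative only: hold the Throwable, and render a message only where one is needed.
public class DataStoreMetadataUpdateResult
{
  private final boolean failed;
  private final boolean canRetry;
  @Nullable
  private final Throwable error;

  public DataStoreMetadataUpdateResult(boolean failed, boolean canRetry, @Nullable Throwable error)
  {
    this.failed = failed;
    this.canRetry = canRetry;
    this.error = error;
  }

  public boolean isFailed()
  {
    return failed;
  }

  public boolean canRetry()
  {
    return canRetry;
  }

  // The exception class and stack trace stay available to callers;
  // the string is produced only at the point it is actually needed.
  @Nullable
  public String getErrorMsg()
  {
    return error == null ? null : error.toString();
  }
}
```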
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1679,11 +1685,13 @@ protected DataStoreMetadataUpdateResult dropSegmentsWithHandle(
        );
        if (numChangedSegments != segmentsToDrop.size()) {
-         log.warn("Failed to drop segments metadata update as numChangedSegments[%s] segmentsToDropSize[%s]",
-             numChangedSegments,
-             segmentsToDrop.size()
+         return new DataStoreMetadataUpdateResult(
+             true,
+             true,
+             "Failed to drop some segments. Only [%s] could be dropped out of [%s]. Trying again",
Review Comment:
Here, the value is a number. We don't need brackets or `%s`.
> Only %d out of %d were dropped.
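A hedged rewrite, assuming the three-argument constructor plus `StringUtils.format` pattern used elsewhere in this diff:

```java
// Counts are numbers, so %d without brackets reads naturally here.
return new DataStoreMetadataUpdateResult(
    true,
    true,
    StringUtils.format(
        "Failed to drop segments: only %d out of %d could be dropped; will try again.",
        numChangedSegments,
        segmentsToDrop.size()
    )
);
```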
##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
|`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
|`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
|`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
Review Comment:
What does this mean? "the current system timestamp of metrics emission." This part is clear: "the current system timestamp". It is not clear how the "system" (meaning OS) knows about metrics emission. I suspect we just mean `now - __time`. Correct? If so, perhaps reword.
> Druid computes the lag between the current system timestamp and the
`__time` value for each record. Druid emits that value, for each record, as
this metric. If the value...
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1572,12 +1566,12 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
      if (!startMetadataMatchesExisting) {
        // Not in the desired start state.
-       log.error(
-           "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s].",
+       return new DataStoreMetadataUpdateResult(true, false, StringUtils.format(
+           "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s]." +
+           " This usually happens if the input stream is changed within the same supervisor.",
Review Comment:
As a newbie to this area, the message is a bit of a mystery. Are we saying
that we're using optimistic locking and we just tried to do an update where we
expect version X, but found a newer version Y? Or, that the DB transaction
failed (perhaps due to locking conflicts) and we discarded the update?
Something else?
Also, how does an "input stream change" within "the same supervisor"? Are we
saying that the _contents_ of the data in Kafka (say) changed? Or, that the
offsets registered in Kafka changed? Or, did I run two supervisors against the
same topic?
We also don't say which state we're changing. The state of the supervisor
task? Which one?
Is this error a bad thing? Will I lose data? Or will the system continue on, and I can ignore the messages? Or, in fact, is this an assertion that, in a properly coded Druid, will never occur?
So, perhaps:
> Attempted to update state for supervisor %s in table %s from %s to %s, but
found the state was already at %s. Are you running two supervisors for the same
topic?
Don't take the above too literally: I'm making up much of the information.
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                 .execute();
-       retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+       retVal = numRows == 1
+                ? DataStoreMetadataUpdateResult.SUCCESS
+                : new DataStoreMetadataUpdateResult(
+                    true,
+                    true,
+                    "Failed to insert metadata for datasource [%s]",
Review Comment:
Nit: the `word[%s]` pattern is a bit awkward. It may have started from thread names: `datasource[%s]`. But, for error messages, a better approach is:
* If the `%s` value is a single symbol (no spaces), omit the brackets.
* If the `%s` value contains spaces, include brackets, but insert a space before the bracket.
Example: `datasource %s not found`, `The name [%s] is not valid`
In the first case, the rules for datasource names disallow spaces. In the second, the name is invalid, so it may well be "foo bar" (with spaces), and we need the brackets.
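Hedged examples of the two cases (the variable names are made up):

```java
// Datasource names disallow spaces, so no brackets are needed.
String notFound = StringUtils.format("datasource %s not found", dataSourceName);

// The name is being rejected as invalid, so it may contain spaces:
// keep the brackets, with a space before the opening bracket.
String invalid = StringUtils.format("The name [%s] is not valid", userSuppliedName);
```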
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]