paul-rogers commented on code in PR #13815:
URL: https://github.com/apache/druid/pull/13815#discussion_r1110173527
##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -191,7 +191,8 @@ public boolean hasNext()
      {
        final long thisTimeoutNanos = timeoutAtNanos - System.nanoTime();
        if (hasTimeout && thisTimeoutNanos < 0) {
-         throw new QueryTimeoutException("Sequence iterator timed out");
+         throw new QueryTimeoutException("Query did not complete within configured timeout period. " +
+                                         "You can increase query timeout or tune the performance of query");
Review Comment:
Suggestion:
> Query did not complete within the timeout period set by druid.some.config.
The action suggested here is OK, but it does not tell the user how to change
the config. Also, there could be other causes: network issues, machine is
overloaded, too many threads, and on and on.
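For concreteness, here is a rough sketch of a message that tells the user which knob to turn. Treat the setting name as an assumption on my part (the `timeout` query context parameter is what I would expect to reference, but verify before adopting this wording):

```java
// Sketch only: confirm the exact setting name before using this wording.
throw new QueryTimeoutException(
    "Query did not complete within the configured timeout. "
    + "Consider raising the 'timeout' query context parameter, "
    + "or check for other causes such as an overloaded or unreachable server."
);
```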
##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -206,7 +207,8 @@ public boolean hasNext()
          }
        }
        if (currentBatch == null) {
-         throw new QueryTimeoutException("Sequence iterator timed out waiting for data");
+         throw new QueryTimeoutException("Query did not complete within configured timeout period. " +
Review Comment:
Suggestion: if the error appears more than once, declare a constant and reference the constant. That also makes it easier for tests to track the message as it evolves.
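A rough fragment of what that might look like on `ParallelMergeCombiningSequence` (the constant name is a placeholder; the wording is whatever we settle on above):

```java
// Declared once; both hasNext() throw sites and any tests that assert on
// the message reference this constant instead of repeating the literal.
private static final String QUERY_TIMEOUT_MESSAGE =
    "Query did not complete within the configured timeout period. "
    + "You can increase the query timeout or tune the performance of the query.";
```

Both checks would then do `throw new QueryTimeoutException(QUERY_TIMEOUT_MESSAGE);`, and tests can compare against the same constant.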
##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
|`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
|`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
|`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
Review Comment:
"This metric gets reset" --> "This metric is reset"
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                 .execute();
-       retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+       retVal = numRows == 1
+                ? DataStoreMetadataUpdateResult.SUCCESS
+                : new DataStoreMetadataUpdateResult(
+                    true,
+                    true,
+                    "Failed to insert metadata for datasource [%s]",
Review Comment:
This error is not very actionable for newbies: it is helpful only to
developers already familiar with the area. If we can't insert metadata, then
we've probably got a fatal or data loss error, don't we?
Perhaps:
> Fatal error: Supervisor %s failed to update metadata for datasource %s in
table %s. Newly ingested segments are lost.
Or some such.
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1908,4 +1916,71 @@ public String getSequenceName()
}
}
+ public static class DataStoreMetadataUpdateResult
+ {
+ private final boolean failed;
+ private final boolean canRetry;
+ @Nullable private final String errorMsg;
Review Comment:
This class is a good idea, but we're jumping to error messages too soon. This class should hold the exception, not the error message. The exception class conveys information, and the exception has a stack trace. With only a message, developers have no information about what specific bit of code went wrong when debugging. Convert the exception to a string at the point the string is needed.
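A hedged sketch of the shape I mean; the names and exact constructor are illustrative, not the PR's actual API:

```java
import javax.annotation.Nullable;

// Illustrative only: hold the Throwable, and render a message only where one is needed.
public class DataStoreMetadataUpdateResult
{
  private final boolean failed;
  private final boolean canRetry;
  @Nullable
  private final Throwable error;

  public DataStoreMetadataUpdateResult(boolean failed, boolean canRetry, @Nullable Throwable error)
  {
    this.failed = failed;
    this.canRetry = canRetry;
    this.error = error;
  }

  public boolean isFailed()
  {
    return failed;
  }

  public boolean canRetry()
  {
    return canRetry;
  }

  // The exception class and stack trace stay available to callers;
  // the string is produced only at the point it is actually needed.
  @Nullable
  public String getErrorMsg()
  {
    return error == null ? null : error.toString();
  }
}
```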
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1679,11 +1685,13 @@ protected DataStoreMetadataUpdateResult dropSegmentsWithHandle(
        );
        if (numChangedSegments != segmentsToDrop.size()) {
-         log.warn("Failed to drop segments metadata update as numChangedSegments[%s] segmentsToDropSize[%s]",
-             numChangedSegments,
-             segmentsToDrop.size()
+         return new DataStoreMetadataUpdateResult(
+             true,
+             true,
+             "Failed to drop some segments. Only [%s] could be dropped out of [%s]. Trying again",
Review Comment:
Here, the value is a number. We don't need brackets or `%s`.
> Only %d out of %d were dropped.
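A hedged rewrite, assuming the three-argument constructor plus `StringUtils.format` pattern used elsewhere in this diff:

```java
// Counts are numbers, so %d without brackets reads naturally here.
return new DataStoreMetadataUpdateResult(
    true,
    true,
    StringUtils.format(
        "Failed to drop segments: only %d out of %d could be dropped; will try again.",
        numChangedSegments,
        segmentsToDrop.size()
    )
);
```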
##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
|`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
|`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
|`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
Review Comment:
What does this mean? "the current system timestamp of metrics emission." This part is clear: "the current system timestamp". It is not clear how the "system" (meaning OS) knows about metrics emission. I suspect we just mean `now - __time`. Correct? If so, perhaps reword.
> Druid computes the lag between the current system timestamp and the
`__time` value for each record. Druid emits that value, for each record, as
this metric. If the value...
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1572,12 +1566,12 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
      if (!startMetadataMatchesExisting) {
        // Not in the desired start state.
-       log.error(
-           "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s].",
+       return new DataStoreMetadataUpdateResult(true, false, StringUtils.format(
+           "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s]." +
+           " This usually happens if the input stream is changed within the same supervisor.",
Review Comment:
As a newbie to this area, the message is a bit of a mystery. Are we saying
that we're using optimistic locking and we just tried to do an update where we
expect version X, but found a newer version Y? Or, that the DB transaction
failed (perhaps due to locking conflicts) and we discarded the update?
Something else?
Also, how does an "input stream change" within "the same supervisor"? Are we
saying that the _contents_ of the data in Kafka (say) changed? Or, that the
offsets registered in Kafka changed? Or, did I run two supervisors against the
same topic?
We also don't say which state we're changing. The state of the supervisor
task? Which one?
Is this error a bad thing? Will I lose data? Or will the system continue on, and I can ignore the messages? Or, in fact, is this an assertion that, in a properly coded Druid, will never occur?
So, perhaps:
> Attempted to update state for supervisor %s in table %s from %s to %s, but
found the state was already at %s. Are you running two supervisors for the same
topic?
Don't take the above too literally: I'm making up much of the information.
##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                 .execute();
-       retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+       retVal = numRows == 1
+                ? DataStoreMetadataUpdateResult.SUCCESS
+                : new DataStoreMetadataUpdateResult(
+                    true,
+                    true,
+                    "Failed to insert metadata for datasource [%s]",
Review Comment:
Nit: the `word[%s]` pattern is a bit awkward. It may have started from thread names: `datasource[%s]`. But, for error messages, a better approach is:
* If the `%s` value is a single symbol (no spaces), omit the brackets.
* If the `%s` value contains spaces, include brackets, but insert a space before the bracket.
Example: `datasource %s not found`, `The name [%s] is not valid`
In the first case, the rules for datasource names disallow spaces. In the second, the name is invalid, so it may well be "foo bar" (with spaces), and we need the brackets.
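Hedged examples of the two cases (the variable names are made up):

```java
// Datasource names disallow spaces, so no brackets are needed.
String notFound = StringUtils.format("datasource %s not found", dataSourceName);

// The name is being rejected as invalid, so it may contain spaces:
// keep the brackets, with a space before the opening bracket.
String invalid = StringUtils.format("The name [%s] is not valid", userSuppliedName);
```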
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]