capistrant commented on a change in pull request #10676:
URL: https://github.com/apache/druid/pull/10676#discussion_r565391843



##########
File path: 
indexing-service/src/main/java/org/apache/druid/indexing/common/IngestionStatsAndErrorsTaskReportData.java
##########
@@ -41,17 +41,22 @@
   @Nullable
   private String errorMsg;
 
+  @JsonProperty
+  private boolean segmentAvailabilityConfirmed;
+
   public IngestionStatsAndErrorsTaskReportData(
       @JsonProperty("ingestionState") IngestionState ingestionState,
       @JsonProperty("unparseableEvents") Map<String, Object> unparseableEvents,
       @JsonProperty("rowStats") Map<String, Object> rowStats,
-      @JsonProperty("errorMsg") @Nullable String errorMsg
+      @JsonProperty("errorMsg") @Nullable String errorMsg,
+      @JsonProperty("segmentAvailabilityConfirmed") boolean segmentAvailabilityConfirmed

Review comment:
       > > my company deploys a large multi-tenant cluster with a services 
layer for ingestion that our tenants use. these tenants don't just want to know 
when their task succeeds, they also want to know when data from batch ingest is 
available for querying. This solution allows us to prevent the ingestion 
services layer and/or individual tenants from banging on Druid APIs trying to 
see if their data is available after ingestion.
   > 
   > I understand this, but my question is more about what people expect when 
segment handoff fails. In streaming ingestion, a handoff failure causes task 
failure (this behavior seems arguable, but that's what it does now), and thus 
people's expectation is that they could see some data dropped after handoff 
failures until new tasks read the same data and publish the same segments 
again. However, since there is no realtime querying in batch ingestion, I don't 
think tasks should fail on handoff failures (which is what this PR does! 🙂), 
but then what will people's expectation be? Are they going to be just OK with 
handoff failures and wait indefinitely until historicals load new segments (the 
current behavior)? Do they want to know why the handoff failed? Do they want to 
know how long it took before the handoff failed? These questions are not clear 
to me.
   
   Good question. For my specific case, the service that end users interact 
with really wanted to be able to answer this question for the end user:
   * Is the data that I ingested in this job completely loaded for querying?
   
   For us, a simple yes/no will suffice. The cluster operators would have the 
goal of having 100% of jobs successfully hand off data before the timeout, but 
when that doesn't happen, our users simply want to know that they may need to 
wait longer. We are simply trying to be transparent and report the 
point-in-time status. The onus of finding out when the data is fully loaded, 
if the timeout expired before loading completed, would fall on a different 
solution (TBD).
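   To make that yes/no concrete, here's a rough sketch of how a services layer 
could turn the report flag into a user-facing answer. Only the field name 
`segmentAvailabilityConfirmed` comes from this PR; the class and method names 
here are hypothetical stand-ins, not actual Druid or client code:

```java
public class AvailabilityCheck {
    // Minimal stand-in for the slice of the task report a client would
    // deserialize; only the field this PR adds is modeled here.
    static final class ReportData {
        final boolean segmentAvailabilityConfirmed;
        ReportData(boolean confirmed) {
            this.segmentAvailabilityConfirmed = confirmed;
        }
    }

    // The yes/no answer for the end user: is the data fully loaded?
    static String answerForUser(ReportData report) {
        return report.segmentAvailabilityConfirmed
            ? "yes: data fully loaded for querying"
            : "not yet: segments may still be loading, you may need to wait longer";
    }

    public static void main(String[] args) {
        System.out.println(answerForUser(new ReportData(true)));
        System.out.println(answerForUser(new ReportData(false)));
    }
}
```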
   
   You're right, we intentionally did not fail these tasks, because Historical 
nodes loading the segments is detached from whether or not the data was 
written to deep storage and the metadata store (if that failed, the task 
should and likely would fail due to existing code paths). We don't want our 
end users thinking they need to re-run their jobs when this is much more 
likely to be an issue of the coordinator not having assigned servers to load 
the segments by the time the timeout expired.
   
   Why the handoff failed is something I, as an operator, am more interested 
in than a user would be (unless that user is also an operator). I think it 
would be very difficult to communicate in these reports, since the indexing 
task doesn't know much about what the rest of the cluster is doing.
   
   Knowing how long it took before the timeout could be found in the spec, but 
I guess it could be useful to add that value to the report as well if you 
think users would want a quick reference. Rather than reporting that static 
configured value, it could be cool to report the dynamic time actually waited 
for handoff. It would equal the static value when we hit the timeout, but as 
an operator I would enjoy seeing how long each successful job waited for 
handoff. What do you think about that?
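   A rough sketch of the dynamic value I mean (all names hypothetical, not the 
actual task code): record wall-clock time around the handoff wait and report 
the elapsed millis whether or not the timeout was hit:

```java
import java.util.function.BooleanSupplier;

public class HandoffWaitTimer {
    // Returns the milliseconds actually spent waiting, whether or not the
    // handoff completed before the timeout.
    static long waitForHandoff(long timeoutMillis, BooleanSupplier segmentsLoaded)
            throws InterruptedException {
        long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < timeoutMillis) {
            if (segmentsLoaded.getAsBoolean()) {
                break;  // handoff confirmed before the timeout
            }
            Thread.sleep(50);  // poll interval; real code would ask the coordinator
        }
        // Roughly equals timeoutMillis when we gave up, or the actual wait
        // when handoff succeeded.
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // Segments "load" immediately here, so the reported wait is well
        // under the timeout.
        System.out.println("waitedMillis=" + waitForHandoff(1000, () -> true));
    }
}
```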




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


