capistrant commented on a change in pull request #10676:
URL: https://github.com/apache/druid/pull/10676#discussion_r565391843
##########
File path:
indexing-service/src/main/java/org/apache/druid/indexing/common/IngestionStatsAndErrorsTaskReportData.java
##########
@@ -41,17 +41,22 @@
@Nullable
private String errorMsg;
+ @JsonProperty
+ private boolean segmentAvailabilityConfirmed;
+
public IngestionStatsAndErrorsTaskReportData(
@JsonProperty("ingestionState") IngestionState ingestionState,
@JsonProperty("unparseableEvents") Map<String, Object> unparseableEvents,
@JsonProperty("rowStats") Map<String, Object> rowStats,
- @JsonProperty("errorMsg") @Nullable String errorMsg
+ @JsonProperty("errorMsg") @Nullable String errorMsg,
+ @JsonProperty("segmentAvailabilityConfirmed") boolean segmentAvailabilityConfirmed
Review comment:
> > my company deploys a large multi-tenant cluster with a services
layer for ingestion that our tenants use. These tenants don't just want to know
when their task succeeds; they also want to know when data from a batch ingest
is available for querying. This solution allows us to prevent the ingestion
services layer and/or individual tenants from banging on Druid APIs trying to
see whether their data is available after ingestion.
>
> I understand this, but my question is more about what people expect when
segment handoff fails. In streaming ingestion, a handoff failure causes task
failure (this behavior seems arguable, but that's what it does now), so
people expect that they could see some data dropped after handoff
failures until new tasks read the same data and publish the same segments
again. However, since there is no realtime querying in batch ingestion, I don't
think tasks should fail on handoff failures (which is what this PR does! 🙂),
but then what will people expect? Are they going to be OK with
handoff failures and wait indefinitely until Historicals load the new segments
(the current behavior)? Do they want to know why the handoff failed? Do they
want to know how long it took before the handoff failed? These questions are
not clear to me.
Good question. For my specific case, the service that end users interact with
really wanted to be able to answer this question for the end user:
* Is the data that I ingested in this job completely loaded for querying?
For us a simple yes/no will suffice. The cluster operators would have the
goal of having 100% of jobs successfully hand off data before the timeout, but
when that doesn't happen our users simply want to know that they may need to
wait longer. We are just trying to be transparent and report the point-in-time
status. If the timeout expires before the data finishes loading, the onus of
finding out when it is fully loaded would fall on a different solution (TBD).
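To make the yes/no concrete, here is a minimal sketch of how our services layer might map the reported flag to a user-facing answer. The class, method, and state names are illustrative only, not Druid's actual report schema:

```java
public class AvailabilityStatus
{
  /**
   * Hypothetical mapping from a task report's state and the new
   * segmentAvailabilityConfirmed flag to a simple answer for the end user.
   */
  public static String userFacingStatus(String ingestionState, boolean segmentAvailabilityConfirmed)
  {
    if (!"COMPLETED".equals(ingestionState)) {
      // Task has not finished ingesting yet.
      return "INGESTING";
    }
    // Task succeeded; the flag tells us whether handoff was confirmed in time.
    return segmentAvailabilityConfirmed
           ? "DATA FULLY QUERYABLE"
           : "DATA PUBLISHED; LOADING MAY STILL BE IN PROGRESS";
  }

  public static void main(String[] args)
  {
    System.out.println(userFacingStatus("COMPLETED", true));
    System.out.println(userFacingStatus("COMPLETED", false));
  }
}
```

The key point is the third branch: a completed task with an unconfirmed handoff is not a failure, just "check back later."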
You're right, we intentionally did not fail these tasks, because whether
Historical nodes have loaded the segments is detached from whether the data was
written to deep storage and the metadata store (if that write failed, the task
should, and likely would, fail via existing code paths). We don't want our end
users thinking they need to re-run their jobs when this is much more likely to
be a case of the Coordinator not having assigned servers to load the segments
before the timeout expired.
Why the handoff failed is something I, as an operator, am more interested in
than a typical user would be (unless that user is also an operator). I think it
would be very difficult to communicate in these reports, since the indexing
task doesn't know much about what the rest of the cluster is doing.
How long a task waits before timing out can be found in the spec, but I guess
it could be useful to add that value to the report as well if you think users
would want a quick reference. Rather than reporting that static value, though,
it could be cool to report the dynamic time actually waited for handoff. It
would equal the static value only when we hit the timeout, and as an operator I
would enjoy seeing how long each successful job waited for handoff. What do you
think about that?
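Sketching the dynamic-wait idea, here is roughly what I mean (hypothetical helper, not this PR's code): poll until handoff is confirmed or the timeout fires, and record the actual elapsed time either way so the report can carry the measured value instead of the static timeout from the spec.

```java
import java.util.function.BooleanSupplier;

public class HandoffWait
{
  /** Result of waiting: whether handoff was confirmed and how long we waited (ms). */
  public static final class Result
  {
    public final boolean confirmed;
    public final long waitedMillis;

    Result(boolean confirmed, long waitedMillis)
    {
      this.confirmed = confirmed;
      this.waitedMillis = waitedMillis;
    }
  }

  /**
   * Poll until handoffConfirmed returns true or timeoutMillis elapses; always
   * report the actual time waited rather than the configured timeout.
   */
  public static Result waitForHandoff(BooleanSupplier handoffConfirmed, long timeoutMillis, long pollMillis)
      throws InterruptedException
  {
    final long start = System.nanoTime();
    while (true) {
      long waited = (System.nanoTime() - start) / 1_000_000L;
      if (handoffConfirmed.getAsBoolean()) {
        return new Result(true, waited);
      }
      if (waited >= timeoutMillis) {
        return new Result(false, waited);
      }
      Thread.sleep(pollMillis);
    }
  }

  public static void main(String[] args) throws InterruptedException
  {
    // Handoff confirmed immediately: dynamic wait is near zero, not the timeout.
    Result r = waitForHandoff(() -> true, 1_000L, 10L);
    System.out.println(r.confirmed + " " + r.waitedMillis);
  }
}
```

For successful jobs, `waitedMillis` is the interesting number; on timeout it converges to the static value anyway.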
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]