[GitHub] [druid] capistrant opened a new pull request #10676: Allow client to configure batch ingestion task to wait to complete until segments are confirmed to be available by other

GitBox Mon, 14 Dec 2020 08:05:31 -0800


capistrant opened a new pull request #10676:
URL: https://github.com/apache/druid/pull/10676


### Description

#### High Level Description

Add a configuration option to `tuningConfig` that lets the end user tell the
Indexing Service to wait for segments to become available for query before
completing the indexing task. The option is a timeout in milliseconds, which
keeps the task from waiting forever. If the timeout expires, the task still
succeeds, but the task reports will indicate that Druid could not confirm that
the segments became available for query.

This new configuration stems from my experience operating a production
cluster with many tenants who often have the same complaint: "My indexing job
is complete but the latest data is not available right when the job finishes".
This PR addresses that complaint by letting the client set a reasonable
timeout. After the job completes, they can parse the ingestion report and see
whether their segments became available. More often than not, with a
reasonable timeout, their segments will be available right when the job
completes.

#### Implementation

Much of the needed code already exists for realtime handoffs. I extracted that
code out of the realtime packages into its own Java package, so that it is less
confusing why non-realtime tasks are using it.
`org.apache.druid.segment.handoff` is a new package in the `druid-server`
module.

`AbstractBatchIndexTask` gets a new method,
`waitForSegmentAvailability(TaskToolbox toolbox, ExecutorService exec,
List<DataSegment> segmentsToWaitFor, long waitTimeout)`, that handles the
waiting. Batch indexing implementations call this method at the end of their
ingestion task code if the client's `tuningConfig` has a non-zero wait time
for segment availability. The default is not to wait.
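
To illustrate the waiting behavior described above, here is a minimal sketch and not the code in this PR. It assumes segments are identified by plain strings and introduces a hypothetical `HandoffWatcher` interface as a stand-in for whatever the toolbox provides to signal availability. The handling of a non-positive timeout is also an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the implementation from this PR.
// Segment ids are plain strings and HandoffWatcher is an invented stand-in.
class SegmentAvailabilityWaitSketch
{
  /** Hypothetical: runs the callback on exec once the segment is queryable. */
  interface HandoffWatcher
  {
    void onAvailable(String segmentId, ExecutorService exec, Runnable callback);
  }

  /**
   * Returns true only if every segment was confirmed available before the
   * timeout. A timeout of zero or less is treated as "do not wait" (assumption).
   */
  static boolean waitForSegmentAvailability(
      HandoffWatcher watcher,
      ExecutorService exec,
      List<String> segmentsToWaitFor,
      long waitTimeoutMillis
  ) throws InterruptedException
  {
    if (waitTimeoutMillis <= 0) {
      return false;
    }
    // One count per segment; each availability callback counts down once.
    CountDownLatch latch = new CountDownLatch(segmentsToWaitFor.size());
    for (String segmentId : segmentsToWaitFor) {
      watcher.onAvailable(segmentId, exec, latch::countDown);
    }
    // await returns false if the timeout elapses before the count reaches zero.
    return latch.await(waitTimeoutMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException
  {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    try {
      HandoffWatcher immediate = (id, e, cb) -> e.execute(cb); // always available
      HandoffWatcher never = (id, e, cb) -> { };               // never available
      System.out.println(
          waitForSegmentAvailability(immediate, exec, Arrays.asList("seg-1", "seg-2"), 1000L));
      System.out.println(
          waitForSegmentAvailability(never, exec, Collections.singletonList("seg-1"), 50L));
    }
    finally {
      exec.shutdownNow();
    }
  }
}
```

In this sketch the first call reports `true` and the second reports `false` after the timeout. In both cases the caller is free to finish the task successfully and only record the boolean.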

A new key:value pair is added to the IngestionStatsAndErrorsTaskReport:
`segmentAvailabilityConfirmed`. This is a boolean that indicates whether the
job was able to confirm query availability of the new segments before
finishing. The parallel index task supervisor did not previously have this
report, so this PR adds the report with the needed availability key:value pair
so that simple native, parallel native, and Hadoop native tasks can all
implement this availability wait.
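
As a rough illustration of how a client might check the new flag after a task finishes, here is a small sketch. It models the report as nested maps, as a client might hold it after JSON deserialization. Only the `segmentAvailabilityConfirmed` key comes from this description; the `ingestionStatsAndErrors` and `payload` nesting and the helper names are assumptions made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side sketch. The nesting under "ingestionStatsAndErrors"
// and "payload" is an assumption; only the flag name comes from this PR.
class AvailabilityReportSketch
{
  /** Returns true only if the report explicitly confirms segment availability. */
  static boolean segmentAvailabilityConfirmed(Map<String, Object> reports)
  {
    Object report = reports.get("ingestionStatsAndErrors");
    if (!(report instanceof Map)) {
      return false;
    }
    Object payload = ((Map<?, ?>) report).get("payload");
    if (!(payload instanceof Map)) {
      return false;
    }
    // A missing or non-boolean value is treated as "not confirmed".
    return Boolean.TRUE.equals(((Map<?, ?>) payload).get("segmentAvailabilityConfirmed"));
  }

  /** Builds a tiny example report with the given flag value. */
  static Map<String, Object> exampleReport(boolean confirmed)
  {
    Map<String, Object> payload = new HashMap<>();
    payload.put("segmentAvailabilityConfirmed", confirmed);
    Map<String, Object> report = new HashMap<>();
    report.put("payload", payload);
    return Collections.singletonMap("ingestionStatsAndErrors", report);
  }

  public static void main(String[] args)
  {
    System.out.println(segmentAvailabilityConfirmed(exampleReport(true)));
    System.out.println(segmentAvailabilityConfirmed(exampleReport(false)));
    System.out.println(segmentAvailabilityConfirmed(Collections.emptyMap()));
  }
}
```

Treating a missing key as "not confirmed" keeps the check safe against reports produced by tasks that do not write the flag.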

#### Alternatives

The datasource loadstatus API
(https://github.com/apache/druid/releases#19-datasource-loadstatus) became
available in Druid 0.20.0. However, I worry about giving ingestion clients the
green light to hit this API endpoint, because the calls can be expensive
depending on the questions asked.

<hr>

This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

<hr>

##### Key changed/added classes in this PR
* AbstractBatchIndexTask
* HadoopIndexTask
* IndexTask
* ParallelIndexSupervisorTask
* HadoopTuningConfig
* ParallelIndexTuningConfig
* IngestionStatsAndErrorsTaskReportData
* AbstractITBatchIndexTest
* ITHadoopIndexTest
* ITBestEffortRollupParallelIndexTest
* ITIndexerTest
