capistrant opened a new pull request #10622: URL: https://github.com/apache/druid/pull/10622
### Description

# Background

Some batch ingestion jobs dynamically identify the intervals being ingested, as well as the sharding within those intervals. This dynamic discovery is very user friendly. However, we have found at my company that on our multi-tenant cluster, where many users submit ingestion jobs through a managed service that creates and submits ingestion specs, users can mistakenly start jobs that index far more than they (or we) would like in one job. For instance, a user may start a Hadoop batch job that indexes a year of raw source data at hourly granularity. This generates up to 365 * 24 = 8,760 segment intervals, each of which may contain further shards. To combat this, we have decided to limit the number of segment intervals that a single HadoopIndexTask or IndexTask can create.
We also limit the aggregate number of shards across the whole ingestion job when possible. Doing so has allowed us to improve quality of service for our many tenants. We have now decided to explore upstreaming a similar implementation of our tooling. Our plan is to open this PR and gauge community interest in such a feature; I think others who run multi-tenant clusters could benefit from it if merged.

# Description of feature

These new configs apply only to IndexTask (non-parallel) and HadoopIndexTask. We have not fully explored an implementation for ParallelIndexTask; it appears difficult, and perhaps impossible, to cleanly identify when and how to stop its tasks once they hit the limits.

The tuning config for the applicable tasks adds two new configurations:

* `maxSegmentIntervalsPermitted`: the maximum number of segment intervals that a single job can identify dynamically for ingestion
* `maxAggregateSegmentIntervalShardsPermitted`: the maximum aggregate number of shards, across all intervals, that a job can create when sharding is discovered dynamically before ingestion

It is important to note that these limits are applied only when the information is obtained at runtime by the indexing job. For segment intervals, we enforce the limit only if the spec has `null` intervals. For aggregate sharding, we enforce the limit only if a determine-partitions phase runs, scanning the data to determine bucket counts for each interval. The assumption is that a user who supplies intervals and sharding up front is well aware of the scope of their ingest, and we should not interfere with that.

<hr>

This PR has:
- [ ] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.

<hr>

##### Key changed/added classes in this PR
* `HadoopTuningConfig`
* `HadoopIndexTask`
* `IndexTask`
* `TuningConfig`
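To make the two new tuning configurations concrete, here is a hedged sketch of how they might appear in a native batch ingestion spec's `tuningConfig`. The field placement follows the existing tuning config of each task type, and the numeric values are purely illustrative, not recommended defaults:

```json
{
  "type": "index",
  "spec": {
    "tuningConfig": {
      "type": "index",
      "maxSegmentIntervalsPermitted": 1000,
      "maxAggregateSegmentIntervalShardsPermitted": 5000
    }
  }
}
```

With values like these, a job whose spec leaves `intervals` as `null` would fail fast once it dynamically discovers more than 1,000 segment intervals, and a job running a determine-partitions phase would fail once the discovered bucket count summed across all intervals exceeds 5,000.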
