maytasm opened a new pull request #12062: URL: https://github.com/apache/druid/pull/12062
Support overlapping segment intervals in auto compaction

### Description

This PR fixes two problems that occur when Druid compacts overlapping segment intervals via auto compaction.

Imagine we have a segment with interval 2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z (MONTH segmentGranularity) and another segment with interval 2016-06-27T00:00:00.000Z/2016-07-04T00:00:00.000Z (WEEK segmentGranularity).

- The first problem is that auto compaction's `CompactionSegmentIterator` only returns segments from a single time chunk bucket. For example, NewestSegmentFirstIterator would return the interval 2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z and submit a compaction task with that interval. However, the segments returned from the iterator would only contain the MONTH segment, and hence the sha256OfSortedSegmentIds calculated by auto compaction only covers the MONTH segment (2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z). This causes the compaction task to fail when it starts running: the task gets all segments marked as used in the interval, which would be both the WEEK segment and the MONTH segment, then computes the sha256 and compares it with the sha256 in the compaction spec. The two values differ because the sha256 in the compaction spec only covers the MONTH segment. This issue is fixed by removing sha256OfSortedSegmentIds from the compaction task spec created by auto compaction. sha256OfSortedSegmentIds was added in https://github.com/apache/druid/pull/8571 to enforce a limit on the number of segments in one compaction task. However, this is no longer necessary, as a compaction task can use a parallel ingestion task.
- The second issue arises when segmentGranularity is not set in the auto compaction config. When it is not set, the compaction task determines the segmentGranularity from the segments marked as used in the compaction task interval.
  In the example above, the union interval of the used segments is 2016-06-27T00:00:00.000Z/2016-08-01T00:00:00.000Z, which results in an invalid granularity. The compaction task should follow the same segmentGranularity as the bucketing in auto compaction's `CompactionSegmentIterator`. To fix this issue, the segmentGranularity used in the compaction task is now determined in auto compaction, based on the segments returned by auto compaction's `CompactionSegmentIterator`, thus ensuring that we preserve the same bucketing/chunking of segments.

This PR has:
- [x] been self-reviewed.
- [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [x] added integration tests.
- [x] been tested in a test Druid cluster.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
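The sha256 mismatch from the first problem can be illustrated with a small Python sketch. This is not Druid's Java code: the segment IDs, the `overlaps` helper, and the exact way the IDs are fed into the digest are assumptions made for illustration. Only the two intervals come from the description above.

```python
import hashlib
from datetime import datetime, timezone


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical segment records: (segment id, interval start, interval end).
MONTH = ("month_segment", utc(2016, 7, 1), utc(2016, 8, 1))
WEEK = ("week_segment", utc(2016, 6, 27), utc(2016, 7, 4))


def sha256_of_sorted_ids(segments):
    """Hash the sorted segment ids (assumed scheme, for illustration only)."""
    digest = hashlib.sha256()
    for segment_id in sorted(seg[0] for seg in segments):
        digest.update(segment_id.encode("utf-8"))
    return digest.hexdigest()


def overlaps(segment, start, end):
    """True if the segment's interval intersects [start, end)."""
    return segment[1] < end and start < segment[2]


used_segments = [MONTH, WEEK]
task_start, task_end = MONTH[1], MONTH[2]

# Auto compaction only sees the MONTH bucket when it builds the spec.
spec_hash = sha256_of_sorted_ids([MONTH])

# The running task looks up every used segment overlapping the task interval.
task_segments = [s for s in used_segments if overlaps(s, task_start, task_end)]
task_hash = sha256_of_sorted_ids(task_segments)

print(len(task_segments))      # 2: both the MONTH and the WEEK segment
print(spec_hash == task_hash)  # False: the check in the task fails
```

Because the WEEK segment overlaps the start of July, the task-side lookup always picks up a segment that the spec-side hash never covered, so the comparison cannot succeed for this layout.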

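The second problem can be sketched the same way. The `matching_granularity` helper below is hypothetical and much simpler than Druid's own granularity handling: it assumes an interval is valid only if it is exactly one Monday-aligned week or one calendar month. Under that assumption, the union of the two example intervals matches neither, while the MONTH bucket picked by the iterator does.

```python
from datetime import datetime, timedelta, timezone


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


def union_interval(intervals):
    """Smallest interval covering all given (start, end) intervals."""
    return min(s for s, _ in intervals), max(e for _, e in intervals)


def matching_granularity(start, end):
    """Hypothetical check: 'WEEK' or 'MONTH' if the interval is exactly one
    such bucket, otherwise None."""
    if start.weekday() == 0 and end - start == timedelta(days=7):
        return "WEEK"
    first_of_next_month = (start.replace(day=1) + timedelta(days=32)).replace(day=1)
    if start.day == 1 and end == first_of_next_month:
        return "MONTH"
    return None


month = (utc(2016, 7, 1), utc(2016, 8, 1))
week = (utc(2016, 6, 27), utc(2016, 7, 4))

# What the compaction task would see: every used segment in its interval.
union = union_interval([month, week])
print(union[0].date(), union[1].date())  # 2016-06-27 2016-08-01
print(matching_granularity(*union))      # None: no single granularity fits

# What auto compaction sees: the bucket chosen by the iterator.
print(matching_granularity(*month))      # MONTH
```

Deciding the granularity on the auto compaction side, from the segments the iterator returned, keeps the task aligned with the MONTH bucket instead of the 35-day union.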