maytasm opened a new pull request #12062:
URL: https://github.com/apache/druid/pull/12062


   Support overlapping segment intervals in auto compaction
   
   ### Description
   
This PR fixes two problems that occur when Druid compacts overlapping segment intervals via auto compaction.
   Imagine we have a segment with interval 2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z (MONTH segmentGranularity) and another segment with interval 2016-06-27T00:00:00.000Z/2016-07-04T00:00:00.000Z (WEEK segmentGranularity).
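   The two example intervals above genuinely overlap, which is what triggers both problems. A minimal sketch of the overlap (using a hypothetical `Interval` record for illustration; Druid itself uses Joda-Time `Interval`s):

   ```java
   import java.time.Instant;

   public class OverlapSketch {
       // Hypothetical minimal interval type, for illustration only.
       // Druid intervals are half-open: [start, end).
       record Interval(Instant start, Instant end) {
           boolean overlaps(Interval other) {
               // Two half-open intervals overlap when each starts before the other ends.
               return start.isBefore(other.end) && other.start.isBefore(end);
           }
       }

       public static void main(String[] args) {
           Interval month = new Interval(
               Instant.parse("2016-07-01T00:00:00.000Z"),
               Instant.parse("2016-08-01T00:00:00.000Z"));
           Interval week = new Interval(
               Instant.parse("2016-06-27T00:00:00.000Z"),
               Instant.parse("2016-07-04T00:00:00.000Z"));
           // The WEEK segment spills into the MONTH time chunk.
           System.out.println(month.overlaps(week)); // prints "true"
       }
   }
   ```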
   - The first problem is that auto compaction's `CompactionSegmentIterator` only returns segments from a single time chunk bucket. For example, NewestSegmentFirstIterator would return the interval 2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z and submit a compaction task for that interval. However, the segments returned by the iterator would contain only the MONTH segment, so the sha256OfSortedSegmentIds calculated by auto compaction covers only the MONTH segment (2016-07-01T00:00:00.000Z/2016-08-01T00:00:00.000Z). This causes the compaction task to fail when it starts running: the task fetches all segments marked as used in the interval (which would be both the WEEK segment and the MONTH segment), computes their sha256, and compares it with the sha256 in the compaction spec. The two hashes differ because the spec's sha256 covers only the MONTH segment. This issue is fixed by removing sha256OfSortedSegmentIds from the compaction task spec created by auto compaction. sha256OfSortedSegmentIds was added in https://github.com/apache/druid/pull/8571 to enforce a limit on the number of segments in one compaction task, but this is no longer necessary as the compaction task can use parallel ingestion tasks.
   - The second problem arises when segmentGranularity is not set in the auto compaction config. In that case, the compaction task determines the segmentGranularity from the segments marked used in the compaction task interval. Here, the union interval of the used segments would be 2016-06-27T00:00:00.000Z/2016-08-01T00:00:00.000Z, which does not correspond to any valid granularity. The compaction task should follow the same segmentGranularity as the bucketing in auto compaction's `CompactionSegmentIterator`. To fix this issue, the segmentGranularity to be used in the compaction task is now determined in auto compaction based on the segments returned by auto compaction's `CompactionSegmentIterator`, thus preserving the same bucketing/chunking of segments.
   
   This PR has:
   - [x] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] added integration tests.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
