GWphua opened a new pull request, #19016:
URL: https://github.com/apache/druid/pull/19016

   Fixes #9712 
   
   ### Description
   This PR adds support for minor compaction: the ability to compact a specific 
subset of segments within a time chunk rather than all segments in the 
interval. Previously, submitting a compaction task with `SpecificSegmentsSpec` 
(segment IDs) would cause Druid to lock, read, and rewrite all segments in the 
umbrella interval, defeating the purpose of targeting specific segments.
   
   The core problem spans multiple layers of the compaction pipeline:
   1. Locking: `CompactionTask#findSegmentsToLock` and all sub-task `findSegmentsToLock()` methods retrieve every segment in the umbrella interval via `RetrieveUsedSegmentsAction`, so the task acquires locks far broader than necessary.
   2. Input resolution: `NativeCompactionRunner#createIoConfig` always passes null for `segmentIds` to `DruidInputSource`, so the input source reads the full interval regardless of the input spec.
   3. Timeline lookup: `retrieveRelevantTimelineHolders()` uses `SegmentTimeline.lookup()`, which requires `Partitions.ONLY_COMPLETE`; a filtered subset of segments appears incomplete and is silently excluded.
   4. Validation: `CompactionTask.SegmentProvider#checkSegments` with `TIME_CHUNK` lock granularity delegates to `SpecificSegmentsSpec.validateSegments()`, which requires an exact match between the spec's segments and all segments in the interval. This guarantees failure whenever the spec contains any proper subset of the interval's segments.
   
   ### Changes and Explanations
   
   #### dropExisting conflict guard
   A constructor-level validation in `CompactionTask` now rejects the combination of `SpecificSegmentsSpec` with `dropExisting = true`, since `dropExisting` semantics replace all segments in the interval, directly contradicting the intent of minor compaction.
   
   <hr>
   
   #### Segment filtering in lock acquisition
   `CompactionTask.findSegmentsToLock()` now filters the result of 
`RetrieveUsedSegmentsAction` to only the segment IDs present in 
`SpecificSegmentsSpec`. The same filtering is applied in `IndexTask`, 
`ParallelIndexSupervisorTask`, and `SinglePhaseSubTask` via 
`CTX_KEY_SPECIFIC_SEGMENTS_TO_COMPACT` propagated from 
`NativeCompactionRunner#createContextForSubtask()`. 
   
   This follows the existing pattern of passing compaction metadata through 
`CTX_KEY_APPENDERATOR_TRACKING_TASK_ID`.
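   The filtering itself is simple set membership; a minimal sketch, assuming a simplified `Segment` type in place of Druid's `DataSegment`:

   ```java
   import java.util.List;
   import java.util.Set;

   class SegmentFilterSketch {
       record Segment(String id) {}

       // After RetrieveUsedSegmentsAction returns every segment in the umbrella
       // interval, keep only those whose IDs appear in the spec so the task
       // locks no more than it needs.
       static List<Segment> filterToSpec(List<Segment> retrieved, Set<String> specIds) {
           return retrieved.stream()
                           .filter(s -> specIds.contains(s.id()))
                           .toList();
       }
   }
   ```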
   
   <hr>
   
   #### `CompactionTask.SegmentProvider` caches segments for `TIME_CHUNK` granularity validation in `checkSegments()`
   
   The intuition behind this approach is:
   1. `SegmentProvider#findSegments` is called first, followed by `SegmentProvider#checkSegments`.
   2. When `findSegments` is called, the lock granularity is not yet known.
   3. `TIME_CHUNK` lock granularity requires all segments in the interval, while segment lock granularity requires only the input segments.
   4. `findSegments` therefore saves all segments in the interval as `allSegmentsInInterval`; `checkSegments` later uses this field when it encounters `TIME_CHUNK` lock granularity.
   
   Honestly, I am not fully satisfied with this approach, since it introduces a temporal coupling: developers now need to remember that `findSegments` must run before `checkSegments`. I would love to hear about any alternatives!
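   The steps above can be sketched as follows; this is an illustrative simplification using plain string IDs, not the actual `SegmentProvider` code:

   ```java
   import java.util.List;

   class SegmentProviderSketch {
       private List<String> allSegmentsInInterval; // cached by findSegments()

       // Step 1: called before the lock granularity is known, so cache the full
       // interval's segments and return only the ones named in the spec.
       List<String> findSegments(List<String> segmentsInInterval, List<String> specIds) {
           this.allSegmentsInInterval = segmentsInInterval;
           return segmentsInInterval.stream().filter(specIds::contains).toList();
       }

       // Step 2: for TIME_CHUNK lock granularity, validate the spec against the
       // cached full set instead of requiring an exact match with the interval.
       void checkSegments(boolean timeChunkLock, List<String> specIds) {
           if (allSegmentsInInterval == null) {
               throw new IllegalStateException("findSegments() must be called first");
           }
           if (timeChunkLock && !allSegmentsInInterval.containsAll(specIds)) {
               throw new IllegalArgumentException(
                   "spec names segments not present in the interval");
           }
       }
   }
   ```

   The `IllegalStateException` makes the temporal coupling explicit at least, though it does not remove it.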
   <hr>
   
   #### Segment-ID-based input for DruidInputSource
   `NativeCompactionRunner#createIoConfig` now detects `SpecificSegmentsSpec` 
and resolves the segment ID strings into `WindowedSegmentId` objects, passing 
them to `DruidInputSource` instead of the interval. 
   
   `DruidInputSource` already supports this code path, but it was never wired 
up from the compaction side.
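   Conceptually, the resolution looks like the sketch below; `WindowedSegmentId` here is a simplified stand-in (a segment ID plus the intervals to read), and the mapping is illustrative rather than the actual `createIoConfig` code:

   ```java
   import java.util.List;

   class IoConfigSketch {
       // Simplified stand-in for Druid's WindowedSegmentId.
       record WindowedSegmentId(String segmentId, List<String> intervals) {}

       // Turn the spec's segment ID strings into windowed-segment inputs so
       // DruidInputSource reads only those segments, not the full interval.
       static List<WindowedSegmentId> resolve(List<String> segmentIds, String umbrellaInterval) {
           return segmentIds.stream()
                            .map(id -> new WindowedSegmentId(id, List.of(umbrellaInterval)))
                            .toList();
       }
   }
   ```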
   <hr>
   
   #### Timeline lookup with incomplete partitions
   `retrieveRelevantTimelineHolders()` now calls `lookupWithIncompletePartitions()` (i.e. `Partitions.INCOMPLETE_OK`) when the input spec is `SpecificSegmentsSpec`.
   
   Without this, a filtered segment set that doesn't cover all partitions in 
the interval produces an empty timeline result and the compaction silently does 
nothing.
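   The mode selection reduces to a conditional like the one below; the `Timeline` interface is a simplified stand-in for Druid's `SegmentTimeline` lookup API:

   ```java
   import java.util.List;

   class TimelineLookupSketch {
       // Simplified stand-in for the two timeline lookup modes.
       interface Timeline {
           List<String> lookup(String interval);                         // ONLY_COMPLETE
           List<String> lookupWithIncompletePartitions(String interval); // INCOMPLETE_OK
       }

       // A hand-picked subset of segments rarely forms complete partitions, so
       // the strict lookup would silently return nothing; use the permissive
       // mode when the input spec names specific segments.
       static List<String> relevantHolders(Timeline timeline, String interval, boolean specificSegments) {
           return specificSegments
                  ? timeline.lookupWithIncompletePartitions(interval)
                  : timeline.lookup(interval);
       }
   }
   ```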
   <hr>
   
   #### MSQ engine rejection
   MSQ compaction is fundamentally incompatible with minor compaction: it forces `dropExisting = true`, uses `REPLACE` ingestion mode (which acquires `TIME_CHUNK` locks covering the full interval), and queries via `MultipleIntervalSegmentSpec`. A validation check is added in `MSQCompactionRunner.validateCompactionTask()` to reject `SpecificSegmentsSpec` with an explicit error message rather than failing in an opaque way downstream.
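   The shape of the check is roughly as follows; the result and spec types below are simplified stand-ins, not the real `MSQCompactionRunner` signatures:

   ```java
   class MsqValidationSketch {
       // Simplified stand-ins for the runner's validation result and spec type.
       record ValidationResult(boolean valid, String reason) {}
       record SpecificSegmentsSpec(java.util.List<String> segmentIds) {}

       // Fail fast with an explicit reason instead of letting the MSQ engine
       // fail opaquely downstream on an input spec it cannot honor.
       static ValidationResult validateInputSpec(Object inputSpec) {
           if (inputSpec instanceof SpecificSegmentsSpec) {
               return new ValidationResult(
                   false,
                   "MSQ engine does not support compaction of specific segments (SpecificSegmentsSpec)");
           }
           return new ValidationResult(true, null);
       }
   }
   ```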
   
   #### Release note
   Compaction tasks using `SpecificSegmentsSpec` (a segment ID list) now correctly compact only the specified segments instead of all segments in the umbrella interval. This feature is not supported by the MSQ compaction engine.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `CompactionTask`
    * `NativeCompactionRunner`
    * `IndexTask`
    * `ParallelIndexSupervisorTask`
    * `SinglePhaseSubTask`
    * `MSQCompactionRunner`
    * `CompactionTaskTest` / `TaskLockHelperTest`
   
   <hr>
   
   
   This PR has:
   
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [x] a release note entry in the PR description.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.

