gianm opened a new issue, #19117:
URL: https://github.com/apache/druid/issues/19117

   Since #18950, native compaction tasks can OOM due to excessive memory usage 
in `CachingLocalSegmentAllocator` when the compaction spec uses hash 
partitioning with a very wide interval, like `0000/9999`. I think this will 
only happen with manual compaction, since with autocompaction, the interval is 
generally targeted to the segments we actually want to compact.
   
   The issue stems from a change in logic in 
`CompactionTask#createDataSchemasForIntervals`. Prior to the PR, the interval 
for the `DataSchema` was determined based solely on the segments actually being 
processed. Afterwards, it's extended to the umbrella interval of those segments 
and the user-provided job interval. The relevant change is at: 
https://github.com/apache/druid/pull/18950/changes#diff-ddc42134e9e789d2b0af5db8c313abd3026f99a919ae26b6137c663f6f2d5228L638-L643
   
   As a result of this change, the `HashPartitionAnalysis` will contain an 
entry for every day in the user-supplied interval. That in turn leads the 
`CachingLocalSegmentAllocator` to generate buckets and shard specs for all 
possible days.
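
   To give a rough sense of the scale, here is a minimal Python sketch (not Druid code) that counts DAY buckets before and after the interval is widened. The helper names `umbrella` and `day_bucket_count` and the sample segment intervals are hypothetical. It assumes the pre-change interval is the umbrella of the segments alone, and it uses the `0001`/`9999` bounds from the SQL statement in this issue, since Python's `date` cannot represent year 0.

```python
from datetime import date


def umbrella(intervals):
    """Smallest half-open [start, end) interval covering all given intervals."""
    return min(s for s, _ in intervals), max(e for _, e in intervals)


def day_bucket_count(start, end):
    """Number of DAY buckets in the half-open interval [start, end)."""
    return (end - start).days


# Hypothetical input: two one-day segments and a very wide user-supplied interval.
segment_intervals = [
    (date(2024, 1, 1), date(2024, 1, 2)),
    (date(2024, 1, 5), date(2024, 1, 6)),
]
user_interval = (date(1, 1, 1), date(9999, 1, 1))

# Assumed pre-change behavior: interval derived from the segments only.
before = day_bucket_count(*umbrella(segment_intervals))

# Post-change behavior as described above: umbrella of segments and user interval.
after = day_bucket_count(*umbrella(segment_intervals + [user_interval]))

print(before, after)
```

   In this model the bucket count goes from a handful to several million, even though the amount of data being compacted is unchanged.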
   
   I don't think the MSQ runner would have a problem like this, because the 
equivalent MSQ statement (like the following SQL) would not attempt to create 
buckets for all 3.6 million days. It would only process day buckets that 
actually contained data. Then it would check whether any tombstones need to 
be created, and create any needed ones. Critically, it will only create 
tombstones for day buckets that actually have both (1) some input data and (2) 
no output data. This happens in `ControllerImpl#findIntervalsToDrop`.
   
   ```
   REPLACE INTO tbl
   OVERWRITE WHERE __time >= TIMESTAMP '0001-01-01 00:00:00' AND __time < TIMESTAMP '9999-01-01 00:00:00'
   SELECT * FROM tbl
   PARTITIONED BY DAY
   ```
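
   The tombstone rule described above (a day bucket gets a tombstone only if it has input data and no output data) can be modeled as a set difference. This is a simplified Python sketch, not the actual logic of `ControllerImpl#findIntervalsToDrop`. The function name `intervals_to_drop` and the sample days are hypothetical.

```python
from datetime import date


def intervals_to_drop(input_days, output_days):
    """Day buckets that had input data but produced no output data."""
    return sorted(set(input_days) - set(output_days))


# Hypothetical example: three days had input rows, but one produced no output
# (for instance, because every row in it was filtered out).
input_days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
output_days = [date(2024, 1, 1), date(2024, 1, 5)]

to_drop = intervals_to_drop(input_days, output_days)
print(to_drop)
```

   The work here is proportional to the number of buckets that actually contain data, not to the width of the `OVERWRITE WHERE` range.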


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

