gianm opened a new issue, #19117: URL: https://github.com/apache/druid/issues/19117
Since #18950, native compaction tasks can OOM due to excessive memory usage in `CachingLocalSegmentAllocator` when the compaction spec uses hash partitioning with a very wide interval, like `0000/9999`. I think this will only happen with manual compaction, since with autocompaction the interval is generally targeted to the segments we actually want to compact.

The issue stems from a change in logic in `CompactionTask#createDataSchemasForIntervals`. Prior to the PR, the interval for the `DataSchema` was determined based solely on the segments actually being processed. After the PR, it is extended to the umbrella interval of those segments and the user-provided job interval. The relevant change is at: https://github.com/apache/druid/pull/18950/changes#diff-ddc42134e9e789d2b0af5db8c313abd3026f99a919ae26b6137c663f6f2d5228L638-L643

As an effect of this change, the `HashPartitionAnalysis` will contain an entry for every day in the user-supplied interval. That in turn leads the `CachingLocalSegmentAllocator` to generate buckets and shard specs for all possible days.

I don't think the MSQ runner would have a problem like this, because the equivalent MSQ statement (like the following SQL) would not attempt to create buckets for all 3.6 million days. It would only process day buckets that actually contain data. Then it would check whether any tombstones need to be created, and create any needed ones. Critically, it will only create tombstones for day buckets that have both (1) some input data and (2) no output data. This happens in `ControllerImpl#findIntervalsToDrop`.

```
REPLACE INTO tbl
OVERWRITE WHERE __time >= TIMESTAMP '0001-01-01 00:00:00' AND __time < TIMESTAMP '9999-01-01 00:00:00'
SELECT * FROM tbl
PARTITIONED BY DAY
```
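To give a rough sense of the scale involved, here is a minimal sketch in plain Java (`java.time` only, no Druid classes). It compares the number of DAY buckets needed to cover a whole interval against the number needed when only days that actually contain rows get a bucket. The class name, method names, and sample rows are hypothetical and exist only for illustration. It also assumes that `0000/9999` means `0000-01-01/9999-01-01`. It is not how `CachingLocalSegmentAllocator` or the MSQ controller is implemented.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from Druid.
public class DayBucketSketch {

    // Interval-driven: one DAY bucket for every day in [start, end),
    // whether or not any data falls on that day.
    public static long bucketsForInterval(LocalDate start, LocalDate end) {
        return ChronoUnit.DAYS.between(start, end);
    }

    // Data-driven: one DAY bucket per distinct day that has at least one row.
    public static long bucketsForData(List<LocalDate> rowDays) {
        return new HashSet<>(rowDays).size();
    }

    public static void main(String[] args) {
        // Assumed reading of the "0000/9999" interval from the issue.
        LocalDate start = LocalDate.of(0, 1, 1);
        LocalDate end = LocalDate.of(9999, 1, 1);

        // Made-up sample: three rows spread over two distinct days.
        List<LocalDate> rowDays = List.of(
            LocalDate.of(2024, 1, 1),
            LocalDate.of(2024, 1, 1),
            LocalDate.of(2024, 1, 2)
        );

        System.out.println("interval-driven buckets: " + bucketsForInterval(start, end));
        System.out.println("data-driven buckets:     " + bucketsForData(rowDays));
    }
}
```

Under these assumptions the interval-driven count comes to about 3.65 million, while the data-driven count depends only on how many days hold rows. If every bucket also carries a shard spec and related bookkeeping, the interval-driven approach can plausibly exhaust the heap before any data is read.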
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
