jihoonson commented on issue #9712: URL: https://github.com/apache/druid/issues/9712#issuecomment-620122737
@yuanlihan thanks for the suggestion. I agree that auto compaction should be able to skip segments that are already a good size. My previous concern may no longer be a problem if we check both the segment size and whether the segment has already been compacted. For example, we can skip compaction even for small segments if they were created by compaction.

However, the first property, `skipSegmentWithSizeBytesGreaterThan`, could introduce a couple of issues. First, minor compaction couldn't compact segments whose sizes follow an alternating pattern (small segment, large segment, small segment, large segment, ...) when the large segments are greater than `skipSegmentWithSizeBytesGreaterThan`. Second, compaction can be used for splitting big segments as well as merging small ones. This might not be an issue in practice, since minor compaction would mostly be used with streaming ingestion, which usually creates small segments, but it would be nice to still support this case. Maybe we can add `targetRowsPerSegment`, so that auto compaction can skip a segment whose row count is already close to it.

For the second property, is there a use case where you don't want to compact segments using minor compaction? Or can we always compact if there are 2 or more segments?

> Since the type of `partitionSpec` of minor compaction task should be `dynamic` only(is it?), we can adjust `maxRowsPerSegment` if segments created by minor compaction tasks is still too small.

This is true for now, but I think it should support all `partitionsSpec` types. Hash and range partitioning will be a primary use case once they are supported. There is also a need to support `maxRowsPerSegment` for hash and range partitioning to handle data skew.
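To make the alternating-size concern concrete, here is a minimal sketch. It assumes that minor compaction can only merge a run of two or more adjacent segments and that any segment above the size threshold is skipped, which breaks the run. The function name `compactable_runs`, the 400 MB threshold, and the segment sizes are all hypothetical and are not taken from Druid.

```python
from typing import List

MB = 1024 * 1024


def compactable_runs(
    sizes: List[int], skip_greater_than: int, min_run: int = 2
) -> List[List[int]]:
    """Return runs of adjacent segment indexes that could be compacted together.

    Hypothetical model: a segment larger than `skip_greater_than` is skipped
    and ends the current run; a run is kept only if it has at least
    `min_run` segments.
    """
    runs: List[List[int]] = []
    current: List[int] = []
    for index, size in enumerate(sizes):
        if size > skip_greater_than:
            # A skipped large segment separates its neighbors.
            if len(current) >= min_run:
                runs.append(current)
            current = []
        else:
            current.append(index)
    if len(current) >= min_run:
        runs.append(current)
    return runs


threshold = 400 * MB  # hypothetical skipSegmentWithSizeBytesGreaterThan value

# Small and large segments alternate: every small segment is isolated.
alternating = [20 * MB, 500 * MB, 30 * MB, 600 * MB, 25 * MB, 550 * MB]

# Same sizes, but the small segments sit next to each other.
clustered = [20 * MB, 30 * MB, 25 * MB, 500 * MB, 600 * MB, 550 * MB]

print(compactable_runs(alternating, threshold))  # []
print(compactable_runs(clustered, threshold))    # [[0, 1, 2]]
```

Under these assumptions the alternating layout never produces a compactable run, so the small segments would stay small indefinitely, while the clustered layout compacts the first three segments as expected.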
