jihoonson commented on issue #9712:
URL: https://github.com/apache/druid/issues/9712#issuecomment-620122737


   @yuanlihan thanks for the suggestion. I agree that auto compaction 
should be able to skip segments that are already a good size. My previous 
concern may no longer be a problem if we check both the segment size and 
whether the segment has already been compacted. For example, we can skip 
compaction even for small segments if they were created by compaction.
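
A minimal sketch of that combined check, for illustration only. The `Segment` class, its fields, and the `good_size_bytes` parameter are hypothetical names, not Druid's actual classes or configuration properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical stand-in for segment metadata; not Druid's own segment class.
    size_bytes: int
    created_by_compaction: bool


def needs_compaction(segment: Segment, good_size_bytes: int) -> bool:
    """Return True only for segments that are small and not yet compacted."""
    if segment.created_by_compaction:
        # Already compacted: skip it even if the result is still small.
        return False
    # Not compacted yet: compact only if it is below the "good" size.
    return segment.size_bytes < good_size_bytes
```

With this rule, a small segment produced by an earlier compaction is not picked up again, which avoids recompacting the same data on every run.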
   
   However, the first property, `skipSegmentWithSizeBytesGreaterThan`, seems 
likely to introduce a couple of issues. First, minor compaction couldn't 
compact segments if their sizes follow a pattern of (small segment, 
large segment, small segment, large segment, ...) and the large segments are 
greater than `skipSegmentWithSizeBytesGreaterThan`. Second, compaction can be 
used for splitting big segments as well as merging small segments, and 
skipping large segments would rule out the splitting case. This might not be 
an issue in practice, since minor compaction would be used mostly with 
streaming ingestion, which usually creates small segments, but it would be 
nice to still support this case. Maybe we can add `targetRowsPerSegment`, so 
that auto compaction can skip a segment if its number of rows is already 
close to that target.
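
The two points above can be sketched roughly as follows. This assumes that minor compaction only merges runs of adjacent segments, which is what the alternating pattern implies. The function names and the `tolerance` parameter are hypothetical and exist only for illustration.

```python
from typing import List


def mergeable_runs(sizes: List[int], skip_above: int) -> List[List[int]]:
    """Return index runs of adjacent, non-skipped segments with 2+ members."""
    runs: List[List[int]] = []
    current: List[int] = []
    for index, size in enumerate(sizes):
        if size > skip_above:
            # A skipped (large) segment breaks the run of adjacent candidates.
            if len(current) >= 2:
                runs.append(current)
            current = []
        else:
            current.append(index)
    if len(current) >= 2:
        runs.append(current)
    return runs


def has_similar_row_count(num_rows: int, target_rows: int, tolerance: float = 0.2) -> bool:
    """Row-based skip check: True if num_rows is within tolerance of the target."""
    return abs(num_rows - target_rows) <= tolerance * target_rows
```

For sizes `[10, 600, 10, 600, 10]` and a threshold of 500, `mergeable_runs` finds no run of two or more small segments, so nothing would be merged. The row-based check, by contrast, does not skip a segment with far more rows than the target, so such a segment could still be split.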
   
   For the second property, is there a use case where you don't want to compact 
segments using minor compaction? Or can we always compact if there are 2 or 
more segments?
   
   > Since the type of `partitionSpec` of minor compaction task should be 
`dynamic` only(is it?), we can adjust `maxRowsPerSegment` if segments created 
by minor compaction tasks is still too small.
   
   This is true for now, but I think it should support all `partitionsSpec` 
types. Hash and range partitioning will be a primary use case once they are 
supported. There is also a need to support `maxRowsPerSegment` for hash and 
range partitioning to handle data skew.
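
To illustrate why a row cap helps with skew, here is a rough sketch. It is not Druid's hashing or partitioning code: the CRC32-based bucketing and both function names are assumptions made for this example.

```python
import zlib
from collections import Counter
from typing import Dict, List


def hash_bucket_counts(keys: List[str], num_shards: int) -> Dict[int, int]:
    """Count rows per hash bucket; one hot key puts most rows in one bucket."""
    counts = Counter(zlib.crc32(key.encode("utf-8")) % num_shards for key in keys)
    return {bucket: counts.get(bucket, 0) for bucket in range(num_shards)}


def split_with_row_cap(bucket_rows: int, max_rows: int) -> List[int]:
    """Split one oversized bucket into pieces of at most max_rows rows each."""
    full_pieces, remainder = divmod(bucket_rows, max_rows)
    return [max_rows] * full_pieces + ([remainder] if remainder else [])
```

If 900 of 1,000 rows share a single key, one bucket holds at least 900 rows no matter how many shards there are. Applying a cap, as in `split_with_row_cap(950, 300)`, breaks such a bucket into pieces of 300, 300, 300, and 50 rows.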

