glasser commented on issue #7048: Make IngestSegmentFirehoseFactory splittable 
for parallel ingestion
URL: https://github.com/apache/incubator-druid/pull/7048#issuecomment-462986898
 
 
   That sounds reasonable, with this caveat: you previously said:
   
   > For an interval larger than the max bytes setting, I think each subTask 
should process a subset of segments, so that the large interval can be 
processed in parallel.
   
   and I think what you meant is that if there are several segments with the 
same interval and version (but different partition numbers), it *would* be 
OK to split them up across subtasks.
   
   So I think the algorithm would be something like: list the segments for the 
whole interval as a timeline.  Select the first segment, and take the set of 
all segments that overlap it, transitively.  If this set covers more than one 
distinct interval, then all of those segments are constrained to go in the same 
subtask.  Otherwise, each segment in the set (all of which are for the same 
interval) may go in its own subtask.
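   A minimal sketch of that grouping step, in Python rather than Druid's Java, with hypothetical `(start, end, partitionNum)` tuples standing in for real segment descriptors:

   ```python
   def group_segments(segments):
       """Partition segments into subsets that must stay together.

       Each segment is a hypothetical (start, end, partition_num) tuple; real
       Druid segments carry an interval, version, and partition number.
       A transitive-overlap closure that spans more than one distinct interval
       becomes a single group; segments sharing one identical interval may
       each become their own group.
       """
       segments = sorted(segments)  # timeline order by start time
       groups = []
       i = 0
       while i < len(segments):
           # Grow the transitive-overlap closure starting at segments[i].
           j = i + 1
           max_end = segments[i][1]
           while j < len(segments) and segments[j][0] < max_end:
               max_end = max(max_end, segments[j][1])
               j += 1
           chunk = segments[i:j]
           distinct_intervals = {(s[0], s[1]) for s in chunk}
           if len(distinct_intervals) > 1:
               groups.append(chunk)                # mixed intervals: keep together
           else:
               groups.extend([s] for s in chunk)   # one interval: free to split
           i = j
       return groups
   ```

   For example, three partitions of interval `[0, 10)` plus an overlapping `[5, 15)` segment would form one inseparable group, while two partitions of a standalone `[20, 30)` interval would each be splittable on their own.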
   
   We've now partitioned the full set of segments into subsets that have to go 
together.  We can then divide the whole list up into subtasks.  This is the 
[bin packing problem](https://en.wikipedia.org/wiki/Bin_packing_problem). 
Rather than trying to solve it optimally, or even using the first-fit 
heuristic, I would just use the greedy algorithm that walks down the list of 
subsets in order and assigns them to subtasks, since this is more likely to 
get segments of the same interval onto the same subtask and thus lead to fewer 
output segments.
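   That greedy assignment could look like the following sketch (the `max_bytes` parameter and per-subset byte sizes are hypothetical stand-ins for the task's max bytes setting):

   ```python
   def assign_subtasks(groups, max_bytes):
       """Greedily pack subsets of segments into subtasks, in list order.

       Unlike first-fit, this never revisits an earlier subtask: it walks the
       timeline-ordered list and opens a new subtask only when the current one
       would exceed max_bytes, which keeps adjacent intervals together.
       groups: list of (group_id, size_in_bytes) pairs in timeline order.
       """
       subtasks = [[]]
       current_bytes = 0
       for group_id, size in groups:
           if current_bytes + size > max_bytes and subtasks[-1]:
               subtasks.append([])   # start a new subtask
               current_bytes = 0
           subtasks[-1].append(group_id)
           current_bytes += size
       return subtasks
   ```

   This is essentially the next-fit bin-packing heuristic: it can waste some capacity compared to first-fit, but it preserves timeline locality, which is what matters for minimizing the number of output segments.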

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
