glasser commented on issue #7048: Make IngestSegmentFirehoseFactory splittable 
for parallel ingestion
URL: https://github.com/apache/incubator-druid/pull/7048#issuecomment-462869668
 
 
   > Would you tell me more about what kind of overlap you're thinking? I think 
it's fine if each task processes a disjoint subset of input segments.
   
   I was just stressing about what happens about combining the results of 
processing two segments from a given time chunk on different subtasks, but I 
think the answer is that everything just works fine.
   
   I have a few further questions about overlapping intervals.
   
   (a) It looks like the 
`/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}` API 
only shows us segments that are fully contained inside the given interval. I 
think that's not what we want — we also want to include partially covered 
intervals (filtering the rows we read from them of course). Do you know if this 
happens to be accessible somewhere in the API? Would it be reasonable to add an 
`includeOverlapping` query parameter to that API which uses 
`theInterval::overlaps` instead of `theInterval::contains `for its filter.
   
   Alternatively, it seems like we could just use the same 
`SegmentListUsedAction` that the `connect` method uses. Seems like this might 
be simpler and more clearly equivalent to the unsplit case.
   
   In either case it looks like the client in question isn't immediately 
accessible. I think I would need to add an injection for CoordinatorClient 
directly into IngestSegmentFirehoseFactory, but it looks like several services 
use IndexingServiceFirehoseModule without binding CoordinatorClient. Would I 
want to add a binding for CoordinatorClient directly to 
IndexingServiceFirehoseModule?
   
   Or if using SegmentListUsedAction I would have to make sure to call 
setToolbox in the supervisor task like it is currently called in the sub task.  
That seems simpler than using CoordinatorClient.
   
   (BTW I notice that when IndexTask calls setToolbox on its firehose factory, 
it knows how to recurse through CombiningFirehoseFactory, but the parallel sub 
task does not do that.  I think I could add a 
TaskToolboxConsumingFirehoseFactory interface implemented by 
IngestSegmentFirehoseFactory and CombiningFirehoseFactory to simplify this and 
make it more consistent.)
   
   (b) I don't really know what happens when we have two segments with 
overlapping intervals and distinct versions.  Eg v1 @ 1:00-3:00, v2 @ 
2:00-4:00.  Is the data from 1:00-2:00 in the first interval supposed to be 
visible, or is the entire segment overshadowed?  If it's the former, then 
specifically, when IngestSegmentFirehoseFactory calls 
   ```
         final List<TimelineObjectHolder<String, DataSegment>> timeLineSegments 
= VersionedIntervalTimeline
             .forSegments(usedSegments)
             .lookup(interval);
   ```
   
   will the overshadowing still work if we're in a subtask where we're only 
passing the first segment in `usedSegments`, or does this calculation have 
enough context to understand that part of the segment should be ignored even 
though the segment that overshadows it is not in `usedSegments` in this 
particular subtask?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to