glasser commented on issue #7048: Make IngestSegmentFirehoseFactory splittable for parallel ingestion URL: https://github.com/apache/incubator-druid/pull/7048#issuecomment-462869668 > Would you tell me more about what kind of overlap you're thinking? I think it's fine if each task processes a disjoint subset of input segments. I was just stressing about what happens about combining the results of processing two segments from a given time chunk on different subtasks, but I think the answer is that everything just works fine. I have a few further questions about overlapping intervals. (a) It looks like the `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}` API only shows us segments that are fully contained inside the given interval. I think that's not what we want — we also want to include partially covered intervals (filtering the rows we read from them of course). Do you know if this happens to be accessible somewhere in the API? Would it be reasonable to add an `includeOverlapping` query parameter to that API which uses `theInterval::overlaps` instead of `theInterval::contains `for its filter. Alternatively, it seems like we could just use the same `SegmentListUsedAction` that the `connect` method uses. Seems like this might be simpler and more clearly equivalent to the unsplit case. In either case it looks like the client in question isn't immediately accessible. I think I would need to add an injection for CoordinatorClient directly into IngestSegmentFirehoseFactory, but it looks like several services use IndexingServiceFirehoseModule without binding CoordinatorClient. Would I want to add a binding for CoordinatorClient directly to IndexingServiceFirehoseModule? Or if using SegmentListUsedAction I would have to make sure to call setToolbox in the supervisor task like it is currently called in the sub task. That seems simpler than using CoordinatorClient. (BTW I notice that when IndexTask calls setToolbox on its firehose factory, it knows how to recurse through CombiningFirehoseFactory, but the parallel sub task does not do that. I think I could add a TaskToolboxConsumingFirehoseFactory interface implemented by IngestSegmentFirehoseFactory and CombiningFirehoseFactory to simplify this and make it more consistent.) (b) I don't really know what happens when we have two segments with overlapping intervals and distinct versions. Eg v1 @ 1:00-3:00, v2 @ 2:00-4:00. Is the data from 1:00-2:00 in the first interval supposed to be visible, or is the entire segment overshadowed? If it's the former, then specifically, when IngestSegmentFirehoseFactory calls ``` final List<TimelineObjectHolder<String, DataSegment>> timeLineSegments = VersionedIntervalTimeline .forSegments(usedSegments) .lookup(interval); ``` will the overshadowing still work if we're in a subtask where we're only passing the first segment in `usedSegments`, or does this calculation have enough context to understand that part of the segment should be ignored even though the segment that overshadows it is not in `usedSegments` in this particular subtask?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
