jihoonson commented on issue #7048: Make IngestSegmentFirehoseFactory splittable for parallel ingestion
URL: https://github.com/apache/incubator-druid/pull/7048#issuecomment-462551089

> Are you imagining that the split implementation would query the segments metadata to learn all the segment sizes and the user would specify bytes per split? Would we try to not divide any input segments but just chunk them together?

Yes, this is exactly what I want. The task can ask the coordinator for the segment metadata. Tasks use `CoordinatorClient` when they talk to the coordinator, so you may want to add a new method to it which calls `/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?simple` (http://druid.io/docs/latest/operations/api-reference.html#datasources). Its result looks like this:

```json
{
  "2019-02-11T23:00:00.000Z/2019-02-12T00:00:00.000Z": {
    "size": 21459255,
    "count": 2
  },
  "2019-02-11T22:00:00.000Z/2019-02-11T23:00:00.000Z": {
    "size": 24510542,
    "count": 2
  }
}
```

> This seems like a reasonable option to desire but I kind of feel like people might still want to get started with the simpler "I know my peons can handle an hour of data, just split by hours" anyway... so implementing one of these options doesn't necessarily stop from implementing the other later.

My feeling is that `maxInputSegmentBytesPerTask` is simpler than `taskGranularity`. With `taskGranularity`, people have to think about how many segments are in each time chunk and whether each task can handle them. With `maxInputSegmentBytesPerTask`, they can simply set it to whatever a task can handle. I think we can provide a default, so that people don't even have to think about it in most cases.

If `taskGranularity` is better than `maxInputSegmentBytesPerTask` in some cases, I'm fine with adding it. But I can't think of any. Do you have something in mind?
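To make the chunking idea concrete, here is a minimal Python sketch rather than the actual Druid implementation. It assumes the `?simple` response shown above has already been fetched as a JSON string, and it packs whole intervals greedily under a byte limit without dividing any of them. The function names `sizes_from_response` and `chunk_intervals` are hypothetical, and working at interval granularity (rather than per segment) is an assumption made only because that is what the sample response exposes.

```python
import json

# Sample response copied from the comment above (interval -> size/count).
SAMPLE_RESPONSE = """
{
  "2019-02-11T23:00:00.000Z/2019-02-12T00:00:00.000Z": {"size": 21459255, "count": 2},
  "2019-02-11T22:00:00.000Z/2019-02-11T23:00:00.000Z": {"size": 24510542, "count": 2}
}
"""


def sizes_from_response(body):
    """Parse the JSON body into a dict of interval string -> total bytes."""
    return {interval: info["size"] for interval, info in json.loads(body).items()}


def chunk_intervals(interval_sizes, max_bytes_per_task):
    """Greedily group whole intervals into splits of at most max_bytes_per_task.

    Intervals are visited in sorted (chronological) order and are never divided.
    An interval that is larger than the limit on its own gets its own split.
    """
    splits = []
    current, current_bytes = [], 0
    for interval in sorted(interval_sizes):
        size = interval_sizes[interval]
        # Close the current split if adding this interval would exceed the limit.
        if current and current_bytes + size > max_bytes_per_task:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(interval)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    sizes = sizes_from_response(SAMPLE_RESPONSE)
    for split in chunk_intervals(sizes, 30_000_000):
        print(split)
```

With a limit of 30,000,000 bytes the two sample intervals end up in separate splits; with a limit of 50,000,000 bytes they fit into one. A user would only need to pick the limit (or accept a default), not reason about how many segments each time chunk holds.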
