jihoonson commented on issue #7048: Make IngestSegmentFirehoseFactory 
splittable for parallel ingestion
URL: https://github.com/apache/incubator-druid/pull/7048#issuecomment-462551089
 
 
   > Are you imagining that the split implementation would query the segments 
metadata to learn all the segment sizes and the user would specify bytes per 
split? Would we try to not divide any input segments but just chunk them 
together?
   
   Yes, this is exactly what I want. The task can ask the coordinator for 
the segment metadata. Tasks use `CoordinatorClient` to talk to the 
coordinator, so you may want to add a new method to it that calls 
`/druid/coordinator/v1/datasources/{dataSourceName}/intervals/{interval}?simple`
 (http://druid.io/docs/latest/operations/api-reference.html#datasources). Its 
result looks like this:
   
   ```json
   {
     "2019-02-11T23:00:00.000Z/2019-02-12T00:00:00.000Z": {
       "size": 21459255,
       "count": 2
     },
     "2019-02-11T22:00:00.000Z/2019-02-11T23:00:00.000Z": {
       "size": 24510542,
       "count": 2
     }
   }
   ```
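   
   Given a response like the one above, the "chunk them together without 
dividing" idea could be sketched roughly as below. This is only an 
illustration, not existing Druid code: the class `SplitSketch` and the method 
`split` are hypothetical names, and it assumes the response has already been 
parsed into a map from interval to total size in bytes. It greedily packs 
intervals, in the order given, until adding the next one would exceed 
`maxInputSegmentBytesPerTask`.
   
   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch; not part of Druid.
   class SplitSketch
   {
     // Groups intervals into splits whose total size stays within the limit.
     // An interval larger than the limit is never divided; it gets its own split.
     public static List<List<String>> split(
         Map<String, Long> sizeByInterval,
         long maxInputSegmentBytesPerTask
     )
     {
       List<List<String>> splits = new ArrayList<>();
       List<String> current = new ArrayList<>();
       long currentBytes = 0;
       for (Map.Entry<String, Long> entry : sizeByInterval.entrySet()) {
         long size = entry.getValue();
         // Close the current split if this interval would push it over the limit.
         if (!current.isEmpty() && currentBytes + size > maxInputSegmentBytesPerTask) {
           splits.add(current);
           current = new ArrayList<>();
           currentBytes = 0;
         }
         current.add(entry.getKey());
         currentBytes += size;
       }
       if (!current.isEmpty()) {
         splits.add(current);
       }
       return splits;
     }

     public static void main(String[] args)
     {
       // Sizes taken from the sample response above.
       Map<String, Long> sizes = new LinkedHashMap<>();
       sizes.put("2019-02-11T23:00:00.000Z/2019-02-12T00:00:00.000Z", 21459255L);
       sizes.put("2019-02-11T22:00:00.000Z/2019-02-11T23:00:00.000Z", 24510542L);
       System.out.println(split(sizes, 30_000_000L).size() + " splits with a 30 MB limit");
       System.out.println(split(sizes, 50_000_000L).size() + " splits with a 50 MB limit");
     }
   }
   ```
   
   Note that `?simple` reports sizes per interval rather than per segment, so 
this sketch can only group whole intervals. Packing individual segments would 
need a more detailed response.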
   
   > This seems like a reasonable option to desire but I kind of feel like 
people might still want to get started with the simpler "I know my peons can 
handle an hour of data, just split by hours" anyway... so implementing one of 
these options doesn't necessarily stop from implementing the other later.
   
   My feeling is that `maxInputSegmentBytesPerTask` is simpler than 
`taskGranularity` because, with `taskGranularity`, people have to think about 
how many segments are in each time chunk and whether each task can handle 
them. With `maxInputSegmentBytesPerTask`, they can just set it to whatever a 
task can handle. I think we can provide a default, so that people don't even 
have to think about it in most cases.
   
   If `taskGranularity` is better than `maxInputSegmentBytesPerTask` in some 
cases, I'm fine with adding it, but I can't think of any. Do you have 
something in mind?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

