imply-cheddar commented on pull request #12331:
URL: https://github.com/apache/druid/pull/12331#issuecomment-1067634671


   Meta comment: this definitely helps in the subset of cases where multiple 
tasks cover the same interval, but it doesn't help when there are simply a 
very large number of intervals.  If we are taking the time to make this 
better, I think we can make it better for all cases without too much extra work 
by making `extractDimensionsFromReport` persist the `StringDistribution` 
objects to tmp storage (each task has a tmp storage space that it can use).  
We can create a directory for each time interval covered and store that 
interval's `StringDistribution` objects in it, using something like the task 
number as the file name.
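
   A rough sketch of that layout, just to make the idea concrete.  Everything 
here is hypothetical and not existing Druid code: the `DistributionSpill` 
class, the `persist` method, and the assumption that a `StringDistribution` has 
already been serialized to a `byte[]` by whatever serde the task uses.  The 
interval string is sanitized because ISO-8601 intervals contain `/` and `:`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not Druid code): one directory per interval,
// one file per task inside that directory.
class DistributionSpill {
    private final Path baseDir;

    DistributionSpill(Path baseDir) {
        this.baseDir = baseDir;
    }

    // ISO-8601 intervals contain '/' and ':', which are unsafe in file names,
    // so map them to '_' and '-'. This naming scheme is an assumption.
    static String dirNameFor(String interval) {
        return interval.replace('/', '_').replace(':', '-');
    }

    // Writes the already-serialized distribution to <baseDir>/<interval>/<taskId>
    // and returns the path of the file that was written.
    Path persist(String interval, String taskId, byte[] serialized) throws IOException {
        Path dir = baseDir.resolve(dirNameFor(interval));
        Files.createDirectories(dir);
        Path file = dir.resolve(taskId);
        Files.write(file, serialized);
        return file;
    }
}
```

   With this, the heap only needs to hold the distribution currently being 
written, not every distribution from every report.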
   
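   Then, when it comes time to determine partitions, we can lazily walk the 
interval directories one at a time, merge the distributions stored in each, and 
figure out the partitions for that interval, so that only one interval's 
distributions need to be in memory at once.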
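
   The lazy walk could look something like the sketch below, assuming one 
directory per interval with one file per task.  `IntervalMerger` and 
`forEachInterval` are made-up names, and the deserialize and merge steps are 
passed in as functions because I'm not assuming anything about the actual 
`StringDistribution` serde or merge API here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical helper (not Druid code): visits one interval directory at a
// time so that only one merged distribution is held on heap at once.
class IntervalMerger {
    static <D> void forEachInterval(
            Path baseDir,
            Function<byte[], D> deserialize,   // stand-in for the real serde
            BinaryOperator<D> merge,           // stand-in for the real merge
            BiConsumer<String, D> consumer) throws IOException {
        try (DirectoryStream<Path> intervals =
                 Files.newDirectoryStream(baseDir, p -> Files.isDirectory(p))) {
            for (Path intervalDir : intervals) {
                D merged = null;
                // Fold every per-task file of this interval into one distribution.
                try (DirectoryStream<Path> files = Files.newDirectoryStream(intervalDir)) {
                    for (Path file : files) {
                        D next = deserialize.apply(Files.readAllBytes(file));
                        merged = (merged == null) ? next : merge.apply(merged, next);
                    }
                }
                // Hand the merged result to the caller (e.g. to compute partitions),
                // then drop it before moving on to the next interval.
                if (merged != null) {
                    consumer.accept(intervalDir.getFileName().toString(), merged);
                }
            }
        }
    }
}
```

   The caller would compute partitions inside `consumer` and keep only the 
resulting partition boundaries, which are much smaller than the distributions 
themselves.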
   
   This adds some disk usage and I/O to this job, but this step is generally 
such a small part of the overall cost of a job that, even if this mechanism is 
10x slower than the current code, nobody will notice it in end-to-end 
ingestion times.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


