imply-cheddar commented on pull request #12331: URL: https://github.com/apache/druid/pull/12331#issuecomment-1067634671
Meta comment: this definitely helps in the subset of cases where multiple tasks cover the same interval, but it doesn't help when there are simply lots and lots of intervals. If we are taking the time to make this better, I think we can make it better for all cases without too much extra work by making `extractDimensionsFromReport` persist the `StringDistribution` objects to tmp storage (each task has a tmp storage space it can use). We can create a directory for each time interval covered and store the `StringDistribution` objects in it, using something like the task number as the file name. Then, when it comes time to determine partitions, we can lazily walk the interval directories one at a time, merge the distributions in each, and figure out the partitions. This adds some disk usage and I/O to the job, but this step is generally such a small part of the whole cost of a job that even if this mechanism is 10x slower than the current code, nobody will notice it in end-to-end ingestion times.
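To make the suggestion concrete, here is a rough sketch of the spill-and-merge flow. It is not Druid code: `IntervalDistributionSpill`, `persist`, `determinePartitions`, and `boundaries` are hypothetical names, a plain list of sampled strings stands in for a `StringDistribution`, and the interval directory name is assumed to be already sanitized for the file system. The point is only that memory use is bounded by one interval at a time.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Druid. A List<String> of sampled values
// stands in for a real StringDistribution.
final class IntervalDistributionSpill {
  private final Path baseDir; // assumed to be the task's tmp storage directory

  IntervalDistributionSpill(Path baseDir) {
    this.baseDir = baseDir;
  }

  // Called as each sub-task report arrives: write the samples to
  // <baseDir>/<interval>/<taskId> instead of holding them in memory.
  void persist(String interval, String taskId, List<String> samples) throws IOException {
    Path dir = Files.createDirectories(baseDir.resolve(interval));
    Files.write(dir.resolve(taskId), samples);
  }

  // Walk the interval directories one at a time. Only the files of the
  // interval currently being processed are loaded and merged.
  Map<String, List<String>> determinePartitions(int numPartitions) throws IOException {
    Map<String, List<String>> result = new TreeMap<>();
    try (DirectoryStream<Path> intervalDirs = Files.newDirectoryStream(baseDir)) {
      for (Path intervalDir : intervalDirs) {
        List<String> merged = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(intervalDir)) {
          for (Path file : files) {
            merged.addAll(Files.readAllLines(file));
          }
        }
        // Stand-in for merging distributions: sort the combined samples.
        Collections.sort(merged);
        result.put(intervalDir.getFileName().toString(), boundaries(merged, numPartitions));
      }
    }
    return result;
  }

  // Pick evenly spaced values from the sorted samples as partition
  // boundaries, skipping consecutive duplicates.
  static List<String> boundaries(List<String> sorted, int numPartitions) {
    List<String> out = new ArrayList<>();
    for (int i = 1; i < numPartitions; i++) {
      int idx = (int) ((long) i * sorted.size() / numPartitions);
      if (idx < sorted.size()) {
        String boundary = sorted.get(idx);
        if (out.isEmpty() || !out.get(out.size() - 1).equals(boundary)) {
          out.add(boundary);
        }
      }
    }
    return out;
  }
}
```

In a real implementation the files would hold serialized `StringDistribution` objects and the sort would be replaced by whatever merge operation those objects support. The directory-per-interval, file-per-task layout would stay the same.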
