imply-cheddar commented on PR #13168: URL: https://github.com/apache/druid/pull/13168#issuecomment-1278501898
Thanks @paul-rogers for the write-up of our discussion. I do still think that some changes will be warranted in the `QueryRunnerFactory`, just fewer of them.

Specific to `QueryRunnerFactory`: the code before this PR does a lot of work to organize the `QueryRunners` in time-interval order so that it can short-circuit. That logic only makes sense when the sort starts with `time`. The key, though, is that the merge logic can assume that it is working with results from the segments that are merged.

To Paul's point, once we've decided which algorithm to use, we can be in a code path that only does that one thing. So, one suggestion would be to add a new method: `ScanQueryEngine.processWithMultiColumnSort`. This method would assume that it was called because you want a multi-column sort, and would return a set of things that are sorted "properly".

The primary complexity of my suggestion of building batches of pre-sorted rows is that if the query actually does want to scan all of the rows of the segment (i.e. there is no limit applied and the filters match millions of rows), you will end up doing multiple passes over the segment, with each pass generating the next set of values. This is an acceptable trade-off, as it's okay for these types of queries to be slow (i.e. if you have a data set sorted by X and need it sorted by Y and it's billions of rows, sheer physics is going to dictate that the query is not necessarily going to be the fastest thing).

In order to implement this, you will need to keep track of the previous end point. You will also need to "break ties" by using the rowId, so that if multiple rows share the same values for the sort keys, you can deterministically know which ones to include in which batch.

Hopefully this all makes sense...
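To make the batching idea concrete, here is a minimal Java sketch of the multi-pass approach described above. It is not Druid code: `BatchedSortSketch`, `Entry`, `nextBatch`, and `sortedRowIds` are hypothetical names, and the segment is modeled as an in-memory `long[][]` in which each row holds only its sort-key columns and the array index stands in for the rowId. Each pass scans the whole segment, skips everything at or before the previous end point, and keeps the smallest `batchSize` remaining rows in a bounded heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these names exist in Druid.
final class BatchedSortSketch {

    /** Sort-key values of one row plus its position (rowId) in the segment. */
    static final class Entry {
        final long[] keys;
        final int rowId;

        Entry(long[] keys, int rowId) {
            this.keys = keys;
            this.rowId = rowId;
        }
    }

    /** Compare sort keys column by column, then break ties by rowId. */
    static final Comparator<Entry> ORDER = (a, b) -> {
        int c = Arrays.compare(a.keys, b.keys);
        return c != 0 ? c : Integer.compare(a.rowId, b.rowId);
    };

    /**
     * One full pass over the segment. Returns, in sorted order, up to
     * batchSize entries that come strictly after previousEnd
     * (pass null for the first batch).
     */
    static List<Entry> nextBatch(long[][] segment, Entry previousEnd, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        // Max-heap: the head is the largest of the candidates kept so far.
        PriorityQueue<Entry> heap = new PriorityQueue<>(batchSize, ORDER.reversed());
        for (int rowId = 0; rowId < segment.length; rowId++) {
            Entry e = new Entry(segment[rowId], rowId);
            if (previousEnd != null && ORDER.compare(e, previousEnd) <= 0) {
                continue; // already emitted in an earlier batch
            }
            if (heap.size() < batchSize) {
                heap.add(e);
            } else if (ORDER.compare(e, heap.peek()) < 0) {
                heap.poll(); // evict the current largest candidate
                heap.add(e);
            }
        }
        List<Entry> batch = new ArrayList<>(heap);
        batch.sort(ORDER);
        return batch;
    }

    /** Drives repeated passes until the segment is exhausted; returns rowIds in sort order. */
    static List<Integer> sortedRowIds(long[][] segment, int batchSize) {
        List<Integer> out = new ArrayList<>();
        Entry previousEnd = null;
        while (true) {
            List<Entry> batch = nextBatch(segment, previousEnd, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            for (Entry e : batch) {
                out.add(e.rowId);
            }
            // Remember the end point so the next pass resumes after it.
            previousEnd = batch.get(batch.size() - 1);
        }
        return out;
    }
}
```

Under these assumptions, a segment of `n` matching rows needs roughly `n / batchSize` passes plus one final empty pass, which is the slow-but-acceptable behavior for unlimited scans. Because the rowId is part of the ordering, two rows with identical sort keys can fall on opposite sides of a batch boundary without being dropped or emitted twice.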
