paul-rogers commented on PR #13168: URL: https://github.com/apache/druid/pull/13168#issuecomment-1278483340
@599166320, I had a chance to discuss this PR with @imply-cheddar, one of the Druid veterans. His suggestion is to build on the design you started with in the previous PR, using the simplifications you've developed in this one. Specifically: * The sort code should reside in `ScanQueryEngine`: since "that is the thing that implements the scan algorithm", including the custom sorting that you are adding. * Use a priority queue, as you did originally, not a sort algorithm as I suggested. That is, emphasize the small-results and limit use cases. Druid is not as often used for large result sets, so the benefit of a sort-the-list approach are not as important as I'd initially thought. * Use the existing merges at the Historical and Broker level. Or, has he pointed out, there is only one merge, used twice. Here, it is just a matter of defining an `Ordering` with the proper comparisons. Sounds like you've already handled this task. In the scan engine, he suggests that you: * Read the individual rows. * Put them into a limited-sized priority queue. * Create batches only when you pull them out of the priority queue. The result is that `ScanQuery.process` produces a `Sequence` of `ScanQueryValue` object as today, only now they are sorted. He also notes that we have several existing priority-queue based merge implementations you can borrow. It looks like you have, in fact, already done that. We discussed the idea of pushing sorts into the cursor. @imply-cheddar, notes that it is uncommon to have sorts of the form `(__time, a, b)`. So, it is perfectly fine to assume, as you have done, that if the sort has multiple columns, or is a single column other than `__time`, that the code use your sort. No need to worry about the occasional odd multi-column sort that starts with `__time`. Since all the code is in `ScanQueryEngine`, no changes are needed to `ScanQueryRunnerFactory`. Basically, the logic we'd been discussing to put there will shift into `ScanQueryEngine` instead. These all sound like fine improvements that helps this feature be more "Druid-like." Perhaps @imply-cheddar will chime in with additional suggestions. My own suggestion is that, to limit code complexity, create two paths in `ScanQueryEngine`: the existing one and the custom sort one. Otherwise, trying to do them both in the same method could be a bit messy. Thanks for the notes earlier about the other areas. Sounds like we're in good shape there. Once you are ready, I'll take another look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
