[GitHub] [druid] paul-rogers commented on pull request #13168: ScanQuery supports multi column orderBy queries

GitBox Thu, 13 Oct 2022 21:56:49 -0700


paul-rogers commented on PR #13168:
URL: https://github.com/apache/druid/pull/13168#issuecomment-1278483340


   @599166320, I had a chance to discuss this PR with @imply-cheddar, one of 
the Druid veterans. His suggestion is to build on the design you started with 
in the previous PR, using the simplifications you've developed in this one. 
Specifically:
   
   * The sort code should reside in `ScanQueryEngine`:  since "that is the 
thing that implements the scan algorithm", including the custom sorting that 
you are adding.
   * Use a priority queue, as you did originally, not a sort algorithm as I 
suggested. That is, emphasize the small-results and limit use cases. Druid is 
not as often used for large result sets, so the benefit of a sort-the-list 
approach are not as important as I'd initially thought.
   * Use the existing merges at the Historical and Broker level. Or, has he 
pointed out, there is only one merge, used twice. Here, it is just a matter of 
defining an `Ordering` with the proper comparisons. Sounds like you've already 
handled this task.
   
   In the scan engine, he suggests that you:
   
   * Read the individual rows.
   * Put them into a limited-sized priority queue.
   * Create batches only when you pull them out of the priority queue.
   
   The result is that `ScanQuery.process` produces a `Sequence` of 
`ScanQueryValue` object as today, only now they are sorted.
   
   He also notes that we have several existing priority-queue based merge 
implementations you can borrow. It looks like you have, in fact, already done 
that.
   
   We discussed the idea of pushing sorts into the cursor. @imply-cheddar, 
notes that it is uncommon to have sorts of the form `(__time, a, b)`. So, it is 
perfectly fine to assume, as you have done, that if the sort has multiple 
columns, or is a single column other than `__time`, that the code use your 
sort. No need to worry about the occasional odd multi-column sort that starts 
with `__time`.
   
   Since all the code is in `ScanQueryEngine`, no changes are needed to 
`ScanQueryRunnerFactory`. Basically, the logic we'd been discussing to put 
there will shift into `ScanQueryEngine` instead.
   
   These all sound like fine improvements that helps this feature be more 
"Druid-like." Perhaps @imply-cheddar will chime in with additional suggestions.
   
   My own suggestion is that, to limit code complexity, create two paths in 
`ScanQueryEngine`: the existing one and the custom sort one. Otherwise, trying 
to do them both in the same method could be a bit messy.
   
   Thanks for the notes earlier about the other areas. Sounds like we're in 
good shape there.
   
   Once you are ready, I'll take another look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on pull request #13168: ScanQuery supports multi column orderBy queries

Reply via email to