[GitHub] [druid] imply-cheddar commented on pull request #13168: ScanQuery supports multi column orderBy queries

GitBox Thu, 13 Oct 2022 22:27:35 -0700


imply-cheddar commented on PR #13168:
URL: https://github.com/apache/druid/pull/13168#issuecomment-1278501898


   Thanks @paul-rogers for the write-up of our discussion.  I do still think 
that some changes will be warranted in the `QueryRunnerFactory`, just fewer of 
them.  Specifically, the `QueryRunnerFactory` code before this PR does a lot of 
work to organize the `QueryRunners` in time-interval order so that it can 
short-circuit.  That logic only makes sense when the sort starts with 
`time`.  The key, though, is that the merge logic can assume that it is working 
with results from the segments that are merged.
   
   To Paul's point, once we've decided which algorithm to use, we can be in a 
code path that only does that one thing.  So one suggestion would be to add a 
new method: `ScanQueryEngine.processWithMultiColumnSort`.  This method 
would assume that it was called because you want a multi-column sort, and would 
then return results that are sorted "properly".
   
   The primary complexity of my suggestion of building batches of pre-sorted 
rows is that if the query actually does want to scan all of the rows of the 
segment (i.e. there is no limit applied and the filters match millions of rows), you 
will end up doing multiple passes over the segment (each pass would generate 
the next set of values).  This is an acceptable trade-off as it's okay for 
these types of queries to be slow (i.e. if you have a data set sorted by X and 
need it sorted by Y and it's billions of rows, sheer physics is going to 
dictate that the query is not necessarily going to be the fastest thing).
   
   In order to implement this, you will need to be able to keep track of the 
previous end point.  You will also need to "break ties" by using the rowId so 
that if there are multiple rows that share the same values for the sort keys, 
you can deterministically know which ones to include in which batch.
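   
   To make the batching idea concrete, here is a minimal sketch in plain Java. 
It is not Druid code: `BatchedSortSketch`, `Row`, `nextBatch` and `scanSorted` 
are hypothetical names, the "segment" is just a `long[]` holding a single sort 
key, and the array index stands in for the rowId. Each call to `nextBatch` is 
one full pass over the segment that keeps only the `batchSize` smallest rows 
that come strictly after the previous end point.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Druid.
class BatchedSortSketch {
    /** A row tagged with its position in the segment. */
    static final class Row {
        final int rowId;
        final long sortKey;

        Row(int rowId, long sortKey) {
            this.rowId = rowId;
            this.sortKey = sortKey;
        }
    }

    // Total order: sort key first, then rowId as the tie-breaker, so rows
    // with equal keys always land in a well-defined batch.
    static final Comparator<Row> ORDER =
        Comparator.<Row>comparingLong(r -> r.sortKey).thenComparingInt(r -> r.rowId);

    /** One pass over the segment: the batchSize smallest rows strictly after previousEnd. */
    static List<Row> nextBatch(long[] segment, Row previousEnd, int batchSize) {
        // Max-heap of the best candidates seen so far in this pass.
        PriorityQueue<Row> heap = new PriorityQueue<>(ORDER.reversed());
        for (int rowId = 0; rowId < segment.length; rowId++) {
            Row row = new Row(rowId, segment[rowId]);
            if (previousEnd != null && ORDER.compare(row, previousEnd) <= 0) {
                continue; // already emitted by an earlier batch
            }
            heap.add(row);
            if (heap.size() > batchSize) {
                heap.poll(); // evict the largest candidate
            }
        }
        List<Row> batch = new ArrayList<>(heap);
        batch.sort(ORDER);
        return batch;
    }

    /** Repeats passes until nothing is left, returning all rows in sorted order. */
    static List<Row> scanSorted(long[] segment, int batchSize) {
        List<Row> out = new ArrayList<>();
        Row previousEnd = null;
        while (true) {
            List<Row> batch = nextBatch(segment, previousEnd, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            out.addAll(batch);
            previousEnd = batch.get(batch.size() - 1); // remember the end point
        }
        return out;
    }
}
```

   For example, `BatchedSortSketch.scanSorted(new long[]{5, 3, 5, 1, 3, 5}, 2)` 
takes three non-empty passes and yields rowIds 3, 1, 4, 0, 2, 5: the duplicate 
keys are split across batches, but the rowId tie-break keeps every row emitted 
exactly once. Memory stays bounded by `batchSize`, at the cost of one pass over 
the segment per batch.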
   
   Hopefully this all makes sense...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

