[GitHub] gianm opened a new issue #6088: Scan query: time-ordering

GitBox Wed, 01 Aug 2018 03:03:01 -0700

URL: https://github.com/apache/incubator-druid/issues/6088
 
 
   Currently, the Select query is "better" than Scan in one case: it supports 
time ordering, so you can do queries like "latest 1000 records". But sadly, 
Select queries are known for excessive memory use (see #5006) so we would like 
to replace them with Scan whenever possible.
   
   We could support time-ordering for Scan using an approach like the following:
   
   1. Analyze the segment timeline and start from the first (or last, if 
descending) chunk.
   2. If the limit is "low" (below some reasonable number like 100000, let's 
say) then maintain a priority queue of size = limit, and find the earliest (or 
latest) rows in the current chunk by scanning each segment in turn. If the 
current chunk can fill up the priority queue then we are done. If not then move 
on to the next chunk.
   3. If the limit is "high" then that approach won't work: it will use too 
much memory. Instead, we can do an N-way merge sort of the individual segments 
for the current chunk and send those results back to the client.
   4. If there are too many segments for an N-way merge sort (100s of segments 
in the time chunk) then _that_ approach won't work: it will open up too many 
column selectors at once (each one has overhead: it needs 
decompression/decoding buffers). Instead, we can do a multi-level merge sort on 
disk. This is kind of lame (it will really slow down the query) but it's still 
better than what Select would do, which is crash the machine.
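
   To make (3) a bit more concrete, here is a rough sketch of an N-way merge over the segments of each time chunk. It is written in Python for brevity rather than Druid's Java, and everything in it is hypothetical: the `Row` shape, the function names, and the assumption that each segment yields its rows already sorted by timestamp. It is not the actual Scan query code.

```python
import heapq
from itertools import chain, islice
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical row shape: (timestamp_millis, payload).
Row = Tuple[int, str]


def merge_chunk(segments: List[Iterable[Row]], descending: bool = False) -> Iterator[Row]:
    """Lazily N-way merge the segments of one time chunk by timestamp.

    Assumes every segment already yields rows in time order (ascending, or
    descending when `descending` is True). Only one pending row per segment
    is held in memory at a time.
    """
    return heapq.merge(*segments, key=lambda row: row[0], reverse=descending)


def scan_time_ordered(
    chunks: Iterable[List[Iterable[Row]]], limit: Optional[int] = None
) -> Iterator[Row]:
    """Stream rows chunk by chunk, stopping once `limit` rows were produced.

    `chunks` stands in for the segment timeline and must already be in the
    desired time order; chunks are assumed not to overlap in time.
    """
    merged = chain.from_iterable(merge_chunk(segments) for segments in chunks)
    return islice(merged, limit)
```

   Because the merge is lazy, memory use here depends on the number of segments in a chunk and not on the limit, which is why the number of simultaneously open segments becomes the bottleneck described in (4).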
   
   IMO, implementing only (2) will let people move more workloads to Scan 
(mostly stuff like "find the most recent X rows"), and so it would be a good 
start to just do that by itself.
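
   As a rough illustration of (2), the bounded priority queue could look something like the sketch below. It is in Python for brevity, not Druid's Java, and the `Row` shape, the `earliest_rows` name, and the nested-list stand-in for the segment timeline are all made up for the example. It covers only the ascending case.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical row shape: (timestamp_millis, payload).
Row = Tuple[int, str]


def earliest_rows(chunks: Iterable[List[Iterable[Row]]], limit: int) -> List[Row]:
    """Return the `limit` earliest rows using a priority queue of size `limit`.

    `chunks` stands in for the segment timeline, in ascending time order and
    assumed not to overlap; each chunk is a list of segments, and each
    segment is an iterable of rows in any order.
    """
    if limit <= 0:
        return []

    # heapq is a min-heap, so timestamps are negated: heap[0] is always the
    # latest row currently kept. `seq` breaks ties so rows are never compared.
    heap: List[Tuple[int, int, Row]] = []
    seq = 0
    for segments in chunks:
        for segment in segments:  # scan each segment in turn
            for row in segment:
                if len(heap) < limit:
                    heapq.heappush(heap, (-row[0], seq, row))
                elif -row[0] > heap[0][0]:
                    # Earlier than the latest kept row: swap it in.
                    heapq.heapreplace(heap, (-row[0], seq, row))
                seq += 1
        if len(heap) >= limit:
            # This chunk filled the queue, so later chunks cannot contribute.
            break

    return sorted((entry[2] for entry in heap), key=lambda row: row[0])
```

   Memory is bounded by the limit no matter how many rows or segments get scanned, which is the property that makes this safe for "most recent X rows" style queries.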
   
   Tagging this as "SQL" too, since if we can improve Scan to handle more 
cases, then we should also switch over the Druid SQL planner to use it instead 
of Select in those cases.

