gianm opened a new issue #6088: Scan query: time-ordering URL: https://github.com/apache/incubator-druid/issues/6088 Currently, the Select query is "better" than Scan in one case: it supports time ordering, so you can do queries like "latest 1000 records". But sadly, Select queries are known for excessive memory use (see #5006) so we would like to replace them with Scan whenever possible. We could support time-ordering for Scan using an approach like the following: 1. Analyze the segment timeline and start from the first (or last, if descending) chunk. 2. If the limit is "low" (below some reasonable number like 100000, let's say) then maintain a priority queue of size = limit, and find the earliest (or latest) rows in the current chunk by scanning each segment in turn. If the current chunk can fill up the priority queue then we are done. If not then move on to the next chunk. 3. If the limit is "high" then that approach won't work: it will use too much memory. Instead, we can do an N-way merge sort of the individual segments for the current chunk and send those results back to the client. 4. If there are too many segments for an N-way merge sort (100s of segments in the time chunk) then _that_ approach won't work: it will open up too many column selectors at once (each one has overhead: it needs decompression/decoding buffers). Instead, we can do a multi-level merge sort on disk. This is kind of lame (it will really slow down the query) but it's still better than what Select would do, which is crash the machine. IMO, implementing only (2) will let people move more workloads to Scan (mostly stuff like "find the most recent X rows"), and so it would be a good start to just do that by itself. Tagging this as "SQL" too, since if we can improve Scan to handle more cases, then we should also switch over the Druid SQL planner to use it instead of Select in those cases.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
