xudong963 opened a new issue, #18860: URL: https://github.com/apache/datafusion/issues/18860
Let's see an example: ``` SELECT * FROM t1 WHERE ticker = 'ABT' AND window_start <= (1761364799999999999 AT TIME ZONE 'America/New_York') AND window_start >= (1698465600000000000 AT TIME ZONE 'America/New_York') LIMIT 1 ``` After the kinds of pruning inside DF, we can get some matched row groups. Assume that there are still four row groups remaining after row group pruning by `ticker = 'ABT' AND window_start <= (1761364799999999999 AT TIME ZONE 'America/New_York') AND window_start >= (1698465600000000000 AT TIME ZONE 'America/New_York')` and related statistics in row group metadata. And one (row group 3) of the four is fully matched with the predicates, and the others are not, which are only partially matched. The worst case is that we need to scan rg1, rg2, then find a matched row in rg3. But suppose we know the rg3 is fully matched (currently, DF only supports partial-matched). In that case, we can only check it if the total row count across all the rg3 meets or exceeds the threshold defined by k in the LIMIT clause, then we definitely can reduce the scanned data and skip other rows. If there are no fully matched rg, we can extend limit pruning to the page level, to find the fully matched pages. --- I plan to support row group level limit pruning first, then explore the page level -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
