adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005250264
I did a bit of investigation. Using `hits_partitioned` and some massaging I was able to get the expected result: ```sql SET datafusion.execution.target_partitions = 1; EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10; > EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10; +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------+ | plan_type | plan | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------+ | Plan with Metrics | SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC], preserve_partitioning=[false], filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0 > EF83BC09D0B2D0BBD0B0...], metrics=[output_rows=10, elapsed_compute=1.284534752s, row_replacements=176] | | | CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=1197862, elapsed_compute=43.685164ms] | | | FilterExec: SearchPhrase@0 != , metrics=[output_rows=1197862, elapsed_compute=535.526049ms] | | | DataSourceExec: file_groups={1 group: [[Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_0.parquet:0..122446530, Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_1.parquet:0..174965044, Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_10.parquet:0..101513258, Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_11.parquet:0..118419888, Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_12.parquet:0..149514164, ...]]}, projection=[SearchPhrase], file_type=parquet, predicate=SearchPhrase@0 != AND DynamicFilterPhysicalExpr [ SearchPhrase@0 IS NULL OR SearchPhrase@0 > EF83BC09D0B2D0BBD0B0... ], pruning_predicate=SearchPhrase_null_count@2 != row_count@3 AND (SearchPhrase_min@0 != OR != SearchPhrase_max@1) AND (SearchPhrase_null_count@2 > 0 OR SearchPhrase_null_count@2 != row_count@3 AND SearchPhrase_max@1 > EF83BC09D0B2D0BBD0B0...), required_guarantees= [SearchPhrase not in ()] | | | , metrics=[output_rows=8476907, elapsed_compute=1ns, bytes_scanned=34921389, file_open_errors=0, file_scan_errors=0, files_pruned_statistics=85, num_predicate_creation_errors=0, page_index_rows_matched=0, page_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=22, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=26, bloom_filter_eval_time=904.017µs, metadata_load_time=32.92323ms, page_index_eval_time=3.56µs, row_pushdown_eval_time=200ns, statistics_eval_time=692.725µs, time_elapsed_opening=21.182088ms, time_elapsed_processing=1.370805897s, time_elapsed_scanning_total=3.212635043s, time_elapsed_scanning_until_data=211.536997ms] | | | | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------+ 1 row(s) fetched. Elapsed 3.717 seconds. ``` (**files_pruned_statistics=85**) I had to `ORDER BY "SearchPhrase" DESC`: `ASC` gives no pruning. I don't know much about the `SearchPhrase` column or how the data is laid out but I do think one of the remaining TODOs for the TopK optimization is relevant here: we need to optimize file opening order (based on statistics?) to match the preferred order of the query / sort. I opened <issue> to track. As you'd expect `ASC` took 30s and `DESC` took 3s. @alamb do you think we can close this issue and continue in <issue>? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org