adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005250264

   I did a bit of investigation.
   
   Using `hits_partitioned` and some massaging I was able to get the expected 
result:
   
   ```sql
   SET datafusion.execution.target_partitions = 1;
   EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE 
"SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
   
   > EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE 
"SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------+
   | plan_type         | plan                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                          |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------+
   | Plan with Metrics | SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC], 
preserve_partitioning=[false], filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0 
> EF83BC09D0B2D0BBD0B0...], metrics=[output_rows=10, 
elapsed_compute=1.284534752s, row_replacements=176]                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                  
                          |
   |                   |   CoalesceBatchesExec: target_batch_size=8192, 
metrics=[output_rows=1197862, elapsed_compute=43.685164ms]                      
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                              
                          |
   |                   |     FilterExec: SearchPhrase@0 != , 
metrics=[output_rows=1197862, elapsed_compute=535.526049ms]                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                         
                          |
   |                   |       DataSourceExec: file_groups={1 group: 
[[Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_0.parquet:0..122446530,
 
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_1.parquet:0..174965044,
 
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_10.parquet:0..101513258,
 
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_11.parquet:0..118419888,
 
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_12.parquet:0..149514164,
 ...]]}, projection=[SearchPhrase], file_type=parquet, predicate=SearchPhrase@0 
!=  AND DynamicFilterPhysicalExpr [ SearchPhrase@0 IS NULL OR SearchPhrase@0 > 
EF83BC09D0B2D0BBD0B0... ], pruning_predicate=SearchPhrase_null_count@2 != 
row_count@3 AND (SearchPhrase_min@0 !=  OR  != SearchPhrase_max@1) AND 
(SearchPhrase_null_count@2 > 0 OR SearchPhrase_null_count@2 != row_count@3 AND 
SearchPhrase_max@1 > EF83BC09D0B2D0BBD0B0...), required_guarantees=
 [SearchPhrase not in ()] |
   |                   | , metrics=[output_rows=8476907, elapsed_compute=1ns, 
bytes_scanned=34921389, file_open_errors=0, file_scan_errors=0, 
files_pruned_statistics=85, num_predicate_creation_errors=0, 
page_index_rows_matched=0, page_index_rows_pruned=0, 
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, 
row_groups_matched_bloom_filter=0, row_groups_matched_statistics=22, 
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=26, 
bloom_filter_eval_time=904.017µs, metadata_load_time=32.92323ms, 
page_index_eval_time=3.56µs, row_pushdown_eval_time=200ns, 
statistics_eval_time=692.725µs, time_elapsed_opening=21.182088ms, 
time_elapsed_processing=1.370805897s, time_elapsed_scanning_total=3.212635043s, 
time_elapsed_scanning_until_data=211.536997ms]                                  
                                                                                
                                                                                
             
                             |
   |                   |                                                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                      
                          |
   
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------+
   1 row(s) fetched. 
   Elapsed 3.717 seconds.
   ```
   
   (**files_pruned_statistics=85**)
   
   
   
   I had to `ORDER BY "SearchPhrase" DESC`: `ASC` gives no pruning. I don't 
know much about the `SearchPhrase` column or how the data is laid out but I do 
think one of the remaining TODOs for the TopK optimization is relevant here:  
we need to optimize file opening order (based on statistics?) to match the 
preferred order of the query / sort. I opened <issue> to track.
   
   As you'd expect `ASC` took 30s and `DESC` took 3s.
   
   @alamb do you think we can close this issue and continue in <issue>?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to