adriangb commented on issue #16402:
URL: https://github.com/apache/datafusion/issues/16402#issuecomment-3005250264
I did a bit of investigation.
Using `hits_partitioned` and some massaging I was able to get the expected
result:
```sql
SET datafusion.execution.target_partitions = 1;
EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE
"SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
> EXPLAIN ANALYZE SELECT "SearchPhrase" FROM 'hits_partitioned' WHERE
"SearchPhrase" <> '' ORDER BY "SearchPhrase" DESC LIMIT 10;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------+
| plan_type | plan
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------+
| Plan with Metrics | SortExec: TopK(fetch=10), expr=[SearchPhrase@0 DESC],
preserve_partitioning=[false], filter=[SearchPhrase@0 IS NULL OR SearchPhrase@0
> EF83BC09D0B2D0BBD0B0...], metrics=[output_rows=10,
elapsed_compute=1.284534752s, row_replacements=176]
|
| | CoalesceBatchesExec: target_batch_size=8192,
metrics=[output_rows=1197862, elapsed_compute=43.685164ms]
|
| | FilterExec: SearchPhrase@0 != ,
metrics=[output_rows=1197862, elapsed_compute=535.526049ms]
|
| | DataSourceExec: file_groups={1 group:
[[Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_0.parquet:0..122446530,
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_1.parquet:0..174965044,
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_10.parquet:0..101513258,
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_11.parquet:0..118419888,
Users/adriangb/GitHub/datafusion/benchmarks/data/hits_partitioned/hits_12.parquet:0..149514164,
...]]}, projection=[SearchPhrase], file_type=parquet, predicate=SearchPhrase@0
!= AND DynamicFilterPhysicalExpr [ SearchPhrase@0 IS NULL OR SearchPhrase@0 >
EF83BC09D0B2D0BBD0B0... ], pruning_predicate=SearchPhrase_null_count@2 !=
row_count@3 AND (SearchPhrase_min@0 != OR != SearchPhrase_max@1) AND
(SearchPhrase_null_count@2 > 0 OR SearchPhrase_null_count@2 != row_count@3 AND
SearchPhrase_max@1 > EF83BC09D0B2D0BBD0B0...), required_guarantees=
[SearchPhrase not in ()] |
| | , metrics=[output_rows=8476907, elapsed_compute=1ns,
bytes_scanned=34921389, file_open_errors=0, file_scan_errors=0,
files_pruned_statistics=85, num_predicate_creation_errors=0,
page_index_rows_matched=0, page_index_rows_pruned=0,
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0,
row_groups_matched_bloom_filter=0, row_groups_matched_statistics=22,
row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=26,
bloom_filter_eval_time=904.017µs, metadata_load_time=32.92323ms,
page_index_eval_time=3.56µs, row_pushdown_eval_time=200ns,
statistics_eval_time=692.725µs, time_elapsed_opening=21.182088ms,
time_elapsed_processing=1.370805897s, time_elapsed_scanning_total=3.212635043s,
time_elapsed_scanning_until_data=211.536997ms]
|
| |
|
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------+
1 row(s) fetched.
Elapsed 3.717 seconds.
```
(**files_pruned_statistics=85**)
I had to `ORDER BY "SearchPhrase" DESC`: `ASC` gives no pruning. I don't
know much about the `SearchPhrase` column or how the data is laid out but I do
think one of the remaining TODOs for the TopK optimization is relevant here:
we need to optimize file opening order (based on statistics?) to match the
preferred order of the query / sort. I opened <issue> to track.
As you'd expect `ASC` took 30s and `DESC` took 3s.
@alamb do you think we can close this issue and continue in <issue>?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]