alamb opened a new issue, #4002: URL: https://github.com/apache/arrow-datafusion/issues/4002
**Describe the bug**

When I enable page index filtering, incorrect answers result.

NOTE that page index filtering is not enabled by default (as we are still working on it), so this issue is unlikely to affect users.

**To Reproduce**

1. Download data from [repro.zip](https://github.com/apache/arrow-datafusion/files/9890414/repro.zip)
2. Run datafusion CLI:

**Expected behavior**

The same answer should be produced with and without page index filtering enabled. However, the answers are different.

Without page index filtering, `15963` rows are produced:

```shell
(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=false datafusion-cli -f script.sql
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | false   |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15963           |
+-----------------+
1 row in set. Query took 0.002 seconds.
```

*WITH* page index filtering, `0` rows are produced 😱

```shell
(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true datafusion-cli -f script.sql
DataFusion CLI v13.0.0
0 rows in set. Query took 0.001 seconds.
+-------------------------------------------------+---------+
| name                                            | setting |
+-------------------------------------------------+---------+
| datafusion.execution.batch_size                 | 8192    |
| datafusion.execution.coalesce_batches           | true    |
| datafusion.execution.coalesce_target_batch_size | 4096    |
| datafusion.execution.parquet.enable_page_index  | true    |
| datafusion.execution.parquet.pushdown_filters   | false   |
| datafusion.execution.parquet.reorder_filters    | false   |
| datafusion.execution.time_zone                  | UTC     |
| datafusion.explain.logical_plan_only            | false   |
| datafusion.explain.physical_plan_only           | false   |
| datafusion.optimizer.filter_null_join_keys      | false   |
| datafusion.optimizer.max_passes                 | 3       |
| datafusion.optimizer.skip_failed_rules          | true    |
+-------------------------------------------------+---------+
12 rows in set. Query took 0.001 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 53819           |
+-----------------+
1 row in set. Query took 0.002 seconds.
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.002 seconds.
```

**Additional context**

I found this issue and reproducer while working on the integration test in https://github.com/apache/arrow-datafusion/pull/3976

I suspect @Ted-Jiang is already working on this issue.