alamb opened a new issue, #4002:
URL: https://github.com/apache/arrow-datafusion/issues/4002

   **Describe the bug**
   When I enable page index filtering, queries return incorrect answers.
   
   NOTE: page index filtering is not enabled by default (as we are still working on it), so this issue is unlikely to affect users.
   
   
   **To Reproduce**
   1. Download data from 
[repro.zip](https://github.com/apache/arrow-datafusion/files/9890414/repro.zip)
   2. Run datafusion CLI (the exact commands are shown in the output below).
   
   **Expected behavior**
   The same answer should be produced with and without page index filtering enabled. However, the answers are different.
   
   
   Without page index filtering, `15963` rows are produced:
   
   ```shell
   (arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=false datafusion-cli -f script.sql
   DataFusion CLI v13.0.0
   0 rows in set. Query took 0.001 seconds.
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.parquet.enable_page_index  | false   |
   | datafusion.execution.parquet.pushdown_filters   | false   |
   | datafusion.execution.parquet.reorder_filters    | false   |
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   | datafusion.optimizer.max_passes                 | 3       |
   | datafusion.optimizer.skip_failed_rules          | true    |
   +-------------------------------------------------+---------+
   12 rows in set. Query took 0.001 seconds.
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 53819           |
   +-----------------+
   1 row in set. Query took 0.002 seconds.
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 15963           |
   +-----------------+
   1 row in set. Query took 0.002 seconds.
   
   ```
   
   *WITH* page index filtering, `0` rows are produced 😱
   
   ```shell
   (arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true datafusion-cli -f script.sql
   DataFusion CLI v13.0.0
   0 rows in set. Query took 0.001 seconds.
   +-------------------------------------------------+---------+
   | name                                            | setting |
   +-------------------------------------------------+---------+
   | datafusion.execution.batch_size                 | 8192    |
   | datafusion.execution.coalesce_batches           | true    |
   | datafusion.execution.coalesce_target_batch_size | 4096    |
   | datafusion.execution.parquet.enable_page_index  | true    |
   | datafusion.execution.parquet.pushdown_filters   | false   |
   | datafusion.execution.parquet.reorder_filters    | false   |
   | datafusion.execution.time_zone                  | UTC     |
   | datafusion.explain.logical_plan_only            | false   |
   | datafusion.explain.physical_plan_only           | false   |
   | datafusion.optimizer.filter_null_join_keys      | false   |
   | datafusion.optimizer.max_passes                 | 3       |
   | datafusion.optimizer.skip_failed_rules          | true    |
   +-------------------------------------------------+---------+
   12 rows in set. Query took 0.001 seconds.
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 53819           |
   +-----------------+
   1 row in set. Query took 0.002 seconds.
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 0               |
   +-----------------+
   1 row in set. Query took 0.002 seconds.
   ```
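
   The comparison above could be automated with a small script. The following is only a sketch: it assumes `datafusion-cli` is on the `PATH`, that `script.sql` is in the current directory, and that the final single-column numeric row in the CLI output is the count of interest. The helper names `last_count`, `run_script`, and `page_index_results_match` are hypothetical and not part of DataFusion.

   ```python
   import os
   import re
   import subprocess

   # Matches a table row holding exactly one numeric cell, e.g. "| 15963           |".
   # Two-column rows from the settings table do not match.
   COUNT_ROW = re.compile(r"^\s*\|\s*(\d+)\s*\|\s*$")


   def last_count(cli_output: str) -> int:
       """Return the number in the last single-column numeric row of the output."""
       counts = []
       for line in cli_output.splitlines():
           match = COUNT_ROW.match(line)
           if match:
               counts.append(int(match.group(1)))
       if not counts:
           raise ValueError("no single-column numeric row found in output")
       return counts[-1]


   def run_script(enable_page_index: bool, script: str = "script.sql") -> str:
       """Run datafusion-cli on the script with the page index setting toggled."""
       env = dict(
           os.environ,
           DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=str(enable_page_index).lower(),
       )
       result = subprocess.run(
           ["datafusion-cli", "-f", script],
           env=env,
           capture_output=True,
           text=True,
           check=True,
       )
       return result.stdout


   def page_index_results_match(script: str = "script.sql") -> bool:
       """True if the final count is the same with and without page index filtering."""
       return last_count(run_script(False, script)) == last_count(run_script(True, script))
   ```

   Calling `page_index_results_match()` from the directory containing the reproducer files would be expected to return `False` while this bug is present, since the two runs above report `15963` and `0`.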
   
   **Additional context**
   I found this issue and the reproducer while working on the integration test in https://github.com/apache/arrow-datafusion/pull/3976
   
   I suspect @Ted-Jiang is already working on this issue.

