xhwhis opened a new issue, #18241:
URL: https://github.com/apache/datafusion/issues/18241
### Describe the bug
When `datafusion.execution.collect_statistics = true` is enabled, and
Parquet statistics are only partially collected (i.e., some columns have
statistics disabled), filters using `IS NULL` or `IS NOT NULL` on the
non-statistics columns produce incorrect results.
In this case, queries that should return rows instead return no results,
even though the data clearly exists in the Parquet file.
### To Reproduce
use DataFusion CLI v50.2.0
```
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.001 seconds.
> COPY (
SELECT
column1 as id,
column2 as value,
FROM (VALUES (1, 10), (2, 20), (3, NULL))
)
TO 'test/data.parquet'
OPTIONS (
'statistics_enabled::value' 'none',
);
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.008 seconds.
> CREATE EXTERNAL TABLE t (
id INT,
value INT,
)
STORED AS PARQUET
LOCATION 'test';
0 row(s) fetched.
Elapsed 0.001 seconds.
> select * from t where value is null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.007 seconds.
> select * from 'test/data.parquet' where value is not null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.002 seconds.
> select * from 'test/data.parquet';
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 2 | 20 |
| 3 | NULL |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.
> select * from test;
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 2 | 20 |
| 3 | NULL |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.
select
path_in_schema,stats_min,stats_max,stats_null_count,stats_min_value,stats_max_value
from parquet_metadata('test/data.parquet');
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| path_in_schema | stats_min | stats_max | stats_null_count |
stats_min_value | stats_max_value |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| "id" | 1 | 3 | 0 | 1
| 3 |
| "value" | NULL | NULL | NULL | NULL
| NULL |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
```
### Expected behavior
Filtering with `IS NULL` / `IS NOT NULL` should still correctly evaluate
against the file data even if statistics for a column are missing.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]