adriangb commented on PR #13795: URL: https://github.com/apache/datafusion/pull/13795#issuecomment-2548953054
I wanted to check that the assumption that falling back on min/max stats when row / null counts are missing or unknown is safe _even when all rows are null_ and _we don't know that all of the rows are null_. I was afraid that writers might populate the statistics with a default value (e.g. `0`) instead of `null` when all of the rows are null, and that some other pass might then strip the row count / null count information, resulting in:

| row_count | null_count | min | max |
|-----------|------------|-----|-----|
| null      | null       | 0   | 0   |

I believe both the previous implementation and this implementation would decide to exclude the container for the predicate `col > 1`:

```sql
CASE WHEN null = null THEN false ELSE col_min > 1 END
```

```sql
null != null AND col_min > 1
```

Both of these expressions return `false`, indicating that *we know we can prune the container*. But that would be wrong!

I tested using our writer:

```rust
use std::sync::Arc;

use parquet::arrow::ArrowWriter;

fn main() {
    // A single-column batch containing one null value.
    let array = Arc::new(arrow::array::Int32Array::from(vec![None]));
    let schema = Arc::new(arrow::datatypes::Schema::new(vec![
        arrow::datatypes::Field::new("a", arrow::datatypes::DataType::Int32, true),
    ]));
    let batch = arrow::record_batch::RecordBatch::try_new(schema.clone(), vec![array]).unwrap();

    let mut file = std::fs::File::create("test.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(&mut file, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
}
```

```
❯ parquet meta test.parquet

Row group 0:  count: 1  38.00 B records  start: 4  total(compressed): 38 B  total(uncompressed): 38 B
--------------------------------------------------------------------------------
   type   encodings  count  avg size  nulls  min / max
a  INT32  _ RB_      1      38.00 B   1
```

It does what I think is correct and sets min / max to `null`.
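To make the three-valued-logic argument concrete, here is a minimal sketch (not DataFusion's actual implementation) that models SQL `NULL` as Rust's `Option` and evaluates both rewrites against the hypothetical bad statistics above. The function names `case_rewrite` and `and_rewrite` are made up for illustration:

```rust
/// SQL-style equality over nullable values: NULL = NULL evaluates to NULL, not true.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

/// SQL-style AND: false AND NULL is false; true AND NULL is NULL.
fn sql_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Previous rewrite: CASE WHEN null_count = row_count THEN false ELSE col_min > 1 END
fn case_rewrite(null_count: Option<i64>, row_count: Option<i64>, col_min: Option<i64>) -> Option<bool> {
    match sql_eq(null_count, row_count) {
        Some(true) => Some(false),
        // In SQL, a NULL condition falls through to the ELSE branch.
        _ => col_min.map(|m| m > 1),
    }
}

/// New rewrite: null_count != row_count AND col_min > 1
fn and_rewrite(null_count: Option<i64>, row_count: Option<i64>, col_min: Option<i64>) -> Option<bool> {
    sql_and(sql_eq(null_count, row_count).map(|b| !b), col_min.map(|m| m > 1))
}

fn main() {
    // Hypothetical bad stats: counts missing (NULL), min/max wrongly defaulted to 0.
    let (null_count, row_count, col_min) = (None, None, Some(0));
    // Both rewrites evaluate to false ("no rows can match, prune the container"),
    // which would be incorrect if the container is actually all nulls.
    assert_eq!(case_rewrite(null_count, row_count, col_min), Some(false));
    assert_eq!(and_rewrite(null_count, row_count, col_min), Some(false));
    println!("both rewrites prune: {:?}", case_rewrite(null_count, row_count, col_min));
}
```

The key observation is that `null = null` yields `NULL` rather than `true`, so both rewrites fall back to comparing the defaulted `col_min` against the predicate and confidently (but wrongly) prune.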
If any writer were to set the min / max values to defaults instead of nulls, I think that would be a bug in the writer. And there would have to be a further bug in the writer or some other part of the system to discard, or never record, the fact that `null_count = row_count`. So I think this is not a realistic scenario to hit, even if it theoretically could cause issues.