adriangb commented on PR #13795:
URL: https://github.com/apache/datafusion/pull/13795#issuecomment-2548953054

   I wanted to check that falling back on min/max stats when row/null counts are missing or unknown is safe, _even when all rows are null_ and _we don't know that all of the rows are null_.
   
   I was afraid that writers might populate the statistics with a default value (e.g. `0`) instead of `null` when all of the rows are null, and that some other pass might then discard the row count / null count information, resulting in:
   
   | row_count | null_count | min | max |
   |-----------|------------|-----|-----|
   | null      | null       | 0   | 0   |
   
   I believe both the previous implementation and this implementation would 
decide to exclude the container for the predicate `col > 1`:
   
   ```sql
   CASE WHEN null = null THEN false ELSE col_min > 1 END
   ```
   ```sql
   null != null AND col_min > 1
   ```
   
   In the first expression `null = null` evaluates to `null`, so the `ELSE` branch `col_min > 1` is taken, which is `false` for `col_min = 0`. In the second, `null != null AND col_min > 1` is `null AND false`, which is `false` under three-valued logic. Both expressions therefore return `false`, indicating that *we know we can prune the container*.
   But that would be wrong!
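   
   To make the three-valued logic concrete, here is a minimal Rust sketch of the `AND` case (my own illustration: `Option<bool>` models SQL `null` and the Kleene `AND` is hand-rolled, not DataFusion's actual evaluation code):
   
   ```rust
   /// Kleene (three-valued) AND: `None` models SQL `null`.
   fn and_kleene(a: Option<bool>, b: Option<bool>) -> Option<bool> {
       match (a, b) {
           // false AND anything is false, even if the other side is null.
           (Some(false), _) | (_, Some(false)) => Some(false),
           (Some(true), Some(true)) => Some(true),
           // true AND null, or null AND null, is null.
           _ => None,
       }
   }
   
   fn main() {
       // `null != null` evaluates to null.
       let null_ne_null: Option<bool> = None;
       // With min = max = 0, `col_min > 1` is false.
       let col_min_gt_1 = Some(false);
       // null AND false is false, so the container would (incorrectly,
       // in this hypothetical) be pruned.
       assert_eq!(and_kleene(null_ne_null, col_min_gt_1), Some(false));
   }
   ```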
   
   I tested this using our writer:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Int32Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   
   fn main() {
       // One nullable Int32 column containing a single null row.
       let array: ArrayRef = Arc::new(Int32Array::from(vec![None::<i32>]));
       let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, true)]));
       let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
   
       // Write the batch to a file with default writer properties.
       let mut file = std::fs::File::create("test.parquet").unwrap();
       let mut writer = ArrowWriter::try_new(&mut file, schema, None).unwrap();
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   }
   ```
   
   ```
   ❯ parquet meta test.parquet
   Row group 0:  count: 1  38.00 B records  start: 4  total(compressed): 38 B  total(uncompressed): 38 B
   --------------------------------------------------------------------------------
      type      encodings count     avg size   nulls   min / max
   a  INT32     _ RB_     1         38.00 B    1
   ```
   
   It does what I think is correct and sets min / max to `null`.
   If any writer were to set the min / max values to defaults instead of nulls, I think that would be a bug in the writer.
   And there would have to be a further bug in the writer or another part of the system to discard or never record the fact that `null_count = row_count`.
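   
   The statistics can also be checked programmatically instead of via `parquet meta`. A sketch, assuming a recent `parquet` crate with the `Statistics::*_opt` accessors:
   
   ```rust
   use parquet::file::reader::{FileReader, SerializedFileReader};
   
   fn main() {
       let file = std::fs::File::open("test.parquet").unwrap();
       let reader = SerializedFileReader::new(file).unwrap();
   
       // Statistics for column `a` in the first (and only) row group.
       let stats = reader
           .metadata()
           .row_group(0)
           .column(0)
           .statistics()
           .unwrap();
   
       // All rows are null, and min/max are left unset rather than defaulted.
       assert_eq!(stats.null_count_opt(), Some(1));
       assert!(stats.min_bytes_opt().is_none());
       assert!(stats.max_bytes_opt().is_none());
   }
   ```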
   
   So I think this is not a realistic scenario to hit, even if it theoretically 
could cause issues.
   
   

