korowa opened a new issue, #13821:
URL: https://github.com/apache/datafusion/issues/13821

   ### Describe the bug
   
   Parquet row group pruning by statistics produces incorrect results for columns of type `Dictionary(Decimal)`.
   
   ### To Reproduce
   
   ```rs
   use arrow::array::{ArrayRef, Decimal128Array, DictionaryArray, Int32Array, RecordBatch};
   use datafusion::error::Result;
   use datafusion::prelude::*;
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::{EnabledStatistics, WriterProperties};
   use std::fs::File;
   use std::sync::Arc;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       // Prepare record batch
       let array_values = Decimal128Array::from_iter_values(vec![10, 20, 30])
           .with_precision_and_scale(4, 1)?;
       let array_keys = Int32Array::from_iter_values(vec![0, 1, 2]);
       let array = Arc::new(DictionaryArray::new(array_keys, Arc::new(array_values)));
       let batch = RecordBatch::try_from_iter(vec![("col", array as ArrayRef)])?;
   
       // Write batch to parquet
       let file_path = "dictionary_decimal.parquet";
   
       let file = File::create(file_path)?;
       let properties = WriterProperties::builder()
           .set_statistics_enabled(EnabledStatistics::Chunk)
           .set_bloom_filter_enabled(true)
           .build();
       let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(properties))?;
   
       writer.write(&batch)?;
       writer.flush()?;
       writer.close()?;
   
       // Prepare context
       let config = SessionConfig::default()
           .with_parquet_bloom_filter_pruning(true)
           .with_collect_statistics(true);
       let ctx = SessionContext::new_with_config(config);
   
       ctx.register_parquet("t", file_path, ParquetReadOptions::default())
           .await?;
   
       // When no pruning predicate is created (because of the implicit cast),
       // the result set contains one record
       ctx.sql("select * from t where col = 1")
           .await?
           .show()
           .await?;
   
       println!();
   
       // When row group pruning is triggered, the only row group is eliminated
       // while pruning by statistics
       ctx.sql("select * from t where col = cast(1 as decimal(4, 1))")
           .await?
           .show()
           .await?;
   
       Ok(())
   }
   
   ```
   
   ### Expected behavior
   
   Both queries in the script above should return the same result.
   
   ### Additional context
   
   The problem also occurs with bloom filters (if they are enabled for this type in the pattern-matching expressions in `prune_by_bloom_filters`), so `ArrowWriter` may be producing incorrect metadata (statistics / bloom filters).
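
To make the expected pruning decision concrete, here is a minimal, standard-library-only sketch of min/max pruning for an equality predicate on unscaled decimal values. `RowGroupStats`, `to_unscaled` and `can_prune_eq` are hypothetical names used for illustration only; they are not DataFusion or parquet APIs. The scale-mismatch case at the end is an assumed failure mode shown as one way a wrong decision could arise, not a confirmed diagnosis of this issue.

```rust
/// Min/max statistics of one row group, stored as unscaled decimal values.
/// Hypothetical type for illustration, not a DataFusion or parquet API.
struct RowGroupStats {
    min: i128,
    max: i128,
}

/// Rescale a whole-number literal to the column's decimal scale,
/// e.g. 1 at scale 1 becomes the unscaled value 10 (that is, 1.0).
fn to_unscaled(literal: i128, scale: u32) -> i128 {
    literal * 10_i128.pow(scale)
}

/// For `col = value`, a row group may be skipped only if `value`
/// lies outside the [min, max] range of the statistics.
fn can_prune_eq(stats: &RowGroupStats, value: i128) -> bool {
    value < stats.min || value > stats.max
}

fn main() {
    // Unscaled values 10, 20, 30 at scale 1 represent 1.0, 2.0, 3.0.
    let stats = RowGroupStats { min: 10, max: 30 };

    // cast(1 as decimal(4, 1)) has the unscaled value 10, which is inside
    // [10, 30], so the row group must be kept.
    let literal = to_unscaled(1, 1);
    println!("prune with matching scale: {}", can_prune_eq(&stats, literal));

    // Assumed failure mode: if the literal and the statistics were compared
    // at different scales, the unscaled value 1 would fall below min = 10
    // and the row group would be pruned incorrectly.
    println!("prune with mismatched scale: {}", can_prune_eq(&stats, 1));
}
```

With correct statistics and a correctly scaled literal, the only row group in the reproduction file cannot be pruned, which is why both queries should return the same record.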



