korowa opened a new issue, #13821: URL: https://github.com/apache/datafusion/issues/13821
### Describe the bug

Parquet RowGroup pruning by statistics works incorrectly for the `Dictionary(Decimal)` type.

### To Reproduce

```rs
use arrow;
use arrow::array::{ArrayRef, Decimal128Array, DictionaryArray, Int32Array, RecordBatch};
use datafusion::error::Result;
use datafusion::prelude::*;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use std::fs::File;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    // Prepare record batch
    let array_values = Decimal128Array::from_iter_values(vec![10, 20, 30])
        .with_precision_and_scale(4, 1)?;
    let array_keys = Int32Array::from_iter_values(vec![0, 1, 2]);
    let array = Arc::new(DictionaryArray::new(array_keys, Arc::new(array_values)));
    let batch = RecordBatch::try_from_iter(vec![("col", array as ArrayRef)])?;

    // Write batch to parquet
    let file_path = "dictionary_decimal.parquet";
    let file = File::create(file_path)?;
    let properties = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_bloom_filter_enabled(true)
        .build();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(properties))?;

    writer.write(&batch)?;
    writer.flush()?;
    writer.close()?;

    // Prepare context
    let config = SessionConfig::default()
        .with_parquet_bloom_filter_pruning(true)
        .with_collect_statistics(true);
    let ctx = SessionContext::new_with_config(config);

    ctx.register_parquet("t", file_path, ParquetReadOptions::default())
        .await?;

    // When the pruning predicate is not created (due to the cast),
    // the result set contains a record
    ctx.sql("select * from t where col = 1")
        .await?
        .show()
        .await?;

    println!();

    // When RowGroup pruning is triggered, the only RowGroup is
    // eliminated while pruning by statistics
    ctx.sql("select * from t where col = cast(1 as decimal(4, 1))")
        .await?
        .show()
        .await?;

    Ok(())
}
```

### Expected behavior

The results of both queries in the script above should match.

### Additional context

The problem also occurs with bloom filters (if they are enabled in the pattern-matching expressions in `prune_by_bloom_filters`), so there is a chance that `ArrowWriter` produces incorrect metadata (statistics / bloom filters).
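To make the expected behavior concrete, here is a minimal, simplified model of equality pruning against min/max statistics. It is only a sketch under stated assumptions, not DataFusion's or parquet's actual implementation: `ColumnStats`, `to_unscaled`, and `may_contain_eq` are hypothetical names, and decimals are modeled as unscaled `i128` integers that share the column's scale.

```rust
/// Min/max statistics for one column chunk, stored as unscaled decimal
/// integers (for example, 1.0 at scale 1 is stored as 10).
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i128,
    max: i128,
}

/// Model of `cast(<integer> as decimal(p, scale))`: multiply by 10^scale.
/// Returns `None` on arithmetic overflow.
fn to_unscaled(integer: i128, scale: u32) -> Option<i128> {
    integer.checked_mul(10i128.checked_pow(scale)?)
}

/// Returns `true` if a row group with these statistics may contain a row
/// where `col = literal`, i.e. the row group must NOT be pruned.
/// Assumes `literal` is unscaled and uses the same scale as the statistics.
fn may_contain_eq(stats: ColumnStats, literal: i128) -> bool {
    stats.min <= literal && literal <= stats.max
}
```

In this model the reproduction's column holds the unscaled values 10, 20, 30 at scale 1 (that is, 1.0, 2.0, 3.0), so the statistics are `min = 10`, `max = 30`. The literal `cast(1 as decimal(4, 1))` corresponds to `to_unscaled(1, 1)`, which is 10, and `may_contain_eq` keeps the row group. That matches the expected behavior: the second query should return the same record as the first.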