korowa opened a new issue, #13821:
URL: https://github.com/apache/datafusion/issues/13821
### Describe the bug
Parquet RowGroup pruning by statistics produces incorrect results for columns of
`Dictionary(Decimal)` type: the only RowGroup is pruned even though it contains a
matching record.
### To Reproduce
```rs
use arrow::array::{ArrayRef, Decimal128Array, DictionaryArray, Int32Array, RecordBatch};
use datafusion::error::Result;
use datafusion::prelude::*;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use std::fs::File;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    // Prepare record batch: a dictionary-encoded decimal(4, 1) column
    // holding 1.0, 2.0 and 3.0 (unscaled values 10, 20, 30)
    let array_values = Decimal128Array::from_iter_values(vec![10, 20, 30])
        .with_precision_and_scale(4, 1)?;
    let array_keys = Int32Array::from_iter_values(vec![0, 1, 2]);
    let array = Arc::new(DictionaryArray::new(array_keys, Arc::new(array_values)));
    let batch = RecordBatch::try_from_iter(vec![("col", array as ArrayRef)])?;

    // Write batch to parquet
    let file_path = "dictionary_decimal.parquet";
    let file = File::create(file_path)?;
    let properties = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_bloom_filter_enabled(true)
        .build();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(properties))?;
    writer.write(&batch)?;
    writer.flush()?;
    writer.close()?;

    // Prepare context
    let config = SessionConfig::default()
        .with_parquet_bloom_filter_pruning(true)
        .with_collect_statistics(true);
    let ctx = SessionContext::new_with_config(config);
    ctx.register_parquet("t", file_path, ParquetReadOptions::default())
        .await?;

    // When no pruning predicate is created (due to the cast), the result set
    // contains the record
    ctx.sql("select * from t where col = 1")
        .await?
        .show()
        .await?;
    println!();

    // When RowGroup pruning is triggered, the only RowGroup is eliminated
    // while pruning by statistics
    ctx.sql("select * from t where col = cast(1 as decimal(4, 1))")
        .await?
        .show()
        .await?;

    Ok(())
}
```
### Expected behavior
Both queries in the script above should return the same result.
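
To make the expectation concrete, here is a minimal, standard-library-only sketch of how a min/max check for an equality predicate should treat the data from the reproduction. `RowGroupStats`, `may_contain` and `to_unscaled` are hypothetical names used for illustration only; they are not DataFusion or parquet APIs. The "mismatched" case is just one way the observed symptom could arise and is not a diagnosed root cause.

```rust
/// Min/max statistics of one row group, kept as unscaled decimal integers
/// (for example, 1.0 at scale 1 is stored as 10). Hypothetical type.
struct RowGroupStats {
    min: i128,
    max: i128,
}

/// Returns true if the row group may contain `value`, i.e. it must be kept
/// for an equality predicate `col = value`.
fn may_contain(stats: &RowGroupStats, value: i128) -> bool {
    stats.min <= value && value <= stats.max
}

/// Converts a whole number to its unscaled representation at `scale`.
fn to_unscaled(whole: i128, scale: u32) -> i128 {
    whole * 10_i128.pow(scale)
}

fn main() {
    // Values from the reproduction: unscaled 10, 20, 30 at scale 1.
    let stats = RowGroupStats { min: 10, max: 30 };

    // `cast(1 as decimal(4, 1))` has the unscaled value 10, which lies
    // inside [10, 30], so the row group has to be kept.
    let literal = to_unscaled(1, 1);
    println!("keep (same scale): {}", may_contain(&stats, literal));

    // Assumed, illustrative mismatch: comparing the raw value 1 against
    // unscaled statistics would wrongly discard the row group.
    println!("keep (mismatched): {}", may_contain(&stats, 1));
}
```

With consistent scales the row group is kept, which is the behavior both queries in the script should exhibit.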
### Additional context
The problem also occurs with bloom filters (once they are enabled in the
pattern-matching expressions in `prune_by_bloom_filters`), so it is possible that
`ArrowWriter` produces incorrect metadata (statistics / bloom filters).