adriangb opened a new issue, #13574:
URL: https://github.com/apache/datafusion/issues/13574
### Describe the bug
Tested on a parquet file with a single `Dictionary(UInt32, Utf8)` column and
bloom filters, datafusion-cli won't use them to prune.
### To Reproduce
```rust
use arrow::array::RecordBatch;
use arrow_schema::DataType;
use bytes::{BufMut, BytesMut};
use datafusion::datasource::schema_adapter::DefaultSchemaAdapterFactory;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use std::sync::Arc;
async fn write_record_batch(data_type: &DataType, suffix: &str) {
let schema =
Arc::new(arrow::datatypes::Schema::new(vec![arrow::datatypes::Field::new(
"column",
data_type.clone(),
false,
)]));
let batch = RecordBatch::try_new(
Arc::new(arrow::datatypes::Schema::new(vec![arrow::datatypes::Field::new(
"column",
DataType::Utf8,
false,
)])),
vec![Arc::new(arrow::array::StringArray::from(vec!["Hello,
World!"]))],
)
.unwrap();
let batch = DefaultSchemaAdapterFactory::from_schema(schema)
.map_schema(batch.schema().as_ref())
.unwrap()
.0
.map_batch(batch)
.unwrap();
let mut buf = BytesMut::new().writer();
let schema = batch.schema();
let props = WriterProperties::builder()
.set_bloom_filter_enabled(true)
.set_statistics_enabled(parquet::file::properties::EnabledStatistics::None)
.build();
{
let mut writer = ArrowWriter::try_new(&mut buf, schema,
Some(props)).unwrap();
writer.write(&batch).unwrap();
writer.finish().unwrap();
}
tokio::fs::write(format!("hello_world{}.parquet", suffix),
buf.into_inner().freeze())
.await
.unwrap();
}
#[tokio::main]
async fn main() {
// Create a RecordBatch with a single string column with a single row
containing "Hello, World!"
let data_type = arrow::datatypes::DataType::Utf8;
write_record_batch(&data_type, "_plain").await;
let data_type = arrow::datatypes::DataType::Dictionary(
Box::new(arrow::datatypes::DataType::Int32),
Box::new(arrow::datatypes::DataType::Utf8),
);
write_record_batch(&data_type, "_dict").await;
let data_type = arrow::datatypes::DataType::Utf8View;
write_record_batch(&data_type, "_view").await;
}
```
You can now verify with `parquet bloom-filter -c column -v 'Not Hello,
World!' hello_world_dict.parquet` that the value can be pruned via bloom
fitlers.
But querying it with datafusion-cli using `explain analyze select * from
'hello_world_dict.parquet' where column = 'Not Hellow, World!';` confirms that
bloom filters aren't used. They are used for `_plain` or `_view`.
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]