XiangpengHao commented on issue #10921:
URL: https://github.com/apache/datafusion/issues/10921#issuecomment-2177420089
It seems that the current filter and Parquet exec already play nicely
together: the following test filters directly on the string view array
(instead of first converting it to a `StringArray`), which is quite nice.
(The test case requires the latest arrow-rs to run.)
```rust
// Import paths below are assumed; adjust them to the crate layout in use.
use std::fs;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, StringViewArray};
use arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;
use parquet::arrow::ArrowWriter;
use tempfile::TempDir;

#[tokio::test]
async fn parquet_read_filter_string_view() {
    let tmp_dir = TempDir::new().unwrap();

    // One short value, one null, and one value longer than 12 bytes.
    let values = vec![Some("small"), None, Some("Larger than 12 bytes array")];
    let c1: ArrayRef = Arc::new(StringViewArray::from_iter(values.iter()));
    let c2: ArrayRef = Arc::new(StringArray::from_iter(values.iter()));
    let batch =
        RecordBatch::try_from_iter(vec![("c1", c1.clone()), ("c2", c2.clone())]).unwrap();

    // Write the batch to a single Parquet file inside the temp directory.
    let file_name = {
        let table_dir = tmp_dir.path().join("parquet_test");
        fs::create_dir(&table_dir).unwrap();
        let file_name = table_dir.join("part-0.parquet");
        let mut writer = ArrowWriter::try_new(
            fs::File::create(&file_name).unwrap(),
            batch.schema(),
            None,
        )
        .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
        file_name
    };

    let ctx = SessionContext::new();
    ctx.register_parquet("t", file_name.to_str().unwrap(), Default::default())
        .await
        .unwrap();

    // Run a query, then print the result batches and their schemas.
    async fn display_result(sql: &str, ctx: &SessionContext) {
        let result = ctx.sql(sql).await.unwrap().collect().await.unwrap();
        arrow::util::pretty::print_batches(&result).unwrap();
        for b in result {
            println!("schema: {:?}", b.schema());
        }
    }

    display_result("SELECT * from t", &ctx).await;
    // Filter on the StringViewArray column, then on the StringArray column.
    display_result("SELECT * from t where c1 <> 'small'", &ctx).await;
    display_result("SELECT * from t where c2 <> 'small'", &ctx).await;
}
```
I'll take a closer look at the generated logical/physical plans to verify
that the string view array is never converted to a string array. If that
is the case, the remaining work for this issue is probably (1) adding an
option to force reading `StringArray` as `StringViewArray`, and (2) adding
more tests, potentially including one that checks that the generated plan
uses `StringViewArray` consistently.
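
As a side note on why the test mixes "small" with a value longer than 12
bytes: in the Arrow view layout, each string is described by a 16-byte
view, and strings of at most 12 bytes are stored inline in that view while
longer ones keep a 4-byte prefix plus a reference into a data buffer. The
following is only a rough std-only sketch of that rule, not the arrow-rs
implementation; `View`, `make_view`, and `is_inline` are hypothetical names
made up for illustration.

```rust
/// Simplified model of a 16-byte string view (illustrative only).
#[derive(Debug, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a data buffer.
    Ref { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Longest string (in bytes) that fits inline in a view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical helper: builds a view for `s`, appending the bytes of
/// long strings to `buffer` (identified by `buffer_index`).
fn make_view(s: &str, buffer_index: u32, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= MAX_INLINE_LEN {
        // Short string: copy it into the view, zero-padding the rest.
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        // Long string: remember where it starts, then append it.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Ref { len, prefix, buffer_index, offset }
    }
}

/// Returns true when the string is stored inline in the view.
fn is_inline(view: &View) -> bool {
    matches!(view, View::Inline { .. })
}

fn main() {
    let mut buffer = Vec::new();
    for s in ["small", "Larger than 12 bytes array"] {
        let view = make_view(s, 0, &mut buffer);
        println!("{:?} inline={}", s, is_inline(&view));
    }
}
```

With these two values, the test exercises both the inline path and the
buffer-backed path of a string view column.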