XiangpengHao commented on issue #10921: URL: https://github.com/apache/datafusion/issues/10921#issuecomment-2177420089
It seems that the current filter and parquet exec play nicely: the following code filters directly on the string view array (instead of converting it to `StringArray`), which is quite nice. (The test case requires the latest arrow-rs to run.)

```rust
// Imports assumed for running this as a standalone test;
// adjust the paths to match the crate the test lives in.
use std::fs;
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, StringViewArray};
use arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;
use parquet::arrow::ArrowWriter;
use tempfile::TempDir;

#[tokio::test]
async fn parquet_read_filter_string_view() {
    let tmp_dir = TempDir::new().unwrap();

    // Same values in both columns: c1 as StringViewArray, c2 as StringArray.
    let values = vec![Some("small"), None, Some("Larger than 12 bytes array")];
    let c1: ArrayRef = Arc::new(StringViewArray::from_iter(values.iter()));
    let c2: ArrayRef = Arc::new(StringArray::from_iter(values.iter()));

    let batch =
        RecordBatch::try_from_iter(vec![("c1", c1.clone()), ("c2", c2.clone())]).unwrap();

    // Write the batch to a single parquet file in a temporary directory.
    let file_name = {
        let table_dir = tmp_dir.path().join("parquet_test");
        fs::create_dir(&table_dir).unwrap();
        let file_name = table_dir.join("part-0.parquet");
        let mut writer = ArrowWriter::try_new(
            fs::File::create(&file_name).unwrap(),
            batch.schema(),
            None,
        )
        .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
        file_name
    };

    let ctx = SessionContext::new();
    ctx.register_parquet("t", file_name.to_str().unwrap(), Default::default())
        .await
        .unwrap();

    // Run a query, then print the result batches and their schemas.
    async fn display_result(sql: &str, ctx: &SessionContext) {
        let result = ctx.sql(sql).await.unwrap().collect().await.unwrap();
        arrow::util::pretty::print_batches(&result).unwrap();
        for b in result {
            println!("schema: {:?}", b.schema());
        }
    }

    display_result("SELECT * from t", &ctx).await;
    display_result("SELECT * from t where c1 <> 'small'", &ctx).await;
    display_result("SELECT * from t where c2 <> 'small'", &ctx).await;
}
```

I'll take a closer look at the generated logical/physical plans to verify that the string view array is never converted to a string array. If that is the case, the remaining work for this issue is probably to (1) add an option to force reading `StringArray` as `StringViewArray`, and (2) add more tests, potentially checking that the generated plan uses `StringViewArray` consistently.
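For readers wondering about the test values: they presumably cover both layouts of a string view. In the Arrow format, a view is 16 bytes, and strings of at most 12 bytes are stored inline, while longer ones keep a 4-byte prefix plus a buffer index and offset. Below is a rough, std-only sketch of that idea. The `View` enum, `make_view`, and `view_ne` are hypothetical names for illustration only; this is not the arrow-rs representation or how its comparison kernels are written.

```rust
/// Hypothetical stand-in for one 16-byte string view (not the arrow-rs type).
#[derive(Debug, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a data buffer.
    Buffer { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

const MAX_INLINE_LEN: usize = 12;

/// Encode `s` as a view, appending out-of-line bytes to `buffer`.
/// Assumption: a single data buffer, so `buffer_index` is always 0.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= MAX_INLINE_LEN {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Buffer { len, prefix, buffer_index: 0, offset }
    }
}

/// Evaluate `view <> literal`, reading the data buffer only when the
/// length and prefix cannot already decide the answer.
fn view_ne(view: &View, buffer: &[u8], literal: &str) -> bool {
    let lit = literal.as_bytes();
    match view {
        View::Inline { len, data } => &data[..*len as usize] != lit,
        View::Buffer { len, prefix, offset, .. } => {
            // Equal lengths here imply the literal is longer than 12 bytes,
            // so slicing its first 4 bytes is safe.
            if *len as usize != lit.len() || prefix[..] != lit[..4] {
                return true;
            }
            let start = *offset as usize;
            &buffer[start..start + *len as usize] != lit
        }
    }
}
```

With the values from the test, `"small"` ends up inline and `"Larger than 12 bytes array"` goes to the buffer, so a filter like `c1 <> 'small'` can be decided for both rows from the view alone in this sketch.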