XiangpengHao commented on issue #10921:
URL: https://github.com/apache/datafusion/issues/10921#issuecomment-2177420089

   It seems that the current filter and parquet exec play nicely together: the 
following code filters directly on the string view array (instead of 
converting it to `StringArray`), which is quite nice.
   
   (the test case requires the latest arrow-rs to run)
   ```rust
   use std::fs;
   use std::sync::Arc;

   use arrow::array::{ArrayRef, StringArray, StringViewArray};
   use arrow::record_batch::RecordBatch;
   use datafusion::prelude::SessionContext;
   use parquet::arrow::ArrowWriter;
   use tempfile::TempDir;

   #[tokio::test]
   async fn parquet_read_filter_string_view() {
       let tmp_dir = TempDir::new().unwrap();

       // One short value, one null, and one value longer than 12 bytes.
       let values = vec![Some("small"), None, Some("Larger than 12 bytes array")];
       let c1: ArrayRef = Arc::new(StringViewArray::from_iter(values.iter()));
       let c2: ArrayRef = Arc::new(StringArray::from_iter(values.iter()));

       let batch =
           RecordBatch::try_from_iter(vec![("c1", c1.clone()), ("c2", c2.clone())]).unwrap();

       // Write the batch to a single parquet file inside the temp directory.
       let file_name = {
           let table_dir = tmp_dir.path().join("parquet_test");
           fs::create_dir(&table_dir).unwrap();
           let file_name = table_dir.join("part-0.parquet");
           let mut writer = ArrowWriter::try_new(
               fs::File::create(&file_name).unwrap(),
               batch.schema(),
               None,
           )
           .unwrap();
           writer.write(&batch).unwrap();
           writer.close().unwrap();
           file_name
       };

       let ctx = SessionContext::new();
       ctx.register_parquet("t", file_name.to_str().unwrap(), Default::default())
           .await
           .unwrap();

       // Run a query, then print the result batches and their schemas.
       async fn display_result(sql: &str, ctx: &SessionContext) {
           let result = ctx.sql(sql).await.unwrap().collect().await.unwrap();

           arrow::util::pretty::print_batches(&result).unwrap();

           for b in result {
               println!("schema: {:?}", b.schema());
           }
       }

       display_result("SELECT * from t", &ctx).await;
       // Filter on the string view column (c1), then on the plain string column (c2).
       display_result("SELECT * from t where c1 <> 'small'", &ctx).await;
       display_result("SELECT * from t where c2 <> 'small'", &ctx).await;
   }
   ```
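
For context on why the test mixes a short value with one longer than 12 bytes: in the Arrow view layout, strings of at most 12 bytes are stored inline in the 16-byte view, while longer ones keep a 4-byte prefix plus a buffer index and offset. The sketch below is a simplified, std-only model of that layout and of how a `<>` comparison can often be decided from the length and prefix alone. The `View` enum, `make_view`, and `not_equal` are hypothetical names for illustration, not the arrow-rs API, and the sketch assumes a single data buffer.

```rust
/// Simplified model of one 16-byte string view entry (illustrative only;
/// hypothetical type, not the arrow-rs representation).
#[derive(Debug, Clone, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a data buffer.
    Reference { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

const MAX_INLINE_LEN: usize = 12;

/// Build a view for `s`, appending long strings to `buffer`.
/// Assumption: there is only one data buffer, so `buffer_index` is always 0.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= MAX_INLINE_LEN {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Reference { len, prefix, buffer_index: 0, offset }
    }
}

/// Evaluate `value <> literal` for a single non-null view.
fn not_equal(view: &View, buffer: &[u8], literal: &str) -> bool {
    let lit = literal.as_bytes();
    match view {
        // Inline values are compared without touching the data buffer.
        View::Inline { len, data } => &data[..*len as usize] != lit,
        View::Reference { len, prefix, offset, .. } => {
            let len = *len as usize;
            // Fast path: a different length or prefix settles the comparison.
            if len != lit.len() || prefix[..] != lit[..4] {
                return true;
            }
            // Slow path: compare the full bytes stored in the buffer.
            let start = *offset as usize;
            &buffer[start..start + len] != lit
        }
    }
}

fn main() {
    let mut buffer = Vec::new();
    let values = [Some("small"), None, Some("Larger than 12 bytes array")];
    let views: Vec<Option<View>> = values
        .iter()
        .map(|v| v.map(|s| make_view(s, &mut buffer)))
        .collect();

    for (value, view) in values.iter().zip(&views) {
        // A null input yields a null (None) predicate result, as in SQL.
        let keep = view.as_ref().map(|v| not_equal(v, &buffer, "small"));
        println!("{:?} -> {:?}", value, keep);
    }
}
```

With the three test values, only the long string passes the `<> 'small'` predicate: the short value compares equal, and the null stays null and is filtered out.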
   
   I'll take a closer look at the generated logical/physical plans to verify 
that the string view array is never converted to a string array. If that is 
the case, the remaining work for this issue is probably to (1) add an option 
to force reading `StringArray` as `StringViewArray`, and (2) add more tests, 
potentially including one that checks the generated plan uses 
`StringViewArray` consistently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

