goldmedal commented on issue #12788:
URL: https://github.com/apache/datafusion/issues/12788#issuecomment-2402885802

> BTW thinking more about this, I do think we need to support the cast, but in this PR we should effectively change the _file_ schema (not just the table schema) when we setup the parquet reader (specifically with [`ArrowReaderOptions::with_schema`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema))

This idea sounds great. If we can apply the new schema when reading the file, we can save one cast and just read the values as strings directly.
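
For context, this is the extra pass that `with_schema` would let us skip. A minimal sketch of the current read-then-cast path using arrow's `cast` kernel (illustrative only; the array here stands in for whatever the reader returns):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    // What the reader hands back today: a Binary column.
    let binary: ArrayRef =
        Arc::new(BinaryArray::from(vec![b"one".as_ref(), b"two".as_ref()]));

    // The second pass we would like to avoid: validating and casting to Utf8.
    let utf8 = cast(&binary, &DataType::Utf8).unwrap();
    assert_eq!(utf8.data_type(), &DataType::Utf8);
}
```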
   
I tried to follow the StringView implementation to apply the new schema using `with_schema`, but I got a casting error:
```
Parquet error: Arrow: incompatible arrow schema, the following fields could not be cast:
```
   
I can reproduce this error on the arrow-rs side by adding a test case in `parquet/src/arrow/arrow_reader/mod.rs`:
```rust
    #[test]
    fn test_cast_binary_utf8() {
        // Write a Parquet file containing a single Binary column.
        let file = write_parquet_from_iter(vec![(
            "binary_to_utf8",
            Arc::new(BinaryArray::from(vec![b"one".as_ref(), b"two".as_ref()])) as ArrayRef,
        )]);

        // Supply a schema that declares the same column as Utf8.
        let supplied_fields = Fields::from(vec![Field::new(
            "binary_to_utf8",
            ArrowDataType::Utf8,
            false,
        )]);

        let options = ArrowReaderOptions::new()
            .with_schema(Arc::new(Schema::new(supplied_fields)));
        let mut arrow_reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
            file.try_clone().unwrap(),
            options,
        )
        .expect("reader builder with schema")
        .build()
        .expect("reader with schema");

        let batch = arrow_reader.next().unwrap().unwrap();
        assert_eq!(batch.num_columns(), 1);
        assert_eq!(batch.num_rows(), 2);
        assert_eq!(
            batch
                .column(0)
                .as_any()
                .downcast_ref::<StringArray>()
                .expect("downcast to string")
                .iter()
                .collect::<Vec<_>>(),
            vec![Some("one"), Some("two")]
        );
    }
```
The output is:
```
reader builder with schema: ArrowError("incompatible arrow schema, the following fields could not be cast: [binary_to_utf8]")
```
   
As far as I can tell, the `(Binary, Utf8)` pair isn't covered by the hint matching there, so the reader falls back to the file-inferred `Binary` type and the later schema compatibility check fails. I tried to fix it by adding one more pattern match at

https://github.com/apache/arrow-rs/blob/5508978a3c5c4eb65ef6410e097887a8adaba38a/parquet/src/arrow/schema/primitive.rs#L40

```rust
        (DataType::Binary, DataType::Utf8) => hint,
```

It works, but I'm not entirely sure this approach makes sense 🤔
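
If I read `primitive.rs` correctly, that match lives in `apply_hint`. A sketch of where the proposed arm would sit, with the other arms elided and their exact shapes assumed:

```rust
use arrow_schema::DataType;

/// Apply a hint from the supplied schema to the type inferred from Parquet
/// (sketch only; the real function has many more arms).
fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
    match (&parquet, &hint) {
        // Proposed: let a supplied Utf8 hint override a file-inferred Binary,
        // so the reader decodes the column as strings directly.
        (DataType::Binary, DataType::Utf8) => hint,
        // ... existing arms for timestamps, decimals, large types, etc. ...
        // Anything not listed falls back to the file-inferred type, which is
        // what currently triggers the "incompatible arrow schema" error.
        _ => parquet,
    }
}
```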

