Blizzara opened a new issue, #13437:
URL: https://github.com/apache/datafusion/issues/13437

   ### Describe the bug
   
   When constructing a DF plan from Substrait, we confirm that the data types 
of the columns DF sees match what the Substrait plan expects. However, the 
check can fail if the inner field name of a List type differs:
   
   ```
   "Field 'categories' in Substrait schema has a different type (List(Field { 
name: \"item\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: 
false, metadata: {} })) than the corresponding field in the table schema 
(List(Field { name: \"element\", data_type: Utf8, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} })).
   
   backtrace:    0: <core::iter::adapters::GenericShunt<I,R> as 
core::iter::traits::iterator::Iterator>::next
      1: 
datafusion_substrait::logical_plan::consumer::ensure_schema_compatability
      2: 
datafusion_substrait::logical_plan::consumer::from_substrait_rel::{{closure}}
   ```
   
   This happens because we name the inner field "item" (specifically, we 
[use](https://github.com/apache/datafusion/blob/8c352708d06c0bde7f9e92cda06efd69b50f16f0/datafusion/substrait/src/logical_plan/consumer.rs#L1946)
 `Field::new_list_field`, which names it "item"), but Arrow doesn't mandate that 
name. Spark's `toArrowSchema` uses "element" instead: 
https://github.com/apache/spark/blob/11e47064d0c73ab4fc6c960153845b45356db20f/sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala#L114
   
   For Map columns, Spark's toArrowSchema seems to use the same name as we use 
here, so it doesn't collide - but Arrow doesn't mandate that naming either so 
it's liable to fail for someone somewhere.
   
   I can think of at least two ways to fix this:
   a) normalize the names in consumer.rs before comparing
   b) normalize the names in dfschema.rs::datatype_is_logically_equal before 
comparing
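
A rough sketch of what either option amounts to: compare the types recursively but skip the name of a List's inner field. The `DataType` and `Field` types below are simplified, hypothetical stand-ins (not the real arrow-rs types), and `types_match_ignoring_list_names` is an illustrative name, not an existing DataFusion function. Whether nullability of the inner field should also be compared is an assumption made here.

```rust
// Hypothetical, simplified stand-ins for Arrow's DataType and Field.
// These are NOT the arrow-rs types; they only model what the comparison needs.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

/// Returns true when two types are equal apart from the names given to
/// List inner fields ("item", "element", ...). Nested lists are handled
/// recursively. Comparing nullability is an assumption of this sketch.
fn types_match_ignoring_list_names(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(fa), DataType::List(fb)) => {
            fa.nullable == fb.nullable
                && types_match_ignoring_list_names(&fa.data_type, &fb.data_type)
        }
        // Everything else falls back to plain equality.
        _ => a == b,
    }
}

fn main() {
    // What the Substrait consumer produces vs. what Spark's schema contains.
    let substrait_side = DataType::List(Box::new(Field::new("item", DataType::Utf8, true)));
    let spark_side = DataType::List(Box::new(Field::new("element", DataType::Utf8, true)));

    // Strict equality fails because the inner field names differ...
    assert_ne!(substrait_side, spark_side);
    // ...while the name-insensitive comparison accepts the pair.
    assert!(types_match_ignoring_list_names(&substrait_side, &spark_side));

    // A genuinely different element type is still rejected.
    let other = DataType::List(Box::new(Field::new("item", DataType::Int64, true)));
    assert!(!types_match_ignoring_list_names(&substrait_side, &other));
}
```

Option a) would apply this kind of comparison only in the Substrait consumer, while option b) would change the behavior for every caller of the logical-equality check.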
   
   Thoughts? cc @vbarua 
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

