kadinrabo opened a new issue, #19944:
URL: https://github.com/apache/datafusion/issues/19944

   DataFusion encodes Arrow-specific types (like unsigned integers) by misusing 
[`type_variation_reference`](https://github.com/apache/datafusion/blob/e6fc5160312481f7df8da3d69321350f81238e78/datafusion/substrait/src/logical_plan/producer/types.rs#L68-L77).
 This violates Substrait's [technology 
principle](https://substrait.io/spec/technology_principles/) to avoid 
specialization for a single producer.
   
   Per the [spec](https://substrait.io/types/type_variations/), 
`type_variation_reference` is for physical variations of the same type where 
"all variations are expected to have the same semantics." Signed and unsigned 
integers have different semantics.
   
   Types affected:
   - UInt8/16/32/64
   - LargeUtf8/LargeBinary/LargeList
   - Decimal256
   - Duration
   - Date64
   - Time32
   - Time64
   
   ## Solution
   
   Emit these as user-defined types from Arrow's official 
[extension_types.yaml](https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml),
 which already defines them (u8, u16, large_string, decimal256, etc.).
   
   **Before:**
   ```
   Type::I8 { type_variation_reference: 1 }  // means UInt8
   ```
   
   **After:**
   ```
   extension_uris: [{ uri: ".../extension_types.yaml" }]
   Type::UserDefined { name: "u8" }
   ```
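
To make the producer-side change concrete, here is a minimal, self-contained sketch of the lookup it implies. It does not use the real `substrait` or `datafusion` crates: `ArrowType`, `ARROW_EXTENSION_URI`, and `extension_type_name` are hypothetical names invented for illustration, and only the extension names spelled out in this issue (`u8`, `u16`, `large_string`, `decimal256`) are included. Names for the remaining affected types should be taken from the YAML file rather than guessed.

```rust
/// Hypothetical stand-in for a few of the Arrow types listed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrowType {
    Int8,
    UInt8,
    UInt16,
    LargeUtf8,
    Decimal256,
}

/// Placeholder only: the issue elides the full URI, so it is kept as written.
pub const ARROW_EXTENSION_URI: &str = ".../extension_types.yaml";

/// Returns the extension type name for an Arrow-specific type, or `None`
/// when the type is assumed to map to a built-in Substrait type instead.
pub fn extension_type_name(ty: ArrowType) -> Option<&'static str> {
    match ty {
        ArrowType::UInt8 => Some("u8"),
        ArrowType::UInt16 => Some("u16"),
        ArrowType::LargeUtf8 => Some("large_string"),
        ArrowType::Decimal256 => Some("decimal256"),
        // Signed 8-bit integers already exist in Substrait as `i8`.
        ArrowType::Int8 => None,
    }
}
```

A producer built along these lines would emit a user-defined type plus the extension URI whenever the lookup returns `Some`, and fall back to the built-in type with no type variation otherwise.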
   
   The consumer already handles extension types, so backwards compatibility can 
be maintained.
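
One possible reading of the compatibility note above is that the consumer keeps accepting the legacy type-variation encoding while also accepting the extension type. The sketch below illustrates that idea for `UInt8` only. `WireType`, `Decoded`, and `decode` are hypothetical simplifications, not the DataFusion consumer API, and treating variation `0` as plain `i8` is an assumption based on Substrait's convention that `0` means no variation.

```rust
/// Hypothetical, simplified view of the two encodings discussed in this issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WireType {
    /// Built-in `i8`; the legacy encoding uses variation 1 to mean UInt8.
    I8 { type_variation_reference: u32 },
    /// User-defined type, already resolved to its name in the extension file.
    UserDefined { name: String },
}

/// Hypothetical decoded result on the consumer side.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decoded {
    Int8,
    UInt8,
}

/// Accepts both the legacy and the extension-based encoding of UInt8.
pub fn decode(ty: &WireType) -> Result<Decoded, String> {
    match ty {
        // Assumption: 0 is the default (no variation), i.e. a signed i8.
        WireType::I8 { type_variation_reference: 0 } => Ok(Decoded::Int8),
        // Legacy encoding from the "Before" example.
        WireType::I8 { type_variation_reference: 1 } => Ok(Decoded::UInt8),
        WireType::I8 { type_variation_reference: v } => {
            Err(format!("unknown i8 type variation {v}"))
        }
        // New encoding from the "After" example.
        WireType::UserDefined { name } => match name.as_str() {
            "u8" => Ok(Decoded::UInt8),
            other => Err(format!("unsupported extension type {other}")),
        },
    }
}
```

With this shape, plans written by older producers and plans written with the extension type decode to the same result, so the legacy branch could be dropped later without touching the new one.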


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

