kadinrabo opened a new issue, #19944: URL: https://github.com/apache/datafusion/issues/19944

DataFusion encodes Arrow-specific types (like unsigned integers) by misusing [`type_variation_reference`](https://github.com/apache/datafusion/blob/e6fc5160312481f7df8da3d69321350f81238e78/datafusion/substrait/src/logical_plan/producer/types.rs#L68-L77). This violates Substrait's [technology principle](https://substrait.io/spec/technology_principles/) to avoid specialization for a single producer.

Per the [spec](https://substrait.io/types/type_variations/), `type_variation_reference` is for physical variations of the same type where "all variations are expected to have the same semantics." Signed and unsigned integers have different semantics.

Types affected:

- UInt8/16/32/64
- LargeUtf8/LargeBinary/LargeList
- Decimal256
- Duration
- Date64
- Time32
- Time64

## Solution

Use Arrow's official [extension_types.yaml](https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml), which already defines these types (u8, u16, large_string, decimal256, etc.).

**Before:**

```
Type::I8 { type_variation_reference: 1 } // means UInt8
```

**After:**

```
extension_uris: [{ uri: ".../extension_types.yaml" }]
Type::UserDefined { name: "u8" }
```

The consumer already handles extension types, so backwards compatibility can be maintained.
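
To make the proposal concrete, here is a minimal sketch of the mapping. It is not DataFusion's actual producer or consumer code: `ArrowType`, `SubstraitType`, `produce`, and `consume` are hypothetical stand-ins rather than the real `arrow` or `substrait` crate types. The URI is kept elided as in the snippet above, and the sketch stores the URI directly on the type instead of going through Substrait's extension anchors. Only the extension names mentioned in this issue (`u8`, `u16`, `large_string`, `decimal256`) are used.

```rust
// Hypothetical, simplified stand-ins; NOT the real `arrow` or `substrait` crate types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrowType {
    Int8,
    UInt8,
    UInt16,
    LargeUtf8,
    Decimal256,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum SubstraitType {
    I8 { type_variation_reference: u32 },
    UserDefined { uri: String, name: String },
}

// Elided exactly as in the issue; a real plan would carry the full URI.
const ARROW_EXTENSION_URI: &str = ".../extension_types.yaml";

/// Producer side: Arrow-only types become user-defined extension types
/// instead of built-in types with a non-zero variation reference.
fn produce(t: ArrowType) -> SubstraitType {
    let name = match t {
        // Int8 has a native Substrait equivalent, so no extension is needed.
        ArrowType::Int8 => return SubstraitType::I8 { type_variation_reference: 0 },
        ArrowType::UInt8 => "u8",
        ArrowType::UInt16 => "u16",
        ArrowType::LargeUtf8 => "large_string",
        ArrowType::Decimal256 => "decimal256",
    };
    SubstraitType::UserDefined {
        uri: ARROW_EXTENSION_URI.to_string(),
        name: name.to_string(),
    }
}

/// Consumer side: accepts the new extension encoding and, for backwards
/// compatibility, the legacy `type_variation_reference: 1` form for UInt8.
fn consume(t: &SubstraitType) -> Option<ArrowType> {
    match t {
        SubstraitType::I8 { type_variation_reference: 0 } => Some(ArrowType::Int8),
        // Legacy encoding from the "Before" snippet.
        SubstraitType::I8 { type_variation_reference: 1 } => Some(ArrowType::UInt8),
        SubstraitType::I8 { .. } => None,
        SubstraitType::UserDefined { uri, name } if uri.as_str() == ARROW_EXTENSION_URI => {
            match name.as_str() {
                "u8" => Some(ArrowType::UInt8),
                "u16" => Some(ArrowType::UInt16),
                "large_string" => Some(ArrowType::LargeUtf8),
                "decimal256" => Some(ArrowType::Decimal256),
                _ => None,
            }
        }
        // Extension types from an unknown URI are not interpreted here.
        SubstraitType::UserDefined { .. } => None,
    }
}
```

With this shape, `consume(&produce(t))` round-trips every type in the sketch, while plans written by older producers (the variation-reference form) still decode to `UInt8`.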
