westonpace opened a new issue, #12181:
URL: https://github.com/apache/datafusion/issues/12181

   ### Is your feature request related to a problem or challenge?
   
   I am trying to take a filter expression created by pyarrow and convert it 
into a filter expression that DataFusion can evaluate.  I am using Substrait to 
do this.  Everything works fine when I use the standard Substrait types.  
However, when I use Arrow types that have no Substrait equivalent (e.g. 
unsigned integers, large containers) I run into problems.
   
   It seems that arrow-cpp (admittedly, me, in this case) and DataFusion have 
taken different approaches to handling these unsupported types.
   
   In arrow-cpp the types that expand or change the valid range of values (e.g. 
unsigned integers, large containers) are converted to extension types.  This 
process is documented in 
https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml
   
   In DataFusion it appears these types are expected to use the nearest 
Substrait match (e.g. signed integer, small container) with a type variation.
   
   ### Describe the solution you'd like
   
   I am admittedly biased (given I implemented one of the two disagreeing 
components) but I favor the extension types approach.  Type variations are 
defined in Substrait as follows:
   
   > Type variations may be used to represent differences in representation 
between different consumers. For example, an engine might support dictionary 
encoding for a string, or could be using either a row-wise or columnar 
representation of a struct. All variations of a type are expected to have the 
same semantics when operated on by functions or other expressions.
   
   Given that definition, I do not think it is valid to say that an unsigned 
integer is a variation of a signed integer (they do not have the same outputs 
for all functions).  I do believe things like the view types and dictionary 
encoding are valid type variations.
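
To illustrate the point about semantics, here is a minimal sketch in plain Python. It is not taken from arrow-cpp or DataFusion, and the helper name `reinterpret` is hypothetical. It reads one stored byte as both an unsigned and a signed 8-bit integer and applies the same "greater than zero" check to each.

```python
def reinterpret(bits: int, width: int, signed: bool) -> int:
    """Interpret the low `width` bits of `bits` as a signed or unsigned integer."""
    raw = (bits & ((1 << width) - 1)).to_bytes(width // 8, "little")
    return int.from_bytes(raw, "little", signed=signed)


bits = 0xC8  # a single stored byte

as_u8 = reinterpret(bits, 8, signed=False)  # read as an unsigned 8-bit integer
as_i8 = reinterpret(bits, 8, signed=True)   # read as a signed 8-bit integer

# The same function ("greater than zero") gives different answers for the
# two interpretations, so they do not share the same semantics.
u8_positive = as_u8 > 0
i8_positive = as_i8 > 0

print(as_u8, as_i8, u8_positive, i8_positive)
```

A dictionary-encoded string and a plain string, by contrast, would produce the same result for every function, which is why I consider those valid variations.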
   
   ### Describe alternatives you've considered
   
   The alternative would be to change arrow-cpp to also use type variations, 
though I'd like some consensus from the Substrait community that this is a 
valid use of type variations before taking that approach.
   
   At the moment I am working around this issue by simply removing any 
non-standard types from the input schema (this works as long as the filter 
isn't referencing those types).
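
A rough sketch of that workaround is below. It assumes the schema is modeled as plain `(name, type_name)` pairs rather than a real pyarrow schema, and both the `NON_SUBSTRAIT_TYPES` set and the `strip_nonstandard` helper are hypothetical names chosen for illustration, not part of any library.

```python
# Assumption: an illustrative, incomplete set of Arrow type names that have
# no direct Substrait equivalent.
NON_SUBSTRAIT_TYPES = {
    "uint8", "uint16", "uint32", "uint64",
    "large_string", "large_binary", "large_list",
}


def strip_nonstandard(fields, referenced):
    """Drop fields with non-standard types, failing if the filter needs one."""
    kept = []
    for name, type_name in fields:
        if type_name in NON_SUBSTRAIT_TYPES:
            if name in referenced:
                # The workaround only holds when the filter avoids these columns.
                raise ValueError(f"filter references non-standard column {name!r}")
            continue
        kept.append((name, type_name))
    return kept


schema = [
    ("id", "int64"),
    ("count", "uint32"),
    ("name", "string"),
    ("payload", "large_binary"),
]

# The filter only touches "id" and "name", so the other two can be dropped.
reduced = strip_nonstandard(schema, referenced={"id", "name"})
print(reduced)
```

If the filter did reference `count`, this sketch raises instead of silently producing a schema the consumer cannot resolve.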
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

