westonpace commented on PR #12916:
URL: https://github.com/apache/arrow/pull/12916#issuecomment-1105722249

   > Right now it only specifies the types, but Substrait currently doesn't 
specify a way for YAML files to refer to each other, so unless support for that 
is added
   
   I'm assuming we will figure out how to support this.  We also need to be 
able to go the other way so we can add kernels to existing functions.
   
   > how would you determine if two generalized URIs refer to the same file?
   
   I'm not sure I understand this point.  A single file shouldn't have 
different URIs referring to it.  While that might be ok for URIs in general, we 
are using them as namespaces.  We already have this constraint.  If a producer 
asks for `file:///foo/bar/types.yaml#my_type` and the consumer only knows about 
`file:///foo/../foo/bar/types.yaml#my_type` then there will be a problem.
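
   A minimal sketch of why this matters, assuming the consumer looks up extension URIs by exact string match (the helper names below are hypothetical and not part of Arrow or Substrait):

```python
import posixpath
from urllib.parse import urlsplit

# The two URIs from the example above.
producer_uri = "file:///foo/bar/types.yaml#my_type"
consumer_uri = "file:///foo/../foo/bar/types.yaml#my_type"


def same_namespace(a: str, b: str) -> bool:
    """Namespace-style comparison: URIs are opaque strings, no normalization."""
    return a == b


def same_after_path_normalization(a: str, b: str) -> bool:
    """Illustrative only: collapse '..' segments before comparing."""
    pa, pb = urlsplit(a), urlsplit(b)
    key_a = (pa.scheme, pa.netloc, posixpath.normpath(pa.path), pa.fragment)
    key_b = (pb.scheme, pb.netloc, posixpath.normpath(pb.path), pb.fragment)
    return key_a == key_b


# Opaque comparison treats these as different namespaces.
namespaces_match = same_namespace(producer_uri, consumer_uri)
# Path normalization would consider them the same file.
files_match = same_after_path_normalization(producer_uri, consumer_uri)
```

   The two URIs only match after normalization, which is why a producer and consumer need to agree on one canonical spelling up front.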
   
   > I suppose you could also argue that the YAML file defines what a 
generalized Arrow Substrait consumer should support.
   
   That is what I am arguing.  Arrow is a spec and it has a type system.  There 
should be a single URI (and ideally a single file) used to namespace all types 
that are in the Arrow spec but not in the Substrait spec.  I think this file 
should also contain kernels for all of the standard Substrait functions (e.g. 
`add_uint8_uint8`).
   
   For example, [DuckDB supports Arrow's type system and wants a way to express 
unsigned 
integers](https://github.com/substrait-io/substrait/discussions/2#discussioncomment-2559648).
  It would be tricky to ensure that DuckDB and Arrow use the exact same URIs for 
uint8 if each project maintained its own definitions.
   
   I agree this means the file would not be automatically generated from any 
particular implementation.  I agree this leads to a bit of tedious work and we 
will need "spec verification tests" to make sure we support the spec names.  
However, I don't know how you can have all Arrow consumers be compatible 
without doing this tedious work and that seems like a worthwhile goal.
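
   A rough sketch of what such a "spec verification test" could look like, assuming the spec names and the implementation's names can each be collected into a set. Apart from `add_uint8_uint8`, the kernel names and the helper are hypothetical:

```python
# Names the shared spec file would declare (hypothetical contents).
SPEC_KERNELS = {"add_uint8_uint8", "add_uint16_uint16"}

# Names a particular consumer actually implements (hypothetical contents).
IMPLEMENTED_KERNELS = {"add_uint8_uint8", "add_uint16_uint16", "add_uint32_uint32"}


def missing_from_implementation(spec_names, implemented_names):
    """Return spec names the implementation does not support, sorted."""
    return sorted(set(spec_names) - set(implemented_names))


def check_spec_coverage(spec_names, implemented_names):
    """Raise if the implementation lacks anything the spec requires."""
    missing = missing_from_implementation(spec_names, implemented_names)
    if missing:
        raise AssertionError(f"spec names not supported: {missing}")


# Extra implementation-only names are allowed; missing spec names are not.
check_spec_coverage(SPEC_KERNELS, IMPLEMENTED_KERNELS)
```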
   
   >  In that case, though, 
[ARROW-15535](https://issues.apache.org/jira/browse/ARROW-15535) should be 
closed and...
   
   I disagree.  I don't think every Arrow consumer will be identical.  I think 
we will eventually end up with three levels of "specification"
   
   * The Substrait spec: This is the most common and smallest subset of 
functionality.
   * The Substrait-Arrow spec: Extends the Substrait spec with all types in the 
Arrow type system.  Shared across consumers & producers that are willing to 
support Arrow.
   * The Substrait-Arrow-C++-Impl spec: This is an experimental proving ground 
for new functionality.  Should be automatically generated from the code.  It 
will be primarily used by functional tests of the Arrow consumer but a few 
functions might be used by producers that are tightly coupled to Arrow (e.g. 
pyarrow or arrow-r) to expose new functionality early before it has gone 
through the specification process.
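
   The layering above can be sketched as an ordered lookup, where each level extends the one before it. Everything here is illustrative: the level labels mirror the list, and `experimental_type` is a made-up name standing in for whatever the C++ implementation might expose early.

```python
# Ordered from most widely shared to most implementation-specific.
SPEC_LEVELS = [
    ("substrait", {"i32"}),
    ("substrait-arrow", {"uint8"}),
    ("substrait-arrow-cpp-impl", {"experimental_type"}),  # hypothetical name
]


def defining_level(type_name):
    """Return the first (most portable) level that defines type_name, or None."""
    for level, names in SPEC_LEVELS:
        if type_name in names:
            return level
    return None


def supported_by(consumer_level, type_name):
    """A consumer at consumer_level supports its own level and all levels before it."""
    order = [level for level, _ in SPEC_LEVELS]
    found = defining_level(type_name)
    if found is None:
        return False
    return order.index(found) <= order.index(consumer_level)
```

   Under this sketch, a consumer that only implements the Substrait-Arrow level would accept `uint8` but reject names that exist only in the C++ implementation level.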


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
