westonpace commented on PR #12916: URL: https://github.com/apache/arrow/pull/12916#issuecomment-1105722249
> Right now it only specifies the types, but Substrait currently doesn't specify a way for YAML files to refer to each other, so unless support for that is added

I'm assuming we will figure out how to support this. We also need to be able to go the other way so we can add kernels to existing functions.

> how would you determine if two generalized URIs refer to the same file?

I'm not sure I understand this point. A single file shouldn't have different URIs referring to it. While that might be ok for URIs in general, we are using them as namespaces. We already have this constraint. If a producer asks for `file:///foo/bar/types.yaml#my_type` and the consumer only knows about `file:///foo/../foo/bar/types.yaml#my_type` then there will be a problem.

> I suppose you could also argue that the YAML file defines what a generalized Arrow Substrait consumer should support.

That is what I am arguing. Arrow is a spec and it has a type system. There should be a single URI (and ideally a single file) used to namespace all types that are in the Arrow spec but not in the Substrait spec. I think this file should also contain kernels for all of the standard Substrait functions (e.g. `add_uint8_uint8`).

For example, [DuckDB supports Arrow's type system and wants a way to express unsigned integers](https://github.com/substrait-io/substrait/discussions/2#discussioncomment-2559648). It would be a tricky balance to ensure that DuckDB and Arrow use the exact same URI for uint8 if each maintained its own definition.

I agree this means the file would not be automatically generated from any particular implementation. I agree this leads to a bit of tedious work and we will need "spec verification tests" to make sure we support the spec names. However, I don't know how you can have all Arrow consumers be compatible without doing this tedious work, and compatibility seems like a worthwhile goal.
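
To make the namespace point about `file:///foo/bar/types.yaml#my_type` concrete, here is a rough Python sketch. It is only an illustration: the `registry` dict and the `normalize` helper are hypothetical and do not correspond to anything in Arrow or Substrait. It assumes a consumer that looks up extension types by exact URI string, which is how a namespace behaves.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# The two URIs from the comment above.
PRODUCER_URI = "file:///foo/bar/types.yaml#my_type"
CONSUMER_URI = "file:///foo/../foo/bar/types.yaml#my_type"


def normalize(uri: str) -> str:
    """Hypothetical helper: collapse '.' and '..' segments in the URI path."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=posixpath.normpath(parts.path)))


# Hypothetical consumer registry keyed by the exact URI string (a namespace).
registry = {CONSUMER_URI: "my_type definition"}

# A namespace lookup compares strings, so the producer's spelling is not found.
exact_match = registry.get(PRODUCER_URI)

# Treated as generic URIs, both spellings point at the same file.
same_file = normalize(PRODUCER_URI) == normalize(CONSUMER_URI)

print("exact lookup:", exact_match)
print("same file after normalization:", same_file)
```

The lookup misses even though both spellings resolve to the same file. That is why I think every producer and consumer needs to agree on one canonical URI rather than rely on normalization.
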

> In that case, though, [ARROW-15535](https://issues.apache.org/jira/browse/ARROW-15535) should be closed and...

I disagree. I don't think every Arrow consumer will be identical. I think we will eventually end up with three levels of "specification"

* The Substrait spec: This is the most common and smallest subset of functionality.
* The Substrait-Arrow spec: Extends the Substrait spec with all types in the Arrow type system. Shared across consumers & producers that are willing to support Arrow.
* The Substrait-Arrow-C++-Impl spec: This is an experimental proving ground for new functionality. Should be automatically generated from the code. It will be primarily used by functional tests of the Arrow consumer but a few functions might be used by producers that are tightly coupled to Arrow (e.g. pyarrow or arrow-r) to expose new functionality early before it has gone through the specification process.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
