Jeroen van Straten created ARROW-16988:
------------------------------------------

             Summary: Introduce Substrait ToProto/FromProto conversion options
                 Key: ARROW-16988
                 URL: https://issues.apache.org/jira/browse/ARROW-16988
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Jeroen van Straten


The goal of ARROW-16860 and in general one of the goals of the Substrait 
consumer effort thus far, is to enable round-tripping between Substrait and 
Acero plans. However, this begs the question what constitutes round-tripping: 
are we talking about a perfect reproduction of a Substrait plan after 
converting it to and from Acero (and/or vice-versa?), or we just talking about 
functionally-equivalent plans, or is it something in between?

This is kind of a rhethorical question because I think it depends on the use 
case. We've been doing the former thus far to help prove correctness, but this 
has various problems. For example:

* Substrait plans contain meaningless information that cannot be represented in 
Acero, such as the order in which extensions are defined or the anchors used to 
refer to them. Plans are functionally and structurally indistinguishable even 
if this information is lost.
* Protobuf itself also contains meaningless information, because the order in 
which fields are defined on the wire is undefined, and not even consistent 
between serializations (hence the existence of 
[CheckMessagesEquivalent|https://github.com/apache/arrow/blob/2a2d01d70e4e93cad07562f7df9c5d5ccf8e9840/cpp/src/arrow/engine/substrait/serde.h#L196-L208]).
* (I'm guessing) Acero plans also contain functionally meaningless information 
(like intermediate column names) that Substrait cannot represent, at least not 
without advanced extensions.
* The Substrait and Arrow type systems are quite different; tracking the 
conversion between them in a way that loses no (meta-)information is difficult. 
For example, Acero always encodes field names in schemas, while Substrait only 
does this at the input and output.
* Substrait and Acero deal with projections and expressions in fundamentally 
different ways (see ARROW-16986).

The approach thus far has been to just reject an incoming plan if it contains 
something that can't be round-tripped exactly (at least according to 
CheckMessagesEquivalent), but this behavior is far too pedantic to be useful in 
practice, since it rejects perfectly valid and executable plans. For example, 
optimizations (hints) specified in advanced extensions [can be freely 
ignored|https://github.com/substrait-io/substrait/blob/a79eb07a15cfa157e795f028a83f746967c98805/proto/substrait/extensions/extensions.proto#L75-L77]
 but are currently rejected.

Rather than trying to answer this question, I'd suggest adding a method to 
specify conversion options. Initially I suggest a single enum with the 
following variants:

* pedantic conversion: reject plans that are known to not round-trip even if 
they are valid.
* structure-preserving conversion: accept plans even if they won't round-trip, 
but preserve the relation structure of the incoming plan completely.
* best-effort conversion: accept plans even if they won't round-trip, and avoid 
regressions in terms of optimality of the plan caused by the conversion, 
potentially changing the relation structure, thus allowing for "optimizations" 
like ARROW-16986.

The enum should be in an options struct though, so more options can be added 
later without having to add more arguments to the conversion functions (see 
also ARROW-16987).




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to