Jeroen van Straten created ARROW-16988:
------------------------------------------
Summary: Introduce Substrait ToProto/FromProto conversion options
Key: ARROW-16988
URL: https://issues.apache.org/jira/browse/ARROW-16988
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Jeroen van Straten
The goal of ARROW-16860 and in general one of the goals of the Substrait
consumer effort thus far, is to enable round-tripping between Substrait and
Acero plans. However, this begs the question what constitutes round-tripping:
are we talking about a perfect reproduction of a Substrait plan after
converting it to and from Acero (and/or vice-versa?), or we just talking about
functionally-equivalent plans, or is it something in between?
This is kind of a rhethorical question because I think it depends on the use
case. We've been doing the former thus far to help prove correctness, but this
has various problems. For example:
* Substrait plans contain meaningless information that cannot be represented in
Acero, such as the order in which extensions are defined or the anchors used to
refer to them. Plans are functionally and structurally indistinguishable even
if this information is lost.
* Protobuf itself also contains meaningless information, because the order in
which fields are defined on the wire is undefined, and not even consistent
between serializations (hence the existence of
[CheckMessagesEquivalent|https://github.com/apache/arrow/blob/2a2d01d70e4e93cad07562f7df9c5d5ccf8e9840/cpp/src/arrow/engine/substrait/serde.h#L196-L208]).
* (I'm guessing) Acero plans also contain functionally meaningless information
(like intermediate column names) that Substrait cannot represent, at least not
without advanced extensions.
* The Substrait and Arrow type systems are quite different; tracking the
conversion between them in a way that loses no (meta-)information is difficult.
For example, Acero always encodes field names in schemas, while Substrait only
does this at the input and output.
* Substrait and Acero deal with projections and expressions in fundamentally
different ways (see ARROW-16986).
The approach thus far has been to just reject an incoming plan if it contains
something that can't be round-tripped exactly (at least according to
CheckMessagesEquivalent), but this behavior is far too pedantic to be useful in
practice, since it rejects perfectly valid and executable plans. For example,
optimizations (hints) specified in advanced extensions [can be freely
ignored|https://github.com/substrait-io/substrait/blob/a79eb07a15cfa157e795f028a83f746967c98805/proto/substrait/extensions/extensions.proto#L75-L77]
but are currently rejected.
Rather than trying to answer this question, I'd suggest adding a method to
specify conversion options. Initially I suggest a single enum with the
following variants:
* pedantic conversion: reject plans that are known to not round-trip even if
they are valid.
* structure-preserving conversion: accept plans even if they won't round-trip,
but preserve the relation structure of the incoming plan completely.
* best-effort conversion: accept plans even if they won't round-trip, and avoid
regressions in terms of optimality of the plan caused by the conversion,
potentially changing the relation structure, thus allowing for "optimizations"
like ARROW-16986.
The enum should be in an options struct though, so more options can be added
later without having to add more arguments to the conversion functions (see
also ARROW-16987).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)