[
https://issues.apache.org/jira/browse/ARROW-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeroen van Straten updated ARROW-16988:
---------------------------------------
Summary: [C++] Introduce Substrait ToProto/FromProto conversion options
(was: Introduce Substrait ToProto/FromProto conversion options)
> [C++] Introduce Substrait ToProto/FromProto conversion options
> --------------------------------------------------------------
>
> Key: ARROW-16988
> URL: https://issues.apache.org/jira/browse/ARROW-16988
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jeroen van Straten
> Assignee: Jeroen van Straten
> Priority: Major
> Labels: pull-request-available, substrait
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The goal of ARROW-16860 and in general one of the goals of the Substrait
> consumer effort thus far, is to enable round-tripping between Substrait and
> Acero plans. However, this begs the question what constitutes round-tripping:
> are we talking about a perfect reproduction of a Substrait plan after
> converting it to and from Acero (and/or vice-versa?), or we just talking
> about functionally-equivalent plans, or is it something in between?
> This is kind of a rhethorical question because I think it depends on the use
> case. We've been doing the former thus far to help prove correctness, but
> this has various problems. For example:
> * Substrait plans contain meaningless information that cannot be represented
> in Acero, such as the order in which extensions are defined or the anchors
> used to refer to them. Plans are functionally and structurally
> indistinguishable even if this information is lost.
> * Protobuf itself also contains meaningless information, because the order in
> which fields are defined on the wire is undefined, and not even consistent
> between serializations (hence the existence of
> [CheckMessagesEquivalent|https://github.com/apache/arrow/blob/2a2d01d70e4e93cad07562f7df9c5d5ccf8e9840/cpp/src/arrow/engine/substrait/serde.h#L196-L208]).
> * (I'm guessing) Acero plans also contain functionally meaningless
> information (like intermediate column names) that Substrait cannot represent,
> at least not without advanced extensions.
> * The Substrait and Arrow type systems are quite different; tracking the
> conversion between them in a way that loses no (meta-)information is
> difficult. For example, Acero always encodes field names in schemas, while
> Substrait only does this at the input and output.
> * Substrait and Acero deal with projections and expressions in fundamentally
> different ways (see ARROW-16986).
> The approach thus far has been to just reject an incoming plan if it contains
> something that can't be round-tripped exactly (at least according to
> CheckMessagesEquivalent), but this behavior is far too pedantic to be useful
> in practice, since it rejects perfectly valid and executable plans. For
> example, optimizations (hints) specified in advanced extensions [can be
> freely
> ignored|https://github.com/substrait-io/substrait/blob/a79eb07a15cfa157e795f028a83f746967c98805/proto/substrait/extensions/extensions.proto#L75-L77]
> but are currently rejected.
> Rather than trying to answer this question, I'd suggest adding a method to
> specify conversion options. Initially I suggest a single enum with the
> following variants:
> * pedantic conversion: reject plans that are known to not round-trip even if
> they are valid.
> * structure-preserving conversion: accept plans even if they won't
> round-trip, but preserve the relation structure of the incoming plan
> completely.
> * best-effort conversion: accept plans even if they won't round-trip, and
> avoid regressions in terms of optimality of the plan caused by the
> conversion, potentially changing the relation structure, thus allowing for
> "optimizations" like ARROW-16986.
> The enum should be in an options struct though, so more options can be added
> later without having to add more arguments to the conversion functions (see
> also ARROW-16987).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)