[ 
https://issues.apache.org/jira/browse/ARROW-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeroen van Straten updated ARROW-16988:
---------------------------------------
    Summary: [C++] Introduce Substrait ToProto/FromProto conversion options  
(was: Introduce Substrait ToProto/FromProto conversion options)

> [C++] Introduce Substrait ToProto/FromProto conversion options
> --------------------------------------------------------------
>
>                 Key: ARROW-16988
>                 URL: https://issues.apache.org/jira/browse/ARROW-16988
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jeroen van Straten
>            Assignee: Jeroen van Straten
>            Priority: Major
>              Labels: pull-request-available, substrait
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The goal of ARROW-16860 and in general one of the goals of the Substrait 
> consumer effort thus far, is to enable round-tripping between Substrait and 
> Acero plans. However, this begs the question what constitutes round-tripping: 
> are we talking about a perfect reproduction of a Substrait plan after 
> converting it to and from Acero (and/or vice-versa?), or we just talking 
> about functionally-equivalent plans, or is it something in between?
> This is kind of a rhethorical question because I think it depends on the use 
> case. We've been doing the former thus far to help prove correctness, but 
> this has various problems. For example:
> * Substrait plans contain meaningless information that cannot be represented 
> in Acero, such as the order in which extensions are defined or the anchors 
> used to refer to them. Plans are functionally and structurally 
> indistinguishable even if this information is lost.
> * Protobuf itself also contains meaningless information, because the order in 
> which fields are defined on the wire is undefined, and not even consistent 
> between serializations (hence the existence of 
> [CheckMessagesEquivalent|https://github.com/apache/arrow/blob/2a2d01d70e4e93cad07562f7df9c5d5ccf8e9840/cpp/src/arrow/engine/substrait/serde.h#L196-L208]).
> * (I'm guessing) Acero plans also contain functionally meaningless 
> information (like intermediate column names) that Substrait cannot represent, 
> at least not without advanced extensions.
> * The Substrait and Arrow type systems are quite different; tracking the 
> conversion between them in a way that loses no (meta-)information is 
> difficult. For example, Acero always encodes field names in schemas, while 
> Substrait only does this at the input and output.
> * Substrait and Acero deal with projections and expressions in fundamentally 
> different ways (see ARROW-16986).
> The approach thus far has been to just reject an incoming plan if it contains 
> something that can't be round-tripped exactly (at least according to 
> CheckMessagesEquivalent), but this behavior is far too pedantic to be useful 
> in practice, since it rejects perfectly valid and executable plans. For 
> example, optimizations (hints) specified in advanced extensions [can be 
> freely 
> ignored|https://github.com/substrait-io/substrait/blob/a79eb07a15cfa157e795f028a83f746967c98805/proto/substrait/extensions/extensions.proto#L75-L77]
>  but are currently rejected.
> Rather than trying to answer this question, I'd suggest adding a method to 
> specify conversion options. Initially I suggest a single enum with the 
> following variants:
> * pedantic conversion: reject plans that are known to not round-trip even if 
> they are valid.
> * structure-preserving conversion: accept plans even if they won't 
> round-trip, but preserve the relation structure of the incoming plan 
> completely.
> * best-effort conversion: accept plans even if they won't round-trip, and 
> avoid regressions in terms of optimality of the plan caused by the 
> conversion, potentially changing the relation structure, thus allowing for 
> "optimizations" like ARROW-16986.
> The enum should be in an options struct though, so more options can be added 
> later without having to add more arguments to the conversion functions (see 
> also ARROW-16987).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to