jvanstraten commented on code in PR #13537:
URL: https://github.com/apache/arrow/pull/13537#discussion_r920425103


##########
cpp/src/arrow/engine/substrait/options.h:
##########
@@ -0,0 +1,55 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This API is EXPERIMENTAL.
+
+#pragma once
+
+namespace arrow {
+namespace engine {
+
+/// How strictly to adhere to the input structure when converting between 
Substrait and
+/// Acero representations of a plan. This allows the user to trade conversion 
accuracy
+/// for performance and lenience.
+enum class ConversionStrictness {
+  /// Prevent information loss by rejecting incoming plans that use features 
or contain
+  /// metadata that cannot be exactly represented in the output format in a 
way that
+  /// will round-trip. Relations/nodes must map one-to-one.
+  PEDANTIC,
+
+  /// When an incoming plan uses a feature that cannot be exactly represented 
in the
+  /// output format, attempt to emulate that feature as opposed to immediately
+  /// rejecting the plan. For example, a Substrait SortRel with a complex sort 
key
+  /// expression may be emulated using a project-order-project triple. 
Relations/nodes
+  /// will thus map one-to-many.
+  PRESERVE_STRUCTURE,
+
+  /// Attempt to prevent performance-related regressions caused by differences 
in how
+  /// operations are represented in the input and output format, by allowing 
for
+  /// optimizations that cross structural boundaries. For example, the 
converter may
+  /// collapse chains of project nodes into one.

Review Comment:
   > Isn't collapsing of a chain of project nodes more of an optimization than 
conversion concern? The collapsed node should be strictly equivalent to the 
original chain AFAIU.
   
   My reasoning for doing this is stated much more clearly in the JIRA issue, 
but tl;dr, the problem is that Substrait and Acero differ to such an extent 
that you can't map one-to-one without loss of information in basically any 
practical use case. So, whenever a feature is used at the input that doesn't 
have an exact match at the output the option comes into play:
   
    - PEDANTIC: reject the conversion; assert that there is no (known) 
information loss in the conversion. Plans should either round-trip back and 
forth exactly or not at all. Useful for testing and maybe debugging, but not 
really in practice. Note that this is the current behavior.
    - PRESERVE_STRUCTURE: for every individual primitive in the incoming plan 
(node, expression, whatever) that can't be represented exactly in the output 
format, map to some collection of primitives instead to model the behavior, and 
only fail if this is also not possible. Even if the incoming plan is completely 
optimal, the returned plan probably won't be because of this expansion. 
Roundtrips back and forth will likely make the plan increasingly suboptimal. 
However, you could hypothetically add debug information to the plan a la gcc 
-O0 -g to trace every primitive back to exactly one primitive in the original 
plan.
    - BEST_EFFORT: like PRESERVE_STRUCTURE, but prefer performance over 
structural accuracy; attempt to not regress in terms of plan performance. If 
the incoming plan was already aggressively optimized, the goal is for the 
output plan to not be substantially less performant.
   
   My reasoning for making the latter part of the conversion is roughly:
   
    - Some of these "optimizations" are very easy and performant to do while 
converting. Conversely, doing this afterwards requires building up an 
intermediate tree during conversion and then going over that tree again later. 
In fact, in some cases I've found it easier and less prone to dealing with 
special cases than exactly preserving structure.
    - I foresee (hope for) a generic Substrait optimization engine. It'd be a 
shame if we can't leverage that, because the moment we convert from Substrait 
to Acero we immediately lose out on performance for trivial reasons.
   
   That being said, if Acero gets its own optimizer at some point this option 
won't be as useful anymore. I also don't intend for BEST_EFFORT to do any fancy 
optimizations; in particular, anything that involves more than the tree 
traversal we already have to do for the conversion doesn't belong there. That 
is, indeed, the job of an optimization engine. As such, I've formulated the 
docs for the option such that it's perfectly permissible for BEST_EFFORT to 
have the same behavior as PRESERVE_STRUCTURE, in case we want to factor the 
logic into a smarter optimizer at some point.
   
   > is BEST_EFFORT always more lenient than PRESERVE_STRUCTURE?
   
   If by "lenient" you mean "how often it rejects a plan," with what I have in 
mind they'd be equal. However, I guess BEST_EFFORT could be more lenient, too. 
It certainly shouldn't be *less* lenient though.
   
   > does BEST_EFFORT also allow the converse (relations/nodes mapping 
many-to-one)?
   
   I'd say "many to one" is a subset of "many to many," so yes. The behavior I 
have in mind for it indeed also does that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to