jvanstraten commented on code in PR #13537: URL: https://github.com/apache/arrow/pull/13537#discussion_r920425103
########## cpp/src/arrow/engine/substrait/options.h: ########## @@ -0,0 +1,55 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// This API is EXPERIMENTAL. + +#pragma once + +namespace arrow { +namespace engine { + +/// How strictly to adhere to the input structure when converting between Substrait and +/// Acero representations of a plan. This allows the user to trade conversion accuracy +/// for performance and lenience. +enum class ConversionStrictness { + /// Prevent information loss by rejecting incoming plans that use features or contain + /// metadata that cannot be exactly represented in the output format in a way that + /// will round-trip. Relations/nodes must map one-to-one. + PEDANTIC, + + /// When an incoming plan uses a feature that cannot be exactly represented in the + /// output format, attempt to emulate that feature as opposed to immediately + /// rejecting the plan. For example, a Substrait SortRel with a complex sort key + /// expression may be emulated using a project-order-project triple. Relations/nodes + /// will thus map one-to-many. + PRESERVE_STRUCTURE, + + /// Attempt to prevent performance-related regressions caused by differences in how + /// operations are represented in the input and output format, by allowing for + /// optimizations that cross structural boundaries. For example, the converter may + /// collapse chains of project nodes into one. Review Comment: > Isn't collapsing of a chain of project nodes more of an optimization than conversion concern? The collapsed node should be strictly equivalent to the original chain AFAIU. My reasoning for doing this is stated much more clearly in the JIRA issue, but tl;dr, the problem is that Substrait and Acero differ to such an extent that you can't map one-to-one without loss of information in basically any practical use case. So, whenever a feature is used at the input that doesn't have an exact match at the output the option comes into play: - PEDANTIC: reject the conversion; assert that there is no (known) information loss in the conversion. Plans should either round-trip back and forth exactly or not at all. Useful for testing and maybe debugging, but not really in practice. Note that this is the current behavior. - PRESERVE_STRUCTURE: for every individual primitive in the incoming plan (node, expression, whatever) that can't be represented exactly in the output format, map to some collection of primitives instead to model the behavior, and only fail if this is also not possible. Even if the incoming plan is completely optimal, the returned plan probably won't be because of this expansion. Roundtrips back and forth will likely make the plan increasingly suboptimal. However, you could hypothetically add debug information to the plan a la gcc -O0 -g to trace every primitive back to exactly one primitive in the original plan. - BEST_EFFORT: like PRESERVE_STRUCTURE, but prefer performance over structural accuracy; attempt to not regress in terms of plan performance. If the incoming plan was already aggressively optimized, the goal is for the output plan to not be substantially less performant. My reasoning for making the latter part of the conversion is roughly: - Some of these "optimizations" are very easy and performant to do while converting. Conversely, doing this afterwards requires building up an intermediate tree during conversion and then going over that tree again later. In fact, in some cases I've found it easier and less prone to dealing with special cases than exactly preserving structure. - I foresee (hope for) a generic Substrait optimization engine. It'd be a shame if we can't leverage that, because the moment we convert from Substrait to Acero we immediately lose out on performance for trivial reasons. That being said, if Acero gets its own optimizer at some point this option won't be as useful anymore. I also don't intend for BEST_EFFORT to do any fancy optimizations; in particular, anything that involves more than the tree traversal we already have to do for the conversion doesn't belong there. That is, indeed, the job of an optimization engine. As such, I've formulated the docs for the option such that it's perfectly permissible for BEST_EFFORT to have the same behavior as PRESERVE_STRUCTURE, in case we want to factor the logic into a smarter optimizer at some point. > is BEST_EFFORT always more lenient than PRESERVE_STRUCTURE? If by "lenient" you mean "how often it rejects a plan," with what I have in mind they'd be equal. However, I guess BEST_EFFORT could be more lenient, too. It certainly shouldn't be *less* lenient though. > does BEST_EFFORT also allow the converse (relations/nodes mapping many-to-one)? I'd say "many to one" is a subset of "many to many," so yes. The behavior I have in mind for it indeed also does that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
