[
https://issues.apache.org/jira/browse/ARROW-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527868#comment-17527868
]
Weston Pace commented on ARROW-15582:
-------------------------------------
There are fewer than 100 "standard" Substrait functions right now, but this list
will probably grow. In general I do not think it is safe to assume that Substrait
functions and Arrow functions will share the same name. Even if two functions do
exist with the same name, I don't think it's safe to assume they will have the
same behavior. I think some kind of "mapping" object is going to have to be
maintained.
At a minimum one would think this mapping object would be a simple
bidirectional string:string map which goes from Arrow function name to
Substrait function name and back. Unfortunately, as the ticket describes, I do
not think this is possible today.
The worst case scenario is that we require two functions for every entry in the
mapping: one that goes from a Substrait "call" to an Arrow "call" and one for
the reverse. I think, as a first attempt, we should tackle this with a very
manual mapping, probably with some kind of convenience option for the functions
that are simple aliases, and then we can look at how to improve from there.
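To make the "convenience option for simple aliases" idea concrete, here is a minimal sketch. Everything in it is hypothetical: ToyCall, MakeAlias, ToyMapping, and AddAlias are illustration-only names, and expressions are reduced to plain strings so the example stands alone without the Arrow or Substrait headers.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a call on either side; real calls carry expressions.
struct ToyCall {
  std::string name;
  std::vector<std::string> args;
};

using Conversion = std::function<ToyCall(const ToyCall&)>;

// Hypothetical convenience: a conversion that only renames the function
// and passes the arguments through unchanged.
Conversion MakeAlias(std::string target_name) {
  return [target_name = std::move(target_name)](const ToyCall& call) {
    return ToyCall{target_name, call.args};
  };
}

class ToyMapping {
 public:
  // General registration: any callable that rewrites the call.
  void Add(const std::string& source_name, Conversion conversion) {
    assert(static_cast<bool>(conversion));
    conversions_[source_name] = std::move(conversion);
  }

  // Shortcut for functions that are pure renames.
  void AddAlias(const std::string& source_name, std::string target_name) {
    Add(source_name, MakeAlias(std::move(target_name)));
  }

  // Looks up the conversion by name and applies it.
  ToyCall Convert(const ToyCall& call) const {
    auto it = conversions_.find(call.name);
    if (it == conversions_.end()) {
      throw std::invalid_argument("no mapping for function: " + call.name);
    }
    return it->second(call);
  }

 private:
  std::map<std::string, Conversion> conversions_;
};
```

With this shape, the alias case costs one line per function (AddAlias), while the tricky functions still go through the fully manual Add path.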
A Substrait "call" is a name (string), a vector of arguments (expressions), and
a vector of options (literal expressions). An Arrow "call" is a name (string),
a vector of arguments (expressions), and an options object (POCO).
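The difference between the two shapes can be sketched with toy types. This is an illustration only: ToySubstraitCall, SplitResult, and SplitArgs are hypothetical names, expressions are modeled as strings, and the sketch assumes the options simply trail the operands in the Substrait argument list, so the caller has to say how many operands there are.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy Substrait-style call: operands first, option literals after them.
struct ToySubstraitCall {
  std::string name;
  std::vector<std::string> args;
};

// The two halves of the argument list once they have been separated.
struct SplitResult {
  std::vector<std::string> operands;
  std::vector<std::string> options;
};

// Splits the flat argument list at num_operands. The count cannot be
// inferred from the call itself, so the caller must supply it.
SplitResult SplitArgs(const ToySubstraitCall& call, std::size_t num_operands) {
  assert(num_operands <= call.args.size());
  const auto split =
      call.args.begin() + static_cast<std::ptrdiff_t>(num_operands);
  SplitResult result;
  result.operands.assign(call.args.begin(), split);
  result.options.assign(split, call.args.end());
  return result;
}
```

On the Arrow side no such split is needed, because the options already live in their own object.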
So my suggestion for the mapping would be something like...
{noformat}
// Converts an Arrow call into a Substrait call. The arguments arrive already
// converted to Substrait expressions.
using ArrowToSubstrait = std::function<substrait::Expression::ScalarFunction(
    const arrow::compute::Expression::Call&, std::vector<substrait::Expression>)>;
// Converts a Substrait call into an Arrow call.
using SubstraitToArrow = std::function<arrow::compute::Expression::Call(
    const substrait::Expression::ScalarFunction&)>;

class FunctionMapping {
 public:
  // Registration API
  void AddArrowToSubstrait(std::string arrow_function_name,
                           ArrowToSubstrait conversion_func);
  void AddSubstraitToArrow(std::string substrait_function_name,
                           SubstraitToArrow conversion_func);
  // Usage API
  substrait::Expression::ScalarFunction ToProto(
      const arrow::compute::Expression::Call& call);
  arrow::compute::Expression::Call FromProto(
      const substrait::Expression::ScalarFunction& call);
};
{noformat}
The add function is an interesting example (some pseudo-code / imaginary helper
functions for brevity):
{noformat}
SubstraitToArrow substrait_add_to_arrow =
    [](const substrait::Expression::ScalarFunction& call) {
  // Note, Substrait scalar functions don't distinguish between options and
  // arguments, so the index of this option is 2 because it comes after the
  // operands (at index 0 and 1). This is why we have to specify how many
  // args there are in the GetArgs invocation.
  auto args = GetArgs(call, 2);
  EnumLiteral overflow_handling = GetOption<EnumLiteral>(call, 2);
  if (!IsSpecified(overflow_handling)) {
    // Default to unchecked add because SILENT => unchecked and SILENT
    // is the first option in the enum (and thus the highest priority when
    // not specified)
    return MakeArrowCall("add", args);
  }
  // C++ cannot switch on strings, so compare the enum value directly.
  const std::string value = GetEnumValue(overflow_handling);
  if (value == "SILENT") return MakeArrowCall("add", args);
  if (value == "ERROR") return MakeArrowCall("add_checked", args);
  if (value == "SATURATE") {
    // A real signature would need to return a Result to carry this error.
    return Status::Invalid("Arrow does not have a saturating add");
  }
  return Status::Invalid("Unrecognized overflow option: " + value);
};
// Note, we can automatically do the conversion from Arrow args to Substrait
// args because we distinguish between args and options in Arrow.
// Arrow's "add" is the unchecked variant, so it maps to SILENT.
ArrowToSubstrait arrow_add_to_substrait =
    [](const arrow::compute::Expression::Call& call,
       std::vector<substrait::Expression> args) {
  auto overflow_behavior = MakeEnum("SILENT");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
// Arrow's "add_checked" raises on overflow, so it maps to ERROR.
ArrowToSubstrait arrow_add_checked_to_substrait =
    [](const arrow::compute::Expression::Call& call,
       std::vector<substrait::Expression> args) {
  auto overflow_behavior = MakeEnum("ERROR");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
function_mapping.AddSubstraitToArrow("add", substrait_add_to_arrow);
function_mapping.AddArrowToSubstrait("add", arrow_add_to_substrait);
function_mapping.AddArrowToSubstrait("add_checked",
                                     arrow_add_checked_to_substrait);
{noformat}
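For anyone who wants to experiment with the shape of this API without pulling in Arrow or Substrait, the following is a self-contained toy version of the add mapping. It is only a sketch under stated assumptions: ToyArrowCall, ToySubstraitCall, ToyFunctionMapping, AddWithOverflow, and MakeAddMapping are hypothetical names, expressions are plain strings, errors are reported with exceptions rather than Status, and the registered callbacks take only the call (the pre-converted argument vector from the proposal is omitted for brevity).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins: expressions are plain strings here.
struct ToyArrowCall {
  std::string name;
  std::vector<std::string> args;
};
struct ToySubstraitCall {
  std::string name;
  std::vector<std::string> args;  // operands followed by option literals
};

using ArrowToSubstrait = std::function<ToySubstraitCall(const ToyArrowCall&)>;
using SubstraitToArrow = std::function<ToyArrowCall(const ToySubstraitCall&)>;

class ToyFunctionMapping {
 public:
  void AddArrowToSubstrait(const std::string& name, ArrowToSubstrait func) {
    to_substrait_[name] = std::move(func);
  }
  void AddSubstraitToArrow(const std::string& name, SubstraitToArrow func) {
    to_arrow_[name] = std::move(func);
  }
  ToySubstraitCall ToProto(const ToyArrowCall& call) const {
    return Lookup(to_substrait_, call.name)(call);
  }
  ToyArrowCall FromProto(const ToySubstraitCall& call) const {
    return Lookup(to_arrow_, call.name)(call);
  }

 private:
  // Finds the registered conversion or throws if the name is unmapped.
  template <typename Func>
  static const Func& Lookup(const std::map<std::string, Func>& table,
                            const std::string& name) {
    auto it = table.find(name);
    if (it == table.end()) {
      throw std::invalid_argument("unmapped function: " + name);
    }
    return it->second;
  }
  std::map<std::string, ArrowToSubstrait> to_substrait_;
  std::map<std::string, SubstraitToArrow> to_arrow_;
};

// Builds a conversion that emits Substrait "add" with a fixed overflow option.
ArrowToSubstrait AddWithOverflow(std::string overflow) {
  return [overflow = std::move(overflow)](const ToyArrowCall& call) {
    ToySubstraitCall out{"add", call.args};
    out.args.push_back(overflow);
    return out;
  };
}

ToyFunctionMapping MakeAddMapping() {
  ToyFunctionMapping mapping;
  mapping.AddArrowToSubstrait("add", AddWithOverflow("SILENT"));
  mapping.AddArrowToSubstrait("add_checked", AddWithOverflow("ERROR"));
  mapping.AddSubstraitToArrow("add", [](const ToySubstraitCall& call) {
    if (call.args.size() < 2) {
      throw std::invalid_argument("add requires two operands");
    }
    std::vector<std::string> operands(call.args.begin(), call.args.begin() + 2);
    // A missing option falls back to SILENT, the first enum value.
    const std::string overflow = call.args.size() > 2 ? call.args[2] : "SILENT";
    if (overflow == "SILENT") return ToyArrowCall{"add", operands};
    if (overflow == "ERROR") return ToyArrowCall{"add_checked", operands};
    throw std::invalid_argument("no Arrow equivalent for overflow: " + overflow);
  });
  return mapping;
}
```

Calling ToProto on a toy add_checked call and then FromProto on the result should land back on add_checked, which is a cheap way to check that the two directions of a manual mapping agree with each other.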
> [C++] Add support for registering tricky functions with the Substrait
> consumer (or add a bunch of substrait meta functions)
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15582
> URL: https://issues.apache.org/jira/browse/ARROW-15582
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Labels: substrait
>
> Sometimes one Substrait function will map to multiple Arrow functions. For
> example, the Substrait {{add}} function might be referring to Arrow's {{add}}
> or {{add_checked}}. We need to figure out how to register this correctly
> (e.g. one possible approach would be a {{substrait_add}} meta function).
> Other times a substrait function will encode something Arrow considers an
> "option" as a function argument. For example, the is_in Arrow function is
> unary with an option for the lookup set. The substrait function is binary
> but the second argument must be constant and be the lookup set. Neither of
> which is to be confused with a truly binary is_in function which takes in a
> different set at every row.
> It's possible there is no work to do here other than adding a bunch of
> substrait_ meta functions in Arrow. In that case all the work will be done
> in other JIRAs. Or, it is possible that there is some kind of extension we
> can make to the function registry that bypasses the need for the meta
> functions. I'm leaving this JIRA open so future contributors can consider
> this second option.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)