[ 
https://issues.apache.org/jira/browse/ARROW-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527868#comment-17527868
 ] 

Weston Pace commented on ARROW-15582:
-------------------------------------

There are fewer than 100 "standard" Substrait functions right now, but this 
list will probably grow.  In general I do not think it is safe to assume that 
Substrait functions and Arrow functions will share the same name.  Even if two 
functions do exist with the same name, I don't think it's safe to assume they 
will have the same behavior.  I think some kind of "mapping" object is going to 
have to be maintained.

At a minimum one would think this mapping object would be a simple 
bidirectional string:string map which goes from Arrow function name to 
Substrait function name and back.  Unfortunately, as the ticket describes, I do 
not think this is possible today.

The worst case scenario is that we require two functions for every entry in the 
mapping: one that converts a Substrait "call" to an Arrow "call" and one that 
does the reverse.  I think, as a first attempt, we should tackle this with a 
very manual mapping, probably with some kind of convenience option for the 
functions that are simple aliases, and then we can look at how to improve from 
there.
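
As a rough illustration of the "convenience option" for simple aliases, here is 
a minimal sketch of a bidirectional name map.  The class name AliasMapping and 
its methods are hypothetical (they are not part of Arrow or Substrait), and the 
sketch assumes an alias means "same arguments, same behavior, different name":

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: a bidirectional name map for functions that are
// plain aliases (same arguments, same behavior, different name).
class AliasMapping {
 public:
  // Registers one alias pair in both directions.
  void AddAlias(std::string arrow_name, std::string substrait_name) {
    arrow_to_substrait_[arrow_name] = substrait_name;
    substrait_to_arrow_[std::move(substrait_name)] = std::move(arrow_name);
  }

  // Both lookups return std::nullopt when no alias is registered.
  std::optional<std::string> ToSubstrait(const std::string& arrow_name) const {
    return Lookup(arrow_to_substrait_, arrow_name);
  }
  std::optional<std::string> ToArrow(const std::string& substrait_name) const {
    return Lookup(substrait_to_arrow_, substrait_name);
  }

 private:
  using Map = std::unordered_map<std::string, std::string>;

  static std::optional<std::string> Lookup(const Map& map,
                                           const std::string& key) {
    auto it = map.find(key);
    if (it == map.end()) return std::nullopt;
    return it->second;
  }

  Map arrow_to_substrait_;
  Map substrait_to_arrow_;
};
```

Anything that is not a pure alias would still need a hand-written conversion 
function, so a map like this could only ever be a shortcut layered on top of 
the manual mapping.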

A Substrait "call" is a name (string), a vector of arguments (expressions), and 
a vector of options (literal expressions).  An Arrow "call" is a name (string), 
a vector of arguments (expressions), and an options object (a plain old C++ 
object).

So my suggestion for the mapping would be something like...

{noformat}
// Conversions can fail (e.g. no Arrow equivalent), so they return a Result.
using ArrowToSubstrait =
    std::function<arrow::Result<substrait::Expression::ScalarFunction>(
        const arrow::compute::Expression::Call&,
        std::vector<substrait::Expression>)>;
using SubstraitToArrow =
    std::function<arrow::Result<arrow::compute::Expression::Call>(
        const substrait::Expression::ScalarFunction&)>;

class FunctionMapping {
 public:
  // Registration API
  void AddArrowToSubstrait(std::string arrow_function_name,
                           ArrowToSubstrait conversion_func);
  void AddSubstraitToArrow(std::string substrait_function_name,
                           SubstraitToArrow conversion_func);

  // Usage API
  arrow::Result<substrait::Expression::ScalarFunction> ToProto(
      const arrow::compute::Expression::Call& call);
  arrow::Result<arrow::compute::Expression::Call> FromProto(
      const substrait::Expression::ScalarFunction& call);
};
{noformat}
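
To make the registration and lookup flow concrete, here is a small 
self-contained sketch of one direction of such a mapping.  It is only an 
illustration: SubstraitCall and ArrowCall are simplified stand-ins for the real 
protobuf and Arrow types, std::optional stands in for a proper error type, and 
the Arrow-to-Substrait direction is assumed to be symmetric:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-ins for the real call types (hypothetical, heavily simplified).
struct SubstraitCall {
  std::string name;
  std::vector<std::string> arguments;  // operands followed by options
};
struct ArrowCall {
  std::string name;
  std::vector<std::string> arguments;
};

using SubstraitToArrow =
    std::function<std::optional<ArrowCall>(const SubstraitCall&)>;

class FunctionMapping {
 public:
  // Registers (or replaces) the converter for one Substrait function name.
  void AddSubstraitToArrow(std::string substrait_name, SubstraitToArrow fn) {
    substrait_to_arrow_[std::move(substrait_name)] = std::move(fn);
  }

  // Returns std::nullopt if no converter is registered for the name or if
  // the converter itself rejects the call.
  std::optional<ArrowCall> FromProto(const SubstraitCall& call) const {
    auto it = substrait_to_arrow_.find(call.name);
    if (it == substrait_to_arrow_.end()) return std::nullopt;
    return it->second(call);
  }

 private:
  std::unordered_map<std::string, SubstraitToArrow> substrait_to_arrow_;
};
```

Keying the table on the source function name keeps each converter free to pick 
a different target function depending on the call's options.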

The add function is an interesting example (some pseudo-code / imaginary helper 
functions for brevity):

{noformat}
// Substrait scalar functions don't distinguish between options and arguments,
// so the overflow option sits at index 2, after the operands (at index 0 and
// 1).  This is why GetArgs has to be told how many arguments there are.
SubstraitToArrow substrait_add_to_arrow =
    [](const substrait::Expression::ScalarFunction& substrait_call)
    -> arrow::Result<arrow::compute::Expression::Call> {
  auto args = GetArgs(substrait_call, 2);
  EnumLiteral overflow_handling = GetOption<EnumLiteral>(substrait_call, 2);
  if (!IsSpecified(overflow_handling)) {
    // Default to unchecked add because SILENT => unchecked and SILENT
    // is the first option in the enum (and thus the highest priority when
    // not specified).
    return MakeArrowCall("add", args);
  }
  // C++ cannot switch on strings, so compare the enum value explicitly.
  const std::string value = GetEnumValue(overflow_handling);
  if (value == "SILENT") return MakeArrowCall("add", args);
  if (value == "ERROR") return MakeArrowCall("add_checked", args);
  if (value == "SATURATE") {
    return arrow::Status::Invalid("Arrow does not have a saturating add");
  }
  return arrow::Status::Invalid("Unrecognized overflow option: ", value);
};
// The conversion from Arrow args to Substrait args can be done automatically
// (hence the pre-converted `args` parameter) because Arrow distinguishes
// between args and options.
ArrowToSubstrait arrow_add_to_substrait =
    [](const arrow::compute::Expression::Call& call,
       std::vector<substrait::Expression> args)
    -> arrow::Result<substrait::Expression::ScalarFunction> {
  // Arrow's "add" does not check for overflow.
  auto overflow_behavior = MakeEnum("SILENT");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
ArrowToSubstrait arrow_add_checked_to_substrait =
    [](const arrow::compute::Expression::Call& call,
       std::vector<substrait::Expression> args)
    -> arrow::Result<substrait::Expression::ScalarFunction> {
  // Arrow's "add_checked" raises an error on overflow.
  auto overflow_behavior = MakeEnum("ERROR");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
function_mapping.AddSubstraitToArrow("add", substrait_add_to_arrow);
function_mapping.AddArrowToSubstrait("add", arrow_add_to_substrait);
function_mapping.AddArrowToSubstrait("add_checked",
                                     arrow_add_checked_to_substrait);
{noformat}
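
Because the Substrait side carries operands and options in one flat list, a 
helper along the lines of GetArgs / GetOption needs the operand count to tell 
them apart.  Here is a minimal sketch of that split.  SplitCall and 
SplitArguments are hypothetical names, and a plain string stands in for an 
expression:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for an expression; a string keeps the sketch short.
using Expr = std::string;

struct SplitCall {
  std::vector<Expr> operands;
  std::vector<Expr> options;
};

// Splits the flat argument list of a Substrait-style call into operands and
// trailing options, given how many operands the function takes.
SplitCall SplitArguments(const std::vector<Expr>& flat,
                         std::size_t num_operands) {
  if (flat.size() < num_operands) {
    throw std::invalid_argument("call has fewer arguments than operands");
  }
  const auto split = flat.begin() + static_cast<std::ptrdiff_t>(num_operands);
  return SplitCall{std::vector<Expr>(flat.begin(), split),
                   std::vector<Expr>(split, flat.end())};
}
```

An empty options vector then corresponds to the "not specified" case, where 
the converter has to fall back to the function's default behavior.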

> [C++] Add support for registering tricky functions with the Substrait 
> consumer (or add a bunch of substrait meta functions)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15582
>                 URL: https://issues.apache.org/jira/browse/ARROW-15582
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: substrait
>
> Sometimes one Substrait function will map to multiple Arrow functions.  For 
> example, the Substrait {{add}} function might be referring to Arrow's {{add}} 
> or {{add_checked}}.  We need to figure out how to register this correctly 
> (e.g. one possible approach would be a {{substrait_add}} meta function).
> Other times a substrait function will encode something Arrow considers an 
> "option" as a function argument.  For example, the is_in Arrow function is 
> unary with an option for the lookup set.  The substrait function is binary 
> but the second argument must be constant and be the lookup set.  Neither of 
> which is to be confused with a truly binary is_in function which takes in a 
> different set at every row.
> It's possible there is no work to do here other than adding a bunch of 
> substrait_ meta functions in Arrow.  In that case all the work will be done 
> in other JIRAs.  Or, it is possible that there is some kind of extension we 
> can make to the function registry that bypasses the need for the meta 
> functions.  I'm leaving this JIRA open so future contributors can consider 
> this second option.
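
The is_in case from the ticket description could be handled by the same kind of 
conversion function.  Here is a hedged sketch in which the expression and call 
types are simplified stand-ins and ConvertIsIn is a hypothetical name.  It 
moves a constant second argument into the Arrow-side options and rejects 
anything else:

```cpp
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Stand-in expression: either a column reference or a literal list.
struct FieldRef { std::string name; };
struct ListLiteral { std::vector<int> values; };
using Expr = std::variant<FieldRef, ListLiteral>;

// Stand-in for the Substrait side: is_in is binary.
struct SubstraitCall {
  std::string name;
  std::vector<Expr> arguments;
};

// Stand-in for the Arrow side: is_in is unary and the lookup set is an
// option rather than an argument.
struct ArrowIsInCall {
  Expr input;
  std::vector<int> value_set;
};

// Converts a binary Substrait-style is_in into a unary Arrow-style one.
// Returns std::nullopt unless the second argument is a constant list.
std::optional<ArrowIsInCall> ConvertIsIn(const SubstraitCall& call) {
  if (call.name != "is_in" || call.arguments.size() != 2) return std::nullopt;
  const auto* set = std::get_if<ListLiteral>(&call.arguments[1]);
  if (set == nullptr) return std::nullopt;
  return ArrowIsInCall{call.arguments[0], set->values};
}
```

A truly binary is_in, with a different set on every row, would fail the 
constant check here and would need its own function on the Arrow side.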



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
