Weston Pace created ARROW-15583:
-----------------------------------

             Summary: [C++] The Substrait consumer could potentially use a 
massive amount of RAM if the producer uses large anchors
                 Key: ARROW-15583
                 URL: https://issues.apache.org/jira/browse/ARROW-15583
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


In Substrait a function is referred to by a "fully qualified name" which 
consists of a URI and a function name.  For example, the "add" function is 
something like 
{{https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add}}.
  To avoid serializing these long names multiple times in the plan, the producer 
should pick an anchor value (a 32-bit integer in protobuf) and use that 
everywhere (with a single lookup table at the top level of the plan).

To avoid map lookups, the Arrow C++ consumer currently assumes that this lookup 
table will be small enough that it can be stored in a vector indexed by anchor 
value...

{noformat}
{
  "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add",
  "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract"
}
{noformat}

However, this assumes that a plan is going to use numbers like 0, 1, 2, 
... N to create N anchors.  There is nothing that prevents a producer from 
using whatever numbers it wants (e.g. a pointer value).  If the producer uses a 
really large anchor value then the C++ Substrait consumer will create a lookup 
table with one slot for every value up to that anchor, almost all of them 
blank.  This could lead to a lot of wasted memory.
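
To make the cost concrete, here is a minimal sketch of an anchor-indexed 
vector.  It is not the actual Arrow code: DenseAnchorTable, Register and 
SlotCount are hypothetical names, and the unsigned 32-bit anchor type is an 
assumption.  Registering a single large anchor forces the vector to allocate 
every slot below it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical dense table: slot i holds the name registered for anchor i.
// This only illustrates the idea; it is not the Arrow implementation.
struct DenseAnchorTable {
  std::vector<std::string> names;

  void Register(std::uint32_t anchor, std::string name) {
    if (anchor >= names.size()) {
      // Growing to anchor + 1 fills every gap below it with empty strings.
      names.resize(static_cast<std::size_t>(anchor) + 1);
    }
    names[anchor] = std::move(name);
  }

  // Number of slots allocated, whether they are used or not.
  std::size_t SlotCount() const { return names.size(); }
};
```

With this layout, registering anchors 0 and 1000000 allocates 1000001 slots 
for two functions.  On common 64-bit standard libraries an empty std::string 
takes 24 or 32 bytes, so that is tens of megabytes, and an anchor near the top 
of the 32-bit range would need on the order of 100 GB.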

We could ask that the Substrait spec enforce small anchors, or we could 
change the extension set handling in the C++ consumer to use an unordered_map.
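
A rough sketch of the unordered_map option follows.  The names 
(SparseAnchorTable, Register, Lookup, EntryCount) are hypothetical and the real 
extension set stores more than a string per entry; the point is only that 
memory scales with the number of anchors registered rather than with the 
largest anchor value.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sparse table keyed by anchor value.
// This only illustrates the idea; it is not the Arrow implementation.
struct SparseAnchorTable {
  std::unordered_map<std::uint32_t, std::string> names;

  void Register(std::uint32_t anchor, std::string name) {
    names[anchor] = std::move(name);
  }

  // Returns a pointer to the registered name, or nullptr if the anchor
  // was never registered.
  const std::string* Lookup(std::uint32_t anchor) const {
    auto it = names.find(anchor);
    return it == names.end() ? nullptr : &it->second;
  }

  // Number of anchors actually registered.
  std::size_t EntryCount() const { return names.size(); }
};
```

The trade-off is that each lookup now hashes the anchor instead of indexing 
directly, which is the map lookup the vector was meant to avoid.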



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
