[
https://issues.apache.org/jira/browse/ARROW-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace resolved ARROW-15583.
---------------------------------
Fix Version/s: 9.0.0
Resolution: Fixed
Issue resolved by pull request 12852
[https://github.com/apache/arrow/pull/12852]
> [C++] The Substrait consumer could potentially use a massive amount of RAM if
> the producer uses large anchors
> -------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15583
> URL: https://issues.apache.org/jira/browse/ARROW-15583
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Sanjiban Sengupta
> Priority: Major
> Labels: pull-request-available, substrait
> Fix For: 9.0.0
>
> Time Spent: 5.5h
> Remaining Estimate: 0h
>
> In Substrait a function is referred to by a "fully qualified name" which
> consists of a URI and a function name. For example, the "add" function is
> something like
> {{https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml}}.
> To avoid serializing these long names multiple times in the plan the
> producer should pick an anchor value (an int32 in protobuf) and use that
> everywhere (with a single lookup table at the top level of the plan).
> To avoid map lookups the Arrow C++ consumer currently assumes that this
> lookup table will be small enough it can be stored in a vector...
> {noformat}
> {
>
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add",
>
> "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract"
> }
> {noformat}
> However, this assumes that a plan will use numbers like 0, 1, 2, ..., N-1 to
> create N anchors. Nothing prevents a producer from using whatever numbers it
> wants (e.g. a pointer value). If the producer uses a really large anchor
> value then the C++ Substrait consumer will create a lookup table sized by
> that value, with a lot of blank entries. This could lead to a lot of wasted
> memory.
> We could ask that the Substrait spec enforce small anchors, or we could
> change the extension set handling in the C++ consumer to use an unordered_map.
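
The trade-off described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: FunctionId, DenseSlotsNeeded, SparseExtensionTable and ExampleUsage are hypothetical names, and anchors are modeled as unsigned 32-bit integers as an assumption. It contrasts the number of slots an anchor-indexed vector would need with a table keyed by anchor in a std::unordered_map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an extension-set entry; not Arrow's actual type.
struct FunctionId {
  std::string uri;
  std::string name;
};

// Slots a vector indexed directly by anchor would need: one past the largest
// anchor, no matter how few anchors are actually in use.
std::uint64_t DenseSlotsNeeded(const std::vector<std::uint32_t>& anchors) {
  std::uint64_t slots = 0;
  for (std::uint32_t anchor : anchors) {
    slots = std::max<std::uint64_t>(slots, static_cast<std::uint64_t>(anchor) + 1);
  }
  return slots;
}

// Sparse lookup table keyed by anchor. Memory grows with the number of
// registered anchors rather than with the largest anchor value.
class SparseExtensionTable {
 public:
  // Returns false if the anchor was already registered (the entry is kept).
  bool Register(std::uint32_t anchor, FunctionId id) {
    return table_.emplace(anchor, std::move(id)).second;
  }

  // Returns nullptr when the anchor is unknown.
  const FunctionId* Find(std::uint32_t anchor) const {
    auto it = table_.find(anchor);
    return it == table_.end() ? nullptr : &it->second;
  }

  std::size_t size() const { return table_.size(); }

 private:
  std::unordered_map<std::uint32_t, FunctionId> table_;
};

// Two functions registered under a small anchor and a very large one.
inline void ExampleUsage() {
  const std::string uri =
      "https://github.com/substrait-io/substrait/blob/main/extensions/"
      "functions_arithmetic.yaml";
  SparseExtensionTable table;
  table.Register(0, {uri, "add"});
  table.Register(4000000000u, {uri, "subtract"});
  assert(table.size() == 2);  // two entries stored
  // A dense vector would need about four billion slots for the same plan.
  assert(DenseSlotsNeeded({0, 4000000000u}) == 4000000001ull);
}
```

The map costs a hash lookup per anchor resolution instead of a direct index, which is the price of not depending on how the producer chooses its anchors.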