[GitHub] [arrow] jvanstraten commented on a diff in pull request #12852: ARROW-15583: [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors

GitBox Fri, 22 Apr 2022 10:56:03 -0700


jvanstraten commented on code in PR #12852:
URL: https://github.com/apache/arrow/pull/12852#discussion_r856457349



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, 
std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key, 
value>* map) {
+  DCHECK_LE(i, 1 << 20);
+  if (i >= map->size()) {
+    map->reserve(i + 1);
+  }
+  (*map)[i] = static_cast<value>(element);
+}
+
 }  // namespace
 
 Result<ExtensionSet> GetExtensionSetFromPlan(const substrait::Plan& plan,
                                              ExtensionIdRegistry* registry) {
-  std::vector<util::string_view> uris;
+  std::unordered_map<uint32_t, util::string_view> uris;
   for (const auto& uri : plan.extension_uris()) {
-    SetElement(uri.extension_uri_anchor(), uri.uri(), &uris);
+    SetMapElement(uri.extension_uri_anchor(), uri.uri(), &uris);
   }
 
   // NOTE: it's acceptable to use views to memory owned by plan; 
ExtensionSet::Make

Review Comment:
   > I don't think an `ordered_map` is needed because we can come up with 
unique ids by just looking at the size of the map (if 0 is not valid then we 
can add `1`.)
   
   That doesn't work, because there's no requirement for the indices to be 
sequential for incoming plans. For example, if the indices 1 and 3 are defined, 
and a new type is added on the production side, it would erroneously try to 
reuse map_size + 1 = 3 by this logic. The whole reason why we want to use maps 
at all is because they don't have to be sequential, otherwise it'd be much 
cheaper to just use vectors. You could use it as a cheap starting point when 
looking for unused anchors though; if they *are* sequential it would 
immediately yield a valid index.
   
   > until it's clear what they actually represent.
   
   They're like a poor-man's version of polymorphism, and a way to attach 
logical type information that doesn't impact the physical representation. For 
example, one might define a custom type `polygon`, with type variations 
`triangle` and `quad`, written as `polygon[triangle]` and `polygon[quad]` 
respectively. They'd all have the same type index and the same physical storage 
(probably some list of coordinate structs), but a function can then be written 
to only accept `polygon[triangle]`, and a function that accepts polygons may or 
may not also accept `polygon[quad]`s and `polygon[triangle]`s (this depends on 
how the variation is defined in the YAML files). Note that variations can also 
be defined for Substrait's built-in types. Ultimately, a Substrait type 
consists of its physical base type (a simple type, compound type, or 
user-defined type), its nullability, a variation index (using 0 for "no 
variation"), and (for compound types and hopefully at some point also 
user-defined types) 
 a set of template parameters.
   
   Unless we really want to round-trip Substrait plans or validate them 
completely (having spent months on just a validator that still can't do 
everything, I don't recommend it), we could just ignore variations entirely. 
Even if someone else defined a variation in an extension we don't know about, 
it would have no bearing on how the plan is executed, since we only need to 
know what the physical representation of a type is. It's more of a producer or 
optimizer thing, to restrict which functions are available.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jvanstraten commented on a diff in pull request #12852: ARROW-15583: [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors

Reply via email to