jvanstraten commented on code in PR #12852:
URL: https://github.com/apache/arrow/pull/12852#discussion_r853049018


##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -158,14 +178,14 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
   /// An extension set should instead be created using
   /// arrow::engine::GetExtensionSetFromPlan
   static Result<ExtensionSet> Make(
-      std::vector<util::string_view> uris, std::vector<Id> type_ids,
+      std::unordered_map<uint32_t, util::string_view> uris, std::vector<Id> 
type_ids,
       std::vector<bool> type_is_variation, std::vector<Id> function_ids,
       ExtensionIdRegistry* = default_extension_id_registry());
 
   // index in these vectors == value of _anchor/_reference fields
   /// TODO(ARROW-15583) this assumes that _anchor/_references won't be huge, 
which is not
   /// guaranteed. Could it be?

Review Comment:
   This comment is no longer applicable.



##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -226,14 +246,10 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
  private:
   ExtensionIdRegistry* registry_;
   /// The subset of extension registry URIs referenced by this extension set
-  std::vector<util::string_view> uris_;
+  std::unordered_map<uint32_t, util::string_view> uris_;
   std::vector<TypeRecord> types_;
-
   std::vector<FunctionRecord> functions_;

Review Comment:
   These should also be converted to `unordered_map`s; in fact, they are more 
important to convert, as there will be lots more functions and types than URIs 
in a typical plan.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, 
std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }

Review Comment:
   When the code is reimplemented as it should be, this function should become 
dead code (and be removed), as it is what's causing the potentially large 
allocations.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, 
std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key, 
value>* map) {
+  DCHECK_LE(i, 1 << 20);
+  if (i >= map->size()) {
+    map->reserve(i + 1);
+  }

Review Comment:
   This is no longer applicable when a map is used. The `reserve()` call is 
actually highly counterproductive, since a map with the single key 100000 will 
reserve room for 100001 key-value pairs (this is, in fact, the core problem 
that needs solving).
   
   Once these lines are removed, I'd argue that the whole function reduces to 
nothing and should just be inlined.



##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element, 
std::vector<T>* vector) {
   }
   (*vector)[i] = static_cast<T>(element);
 }
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key, 
value>* map) {
+  DCHECK_LE(i, 1 << 20);
+  if (i >= map->size()) {
+    map->reserve(i + 1);
+  }
+  (*map)[i] = static_cast<value>(element);
+}
+
 }  // namespace
 
 Result<ExtensionSet> GetExtensionSetFromPlan(const substrait::Plan& plan,
                                              ExtensionIdRegistry* registry) {
-  std::vector<util::string_view> uris;
+  std::unordered_map<uint32_t, util::string_view> uris;
   for (const auto& uri : plan.extension_uris()) {
-    SetElement(uri.extension_uri_anchor(), uri.uri(), &uris);
+    SetMapElement(uri.extension_uri_anchor(), uri.uri(), &uris);
   }
 
   // NOTE: it's acceptable to use views to memory owned by plan; 
ExtensionSet::Make

Review Comment:
   GitHub won't let me add comments to code this far away from a change, but 
these vectors a few lines further down (as well as the equivalent fields of 
`ExtensionSet`) should also be replaced with maps:
   
   ```c++
     std::vector<Id> type_ids, function_ids;
     std::vector<bool> type_is_variation;
   ```
   
   When you change this and everything affected by it, you'll find that 
`ExtensionSet::EncodeType` and `ExtensionSet::EncodeFunction` won't work 
anymore as written. Functionally, they include code to generate a key that 
wasn't used before, which is easy with a vector, but for a map would require 
searching for an unused key by iterating over all possible keys until you find 
one. It might therefore be better to use an ordered `map`, in which case you 
can use something like `map::rbegin()->first + 1` for the typical case (though 
you'd still need to implement the empty map and overflow edge cases; in the 
former case you should return 1, and in the latter case you'd need to 
exhaustively search for a free key anyway). Don't use anchor 0 (the current 
implementation probably does); that anchor value is reserved for undefined 
references in some contexts. If you really want to use an `unordered_map` 
instead of a `map`, you can also keep track of the latest key that was added to 
the map and use
  it as a start position for looking for free anchors; that'd still be pretty 
efficient in the typical case, but requires a bit more bookkeeping.
   
   Furthermore, as I pointed out in my comment to the JIRA, using a single 
anchor map for both types and type variations, and then using 
`type_is_variation` to distinguish between types and type variation, was a 
misinterpretation of the Substrait specification as far as I'm concerned: it's 
perfectly valid (or at least, unambiguous) for a plan to define a type with 
anchor 1 as well as a completely unrelated type variation with anchor 1. To be 
clear, a correct, complete implementation should look something like this:
   
   ```c++
     std::[unordered_]map<uint32_t, Id> type_ids, type_variation_ids, 
function_ids;
   ```
   
   That's a lot more involved though, as the assumption that a user-defined 
type or type variation is uniquely identified by a single number is pretty 
deeply ingrained in the round-tripping code. However, since type variations are 
not (correctly) implemented anyway, it's probably better for now to just remove 
the broken and incomplete code related to type variations for now, so it'd just 
become
   
   ```c++
     std::[unordered_]map<uint32_t, Id> type_ids, function_ids;
   ```
   
   On the consumer end, a nonzero type variation reference [already yields an 
error](https://github.com/sanjibansg/arrow/blob/b6b445c16f7829a26b91ab447e36d8cddbc32ce3/cpp/src/arrow/engine/substrait/type_internal.cc#L43)
 (this is why anchor 0 should not be used, by the way). The type variation 
anchor definitions in the `kExtensionTypeVariation` case in the loop below 
could just be silently ignored (which is what I would personally do), but it'd 
be more in line with the rest of the implementation to emit an error when 
encountering those too, as the implementation aims to error out on anything 
that wouldn't round-trip perfectly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to