jvanstraten commented on code in PR #12852:
URL: https://github.com/apache/arrow/pull/12852#discussion_r853049018
##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -158,14 +178,14 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
/// An extension set should instead be created using
/// arrow::engine::GetExtensionSetFromPlan
static Result<ExtensionSet> Make(
- std::vector<util::string_view> uris, std::vector<Id> type_ids,
+ std::unordered_map<uint32_t, util::string_view> uris, std::vector<Id>
type_ids,
std::vector<bool> type_is_variation, std::vector<Id> function_ids,
ExtensionIdRegistry* = default_extension_id_registry());
// index in these vectors == value of _anchor/_reference fields
/// TODO(ARROW-15583) this assumes that _anchor/_references won't be huge,
which is not
/// guaranteed. Could it be?
Review Comment:
This comment is no longer applicable.
##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -226,14 +246,10 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
private:
ExtensionIdRegistry* registry_;
/// The subset of extension registry URIs referenced by this extension set
- std::vector<util::string_view> uris_;
+ std::unordered_map<uint32_t, util::string_view> uris_;
std::vector<TypeRecord> types_;
-
std::vector<FunctionRecord> functions_;
Review Comment:
These should also be converted to `unordered_map`s; in fact, they are more
important to convert, as there will be lots more functions and types than URIs
in a typical plan.
##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element,
std::vector<T>* vector) {
}
(*vector)[i] = static_cast<T>(element);
}
Review Comment:
When the code is reimplemented as it should be, this function should become
dead code (and be removed), as it is what's causing the potentially large
allocations.
##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element,
std::vector<T>* vector) {
}
(*vector)[i] = static_cast<T>(element);
}
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key,
value>* map) {
+ DCHECK_LE(i, 1 << 20);
+ if (i >= map->size()) {
+ map->reserve(i + 1);
+ }
Review Comment:
This is no longer applicable when a map is used. The `reserve()` call is
actually highly counterproductive, since a map with the single key 100000 will
reserve room for 100001 key-value pairs (this is, in fact, the core problem
that needs solving).
Once these lines are removed, I'd argue that the whole function reduces to
nothing and should just be inlined.
##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element,
std::vector<T>* vector) {
}
(*vector)[i] = static_cast<T>(element);
}
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key,
value>* map) {
+ DCHECK_LE(i, 1 << 20);
+ if (i >= map->size()) {
+ map->reserve(i + 1);
+ }
+ (*map)[i] = static_cast<value>(element);
+}
+
} // namespace
Result<ExtensionSet> GetExtensionSetFromPlan(const substrait::Plan& plan,
ExtensionIdRegistry* registry) {
- std::vector<util::string_view> uris;
+ std::unordered_map<uint32_t, util::string_view> uris;
for (const auto& uri : plan.extension_uris()) {
- SetElement(uri.extension_uri_anchor(), uri.uri(), &uris);
+ SetMapElement(uri.extension_uri_anchor(), uri.uri(), &uris);
}
// NOTE: it's acceptable to use views to memory owned by plan;
ExtensionSet::Make
Review Comment:
GitHub won't let me add comments to code this far away from a change, but
these vectors a few lines further down (as well as the equivalent fields of
`ExtensionSet`) should also be replaced with maps:
```c++
std::vector<Id> type_ids, function_ids;
std::vector<bool> type_is_variation;
```
When you change this and everything affected by it, you'll find that
`ExtensionSet::EncodeType` and `ExtensionSet::EncodeFunction` won't work
anymore as written. Functionally, they include code to generate a key that
wasn't used before, which is easy with a vector, but for a map would require
searching for an unused key by iterating over all possible keys until you find
one. It might therefore be better to use an ordered `map`, in which case you
can use something like `map::rbegin()->first + 1` for the typical case (though
you'd still need to implement the empty map and overflow edge cases; in the
former case you should return 1, and in the latter case you'd need to
exhaustively search for a free key anyway). Don't use anchor 0 (the current
implementation probably does); that anchor value is reserved for undefined
references in some contexts. If you really want to use an `unordered_map`
instead of a `map`, you can also keep track of the latest key that was added to
the map and use
it as a start position for looking for free anchors; that'd still be pretty
efficient in the typical case, but requires a bit more bookkeeping.
Furthermore, as I pointed out in my comment to the JIRA, using a single
anchor map for both types and type variations, and then using
`type_is_variation` to distinguish between types and type variation, was a
misinterpretation of the Substrait specification as far as I'm concerned: it's
perfectly valid (or at least, unambiguous) for a plan to define a type with
anchor 1 as well as a completely unrelated type variation with anchor 1. To be
clear, a correct, complete implementation should look something like this:
```c++
std::[unordered_]map<uint32_t, Id> type_ids, type_variation_ids,
function_ids;
```
That's a lot more involved though, as the assumption that a user-defined
type or type variation is uniquely identified by a single number is pretty
deeply ingrained in the round-tripping code. However, since type variations are
not (correctly) implemented anyway, it's probably better for now to just remove
the broken and incomplete code related to type variations for now, so it'd just
become
```c++
std::[unordered_]map<uint32_t, Id> type_ids, function_ids;
```
On the consumer end, a nonzero type variation reference [already yields an
error](https://github.com/sanjibansg/arrow/blob/b6b445c16f7829a26b91ab447e36d8cddbc32ce3/cpp/src/arrow/engine/substrait/type_internal.cc#L43)
(this is why anchor 0 should not be used, by the way). The type variation
anchor definitions in the `kExtensionTypeVariation` case in the loop below
could just be silently ignored (which is what I would personally do), but it'd
be more in line with the rest of the implementation to emit an error when
encountering those too, as the implementation aims to error out on anything
that wouldn't round-trip perfectly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]