jvanstraten commented on code in PR #12852:
URL: https://github.com/apache/arrow/pull/12852#discussion_r856457349
##########
cpp/src/arrow/engine/substrait/plan_internal.cc:
##########
@@ -108,13 +108,23 @@ void SetElement(size_t i, const Element& element,
std::vector<T>* vector) {
}
(*vector)[i] = static_cast<T>(element);
}
+
+template <typename Element, typename key, typename value>
+void SetMapElement(key i, const Element& element, std::unordered_map<key,
value>* map) {
+ DCHECK_LE(i, 1 << 20);
+ if (i >= map->size()) {
+ map->reserve(i + 1);
+ }
+ (*map)[i] = static_cast<value>(element);
+}
+
} // namespace
Result<ExtensionSet> GetExtensionSetFromPlan(const substrait::Plan& plan,
ExtensionIdRegistry* registry) {
- std::vector<util::string_view> uris;
+ std::unordered_map<uint32_t, util::string_view> uris;
for (const auto& uri : plan.extension_uris()) {
- SetElement(uri.extension_uri_anchor(), uri.uri(), &uris);
+ SetMapElement(uri.extension_uri_anchor(), uri.uri(), &uris);
}
// NOTE: it's acceptable to use views to memory owned by plan;
ExtensionSet::Make
Review Comment:
> I don't think an `ordered_map` is needed because we can come up with
unique ids by just looking at the size of the map (if 0 is not valid then we
can add `1`.)
That doesn't work, because there's no requirement for the indices to be
sequential for incoming plans. For example, if the indices 1 and 3 are defined,
and a new type is added on the production side, it would erroneously try to
reuse map_size + 1 = 3 by this logic. The whole reason why we want to use maps
at all is because they don't have to be sequential, otherwise it'd be much
cheaper to just use vectors. You could use it as a cheap starting point when
looking for unused anchors though; if they *are* sequential it would
immediately yield a valid index.
> until it's clear what they actually represent.
They're like a poor-man's version of polymorphism, and a way to attach
logical type information that doesn't impact the physical representation. For
example, one might define a custom type `polygon`, with type variations
`triangle` and `quad`, written as `polygon[triangle]` and `polygon[quad]`
respectively. They'd all have the same type index and the same physical storage
(probably some list of coordinate structs), but a function can then be written
to only accept `polygon[triangle]`, and a function that accepts polygons may or
may not also accept `polygon[quad]`s and `polygon[triangle]`s (this depends on
how the variation is defined in the YAML files). Note that variations can also
be defined for Substrait's built-in types. Ultimately, a Substrait type
consists of its physical base type (a simple type, compound type, or
user-defined type), its nullability, a variation index (using 0 for "no
variation"), and (for compound types and hopefully at some point also
user-defined types)
a set of template parameters.
Unless we really want to round-trip Substrait plans or validate them
completely (having spent months on just a validator that still can't do
everything, I don't recommend it), we could just ignore variations entirely.
Even if someone else defined a variation in an extension we don't know about,
it would have no bearing on how the plan is executed, since we only need to
know what the physical representation of a type is. It's more of a producer or
optimizer thing, to restrict which functions are available.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]