felipecrv commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1170157451
##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off,
int64_t len) const {
} else {
copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
}
+ for (auto& child : copy->child_data)
+ child = child->Slice(copy->offset, copy->length);
Review Comment:
@Light-City *avoiding materialization* is a common theme in query execution.
Databases often work on data that is larger than memory, or on a dataset that
already takes up all available memory by itself. Allocating more memory to
produce intermediate results is a no-no, and the Arrow design honors this
tradition. That's why slicing in Arrow can seem a bit counterintuitive to people
used to the array functions of common programming languages.
For instance, in JavaScript, `Array.prototype.slice(begin, end)` creates a
new array. In database-speak, we say that "arr.slice() materializes the
slice". An alternative implementation could avoid the allocation by returning
an object that holds a reference to the original array plus the bounds of the
slice. That would require every function you normally use to work with arrays
to be aware of the slice boundaries. It would complicate JavaScript too much,
but for Arrow that is exactly the intended design. Every compute kernel in Arrow
has to be aware of the offset and length, so that slices rarely need to be
materialized and memory consumption stays low.
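To make the idea concrete, here is a minimal sketch of a non-materializing
slice in plain C++. It is not Arrow's implementation: `Int64View`, `MakeView`,
and `Sum` are hypothetical names made up for illustration, and the only thing
borrowed from Arrow is the idea of carrying an `offset` and a `length` next to
shared storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an array: shared storage plus a window into it.
struct Int64View {
  std::shared_ptr<const std::vector<int64_t>> values;  // shared, never copied
  int64_t offset;  // index of the first visible element in `values`
  int64_t length;  // number of visible elements

  // O(1): only the window changes, no element is copied.
  // `off` is relative to this view, so slices of slices compose.
  Int64View Slice(int64_t off, int64_t len) const {
    assert(off >= 0 && len >= 0 && off + len <= length);
    return Int64View{values, offset + off, len};
  }
};

// Wraps a vector in a view that covers all of it.
Int64View MakeView(std::vector<int64_t> data) {
  auto storage = std::make_shared<const std::vector<int64_t>>(std::move(data));
  const auto n = static_cast<int64_t>(storage->size());
  return Int64View{storage, 0, n};
}

// A toy "kernel": it has to honor offset and length itself, which is the
// price paid for not materializing the slice.
int64_t Sum(const Int64View& view) {
  int64_t total = 0;
  for (int64_t i = 0; i < view.length; ++i) {
    total += (*view.values)[static_cast<std::size_t>(view.offset + i)];
  }
  return total;
}
```

With this sketch, `MakeView({1, 2, 3, 4, 5, 6}).Slice(2, 3)` shares the
original storage and `Sum` only visits the three elements inside the window.
A kernel that ignored `offset` would silently read the wrong elements, which
is why every kernel has to be written with the window in mind.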