felipecrv commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1170157451


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   @Light-City *avoiding materialization* is a common theme in query execution. 
Databases often work on data that is larger than memory, or on datasets that 
already take up all the available memory. Allocating more memory to produce 
intermediate results is a no-no, and the Arrow design honors this tradition. 
That's why it can seem a bit counterintuitive to people used to the array 
functions of common programming languages.
   
   For instance, in JavaScript, `Array.prototype.slice(begin, end)` creates a 
new array. In database-speak, this is worded as "arr.slice() materializes the 
slice". An alternative implementation could return an object holding a 
reference to the original array and the bounds of the slice, which avoids the 
memory allocation. That would require every function you normally use to work 
with arrays to be aware of the slice boundaries. It would complicate JavaScript 
too much, but for Arrow that is exactly the intended design. Every compute 
kernel in Arrow has to be aware of the offset and length to minimize 
materialization, thus minimizing memory consumption.
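
   To make the "reference plus bounds" idea concrete, here is a minimal sketch 
in C++. It deliberately does not use the real Arrow API: `ToyArray`, 
`MakeToyArray` and `Sum` are hypothetical names invented for this illustration, 
and the single shared `std::vector` only stands in for Arrow's buffers. It 
shows a slice that shares the underlying storage and a kernel that honors the 
offset and length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow-like array; NOT the real arrow::ArrayData.
struct ToyArray {
  std::shared_ptr<const std::vector<int64_t>> buffer;  // shared, never copied
  int64_t offset;  // index of the first logical element inside `buffer`
  int64_t length;  // number of logical elements

  // Zero-copy slice: only the bounds change, the buffer is shared.
  ToyArray Slice(int64_t off, int64_t len) const {
    assert(off >= 0 && len >= 0 && off + len <= length);
    return ToyArray{buffer, offset + off, len};
  }
};

// Hypothetical helper that wraps a vector of values into a ToyArray.
inline ToyArray MakeToyArray(std::vector<int64_t> values) {
  const auto n = static_cast<int64_t>(values.size());
  return ToyArray{
      std::make_shared<const std::vector<int64_t>>(std::move(values)), 0, n};
}

// A toy "kernel": it must honor offset and length instead of assuming that
// the logical data starts at index 0 of the buffer.
inline int64_t Sum(const ToyArray& arr) {
  int64_t total = 0;
  for (int64_t i = 0; i < arr.length; ++i) {
    total += (*arr.buffer)[static_cast<std::size_t>(arr.offset + i)];
  }
  return total;
}
```

   In this sketch, slicing a slice just adds the offsets together, and no 
element is ever copied. The cost is the one described above: every function 
that reads the data has to apply `offset` and `length` itself.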



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

