felipecrv commented on code in PR #38027:
URL: https://github.com/apache/arrow/pull/38027#discussion_r1394318942


##########
cpp/src/arrow/array/data.h:
##########
@@ -434,6 +434,21 @@ struct ARROW_EXPORT ArraySpan {
     return GetValues<T>(i, this->offset);
   }
 
+  // Access a buffer's data as a span
+  template <typename T>
+  util::span<const T> GetSpan(int i) const {
+    return util::span(buffers[i].data_as<T>(),
+                      static_cast<size_t>(buffers[i].size) / sizeof(T))
+        .subspan(offset);
+  }
+
+  template <typename T>
+  util::span<T> GetSpan(int i) {
+    return util::span(buffers[i].mutable_data_as<T>(),
+                      static_cast<size_t>(buffers[i].size) / sizeof(T))
+        .subspan(offset);

Review Comment:
   I think the value-add comes from a reduced risk of code that doesn't account 
for the fact that the returned `span` has (sometimes, very rarely) length 
greater than the array length and even-worse: the fact that the data after the 
end of the length could be non-sensical (e.g. zeroes at the end of an offsets 
buffer).
   
   We risk having compute kernel code that processes these spans without issue 
until one that's greater than the array length comes up. Taking the length as 
parameter, drastically removes this branch in the program graph: the span 
always has the array length instead of possibly also having a greater length 
(in rare cases!).
   
   Regarding this:
   
   ```diff
   -data.GetSpan<int32_t>(1).subspan(0, 50);
   +data.GetSpan<int32_t>(1, /*length=*/50);
   ```
   
   For what I'm proposing these are not equivalent. I want to assert when the 
passed length is greater than the available buffer size instead of truncating 
like `subspan` does. I would also consider debug-build asserts that take into 
account the array type and buffer index -- something that `span::subspan()` 
can't do.
   
   ```cpp
     constexpr span subspan(size_t offset) const {
       if (offset > size_) return {data_, data_};
       return {data_ + offset, size_ - offset};
     }
   ```
   
   Regarding this suggestion:
   
   ```diff
   -data.GetSpan<int32_t>(1).subspan(0, 50);
   +data.GetSpan<int32_t>(1, /*one_extra=*/true); // offsets buffer
   ```
   
   I think this would be a very ergonomic API that could in turn be defined in 
terms of the `GetSpan` implementation that takes a `length`.
   
   > it'll never be correct to use these on a null bitmap
   
   Definitely. We should assert (or at least document) that `i > 0`.
   
   A final observation: we already have functions that return the base pointer 
of the buffer (shifted by offset), if we are going to provide span-returning 
functions, we better make the length of the span maximally useful since that's 
exactly what spans add compared to raw pointers -- a length.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to