nevi-me commented on a change in pull request #7365:
URL: https://github.com/apache/arrow/pull/7365#discussion_r436354577
##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -577,6 +632,81 @@ where
self
}
+ /// Appends data from other arrays into the builder
+ ///
+ /// This is most useful when concatenating arrays of the same type into a
builder.
+ fn append_data(&mut self, data: &[ArrayDataRef]) -> Result<()> {
+ if !check_array_data_type(&self.data_type(), data) {
+ return Err(ArrowError::InvalidArgumentError(
+ "Cannot append data to builder if data types are
different".to_string(),
+ ));
+ }
+ // determine the latest offset on the builder
+ let mut cum_offset = if self.offsets_builder.len() == 0 {
+ 0
+ } else {
+ // peek into buffer to get last appended offset
+ let buffer = self.offsets_builder.buffer.data();
+ let len = self.offsets_builder.len();
+ let (start, end) = ((len - 1) * 4, len * 4);
+ let slice = &buffer[start..end];
+ i32::from_le_bytes(slice.try_into().unwrap())
+ };
+ for array in data {
+ if array.child_data().len() != 1 {
Review comment:
> But I do think we should make sure error handling behavior is
consistent, i.e. invalid input should not lead to partial append to the array.
If we are to use ArrayDataRef as input type, then I feel like we do need to
have custom validation logic for each array type as you mentioned.
Thanks, I agree. Changes made in my latest commit.
> The idea of using ArrayRef as input type is interesting. It does look like
a simpler interface for end users and could simplify the error handling logic.
The preferred way to convert `&[ArrayRef]` to `ArrayRef` will be the
`concat` kernel that you've added. concat might then forego its current
validation, and potentially be like:
```rust
pub fn concat(array_list: &[ArrayRef]) -> Result<ArrayRef> {
// get data type from first element
// create builder for data type (this'll have to cater for structs and
lists)
// pass `ArrayDataRef`to builder
// finish builder and return `ArrayRef`
}
```
> What's the downside of using ArrayRef here compared to ArrayDataRef?
`ArrayDataRef` is more flexible. If someone is creating Arrow data from raw
data, there currently isn't much flexibility for them, especially when working
with nested data structures. It might be more convenient to then create
`ArrayData` instead of going all the way to create an array only to append it
to a builder.
Constructing an `ArrayRef` to append is an extra step and at worst requires
going through `arrow::utils::make_array(data: ArrayRef)`.
The upside of `ArrayRef` is skipping the validation checks, though I wonder
what cost the checks result in. We can wait for other reviewers' opinions on
their necessity.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]