jorgecarleitao edited a comment on pull request #8116:
URL: https://github.com/apache/arrow/pull/8116#issuecomment-688034629


   @paddyhoran , thanks a lot for the feedback, it really helps to understand a 
bit better the context, as I am not aware of those practices.
   
   I do think that both builders and `from` use `memory.rs`. For primitive 
types, 
   
   * the builder uses a `BufferBuilder` and `BooleanBufferBuilder`, which in 
turn uses `memory` 
[here](https://github.com/apache/arrow/blob/master/rust/arrow/src/array/builder.rs#L463)
   * the `from` uses `Buffer`, which uses `memory` 
[here](https://github.com/apache/arrow/blob/master/rust/arrow/src/array/array.rs#L737)
   
   There are two main differences between the two:
   
   * A) `from` requires an intermediary allocation (to vec), while the builder 
does not (uses a MutableBuffer).
   * B) `builder` assess whether it needs to change capacity on every item, 
while `from` does not.
   
   Specifically, `from` uses `Buffer::from(data.to_byte_slice())`, while the 
`builder` uses `for datum in data builder.append(datum); builder.finish()`. 
Therefore, `builders` perform checks on its capacity to ensure that each 
`append` does not require a `resize` (see impl of `append`, 
`BufferBuilderTrait<T> for BufferBuilder`, that has a `reserve(1)?` on it, and 
`advance`, used in nulls, that has a `resize` on it).
   
   My hypothesis for this performance change is that B) is more CPU intensive 
than A).
   
   One solution is to offer an API to the `Builder`s to not perform those 
checks when the user already set the capacity. This would address `B)` and 
would avoid an intermediary allocation.
   
   In this direction, one idea is to support appending from an 
`ExactSizeIterator` (e.g. `append_iter`) and use `size_hint` to not perform 
bound checks on every item.
   
   -----------------------
   
   A common operation we have in the readers and compute is
   
   ```
   rows: &[Value];
   
   for row in rows {
       match some_operation(row) {
            Some(item) => builder.append(item)
            None => builder.append_null()
       }
   }
   ```
   
   which is a basically an `ExactSizeIterator`. However, in the example above, 
the `builder` has no way of knowing that the new append won't go beyond the 
reserved size (and thus it checks).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to