jorgecarleitao commented on pull request #9329: URL: https://github.com/apache/arrow/pull/9329#issuecomment-770440474
> How cheap cloning of the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include > > * the datatype enum, which can include boxes/vecs of nested types and the field metadata > > * the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>` > > > So for primitive types, removing the arc should be a benefit because of less indirection when accessing it, it would still need to allocates vectors and clone the Arc which is inside the Buffer. > > For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection. I am sorry, I am not sure I follow the arguments: * `DataType`: all our operations already clone `DataType`, since they create a new `ArrayData`. Even something as simple as slicing an array requires cloning a `DataType`, as we need to increase the `offset`, which requires a new `ArrayData`, which requires a new `DataType`. * `Arc<Bytes>`: I do not see how using `Arc<Bytes>` is relevant to why we should use `Arc<ArrayData>`. Regardless, AFAI understand, its usage is really important: `Bytes` is an immutable memory region, and sharing it between arrays imo makes a lot of sense: the most used case is when the null buffer does not change due to an operation. So, the two natural options are `Rc` and an `Arc`, and allowing Arrays to be sharable across thread boundaries seems like a good justification for `Arc`. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
