bjchambers commented on pull request #521:
URL: https://github.com/apache/arrow-rs/pull/521#issuecomment-928168017


   > For some background, the query engine I'm working with keeps data in 
memory in big arrow arrays. Processing of the data happens in batches, that are 
created as zero-copy slices of those arrays. So the first batch would start 
with arrays with offset 0, second for example offset 4096 and continuing in the 
same manner. This leads to us usually being the first to notice any offset 
related issues :) For this kernel this would mean that calculating null_if on 
some large input would become quadratic instead of linear, although probably 
with a relatively small constant factor.
   > 
   > To create the result array with an offset of 0, the buffers or child 
arrays would have to be sliced. This depends on the datatype, for example:
   > 
   > Boolean => buffer[0].bit_slice(offset, len) // potentially copies data
   > Primitive => buffer[0].slice(offset * size_of::()) // buffer.slice uses 
byte offsets
   > Struct => ... // needs to slice all child arrays
   > List => ... // should slice the offset buffer, but creation of ListArray 
currently validates that the first offset is 0, so actually needs to calculate 
a new offset buffer
   > 
   > The implementation I have works with the subset of datatypes I'm using, I 
think an implementation inside arrow should better support all datatypes even 
if it has a small performance penalty. Longer-term I think moving the offset 
down into the buffers would be the better general solution that would simplify 
a lot of kernels. I think arrow2 is using that approach successfully.
   
   Note that I'm seeing problems with slices. For instance, the following issue 
is causing this implementation to panic depending on the slice offset -- 
https://github.com/apache/arrow-rs/issues/807. If your implementation is less 
subject to those problems, that may be a good option moving forward. But, it 
looks like you may hit the same problem since your boolean case uses 
`bit_slice`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to