jhorstmann commented on pull request #521:
URL: https://github.com/apache/arrow-rs/pull/521#issuecomment-922822627


   For some background: the query engine I'm working on keeps data in memory 
in large Arrow arrays. Processing happens in batches that are created as 
zero-copy slices of those arrays, so the first batch starts at offset 0, the 
second at, say, offset 4096, and so on. This usually makes us the first to 
notice any offset-related issues :) For this kernel it would mean that 
calculating null_if on a large input becomes quadratic instead of linear, 
although probably with a relatively small constant factor.
   
   To create the result array with an offset of 0, the buffers or child arrays 
would have to be sliced. How depends on the datatype, for example:

```rust
Boolean      => buffer[0].bit_slice(offset, len), // potentially copies data
Primitive<T> => buffer[0].slice(offset * size_of::<T>()), // Buffer::slice uses byte offsets
Struct       => ..., // needs to slice all child arrays
List         => ..., // should slice the offsets buffer, but creation of ListArray
                     // currently validates that the first offset is 0, so it
                     // actually needs to calculate a new offsets buffer
```
   
   The implementation I have works for the subset of datatypes I'm using, but 
I think an implementation inside arrow should support all datatypes, even if 
that carries a small performance penalty. Longer term, I think moving the 
offset down into the buffers would be the better general solution and would 
simplify a lot of kernels. I believe arrow2 uses that approach successfully.

