jhorstmann commented on pull request #521: URL: https://github.com/apache/arrow-rs/pull/521#issuecomment-922822627
For some background, the query engine I'm working on keeps data in memory in large Arrow arrays. Processing happens in batches that are created as zero-copy slices of those arrays, so the first batch starts with arrays at offset 0, the second at, for example, offset 4096, and so on. This usually makes us the first to notice any offset-related issues :)

For this kernel it would mean that calculating `null_if` on some large input becomes quadratic instead of linear, although probably with a relatively small constant factor.

To create the result array with an offset of 0, the buffers or child arrays would have to be sliced. How depends on the datatype, for example:

```rust
Boolean      => buffer[0].bit_slice(offset, len)         // potentially copies data
Primitive<T> => buffer[0].slice(offset * size_of::<T>()) // buffer.slice uses byte offsets
Struct       => ... // needs to slice all child arrays
List         => ... // should slice the offset buffer, but creation of ListArray currently
                    // validates that the first offset is 0, so it actually needs to
                    // calculate a new offset buffer
```

The implementation I have works with the subset of datatypes I'm using; an implementation inside arrow should support all datatypes, even if that comes with a small performance penalty.

Longer term, I think moving the offset down into the buffers would be the better general solution and would simplify a lot of kernels. I believe arrow2 is using that approach successfully.
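To illustrate why `Boolean` differs from `Primitive<T>` above, here is a minimal sketch in plain Rust over raw byte buffers (not the actual arrow-rs `Buffer` API; the function names are illustrative). A primitive slice at an element offset is just a byte-range view, whereas bit-packed boolean values at a non-byte-aligned offset cannot be expressed as a byte range and must be copied:

```rust
use std::mem::size_of;

/// Primitive<T>: a zero-copy "slice" is just a byte range starting at
/// `offset * size_of::<T>()`, since buffer offsets are in bytes.
fn primitive_slice<T>(buf: &[u8], offset: usize, len: usize) -> &[u8] {
    let start = offset * size_of::<T>();
    &buf[start..start + len * size_of::<T>()]
}

/// Boolean: values are bit-packed (LSB-first, as in the Arrow format), so a
/// non-byte-aligned offset forces copying the bits into a fresh buffer.
fn bit_slice(buf: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let bit = (buf[(offset + i) / 8] >> ((offset + i) % 8)) & 1;
        if bit == 1 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    // Primitive<i32>: element offset 1, length 2 -> bytes 4..12 of the buffer.
    let values: [i32; 4] = [10, 20, 30, 40];
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
    assert_eq!(primitive_slice::<i32>(&bytes, 1, 2), &bytes[4..12]);

    // Boolean: bits 3..7 of 0b1011_0110 (LSB-first) are 0,1,1,0 -> 0b0110.
    assert_eq!(bit_slice(&[0b1011_0110u8], 3, 4), vec![0b0000_0110]);
    println!("ok");
}
```

This is also why the kernel becomes quadratic without offset support: each batch would have to materialize a fresh offset-0 copy of data the parent array already holds.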
