jorgecarleitao commented on pull request #844: URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895363474
Ok, maybe I am misunderstanding, sorry, it has been a while. If I recall correctly, we need to perform `N x M` comparisons, where `N` is the number of rows in the batch and `M` is the number of distinct items in a group, [around here](https://github.com/apache/arrow-datafusion/pull/808/files#diff-03876812a8bef4074e517600fdcf8e6b49f1ea24df44905d6d806836fd61b2a8R376), roughly represented by `for (row, hash) in batch_hashes.into_iter().enumerate()` and the inner `group_values.iter()....all(op)`. The `array_eq` implementation promotes a non-vectorized approach in which each comparison requires a downcast and some conversions: it needs to check the type (`downcast`), perform two bounds checks (`.is_valid` and `.value`), and it works on non-aligned memory (i.e. the comparisons are not all done at once). The suggestion is to use the comparison kernels instead, which are vectorized: they leverage aligned memory, need no bounds checks, and need no type checking (i.e. no per-item downcast). Sorry, I do not have any code :/, this was just a comment hinting at the opportunity to vectorize the operation.
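To illustrate the contrast described above, here is a minimal sketch in plain Rust. It deliberately does not use the Arrow API: `Int64Column`, `eq_per_item`, and `eq_batch` are hypothetical names invented for this example, and the null handling (null equals null) is an assumption about the grouping semantics. It only shows the shape of the two approaches, namely repeating the downcast and the bounds-checked reads for every item versus downcasting once and comparing the whole column in one pass.

```rust
use std::any::Any;

/// Hypothetical, simplified column: values plus a validity mask.
/// This is a stand-in for illustration, not the Arrow array type.
pub struct Int64Column {
    values: Vec<i64>,
    validity: Vec<bool>,
}

impl Int64Column {
    pub fn new(values: Vec<i64>, validity: Vec<bool>) -> Self {
        assert_eq!(values.len(), validity.len());
        Self { values, validity }
    }

    /// Bounds-checked validity lookup.
    pub fn is_valid(&self, i: usize) -> bool {
        self.validity[i]
    }

    /// Bounds-checked value lookup.
    pub fn value(&self, i: usize) -> i64 {
        self.values[i]
    }
}

/// Per-item approach: every single comparison repeats the type check
/// (downcast) and two bounds-checked reads.
pub fn eq_per_item(col: &dyn Any, len: usize, key: Option<i64>) -> Vec<bool> {
    (0..len)
        .map(|row| {
            let col = col
                .downcast_ref::<Int64Column>()
                .expect("unexpected column type");
            let current = if col.is_valid(row) {
                Some(col.value(row))
            } else {
                None
            };
            current == key
        })
        .collect()
}

/// Batch approach: downcast once, then compare all items in one tight
/// loop over the typed slices, which the compiler is free to vectorize.
pub fn eq_batch(col: &dyn Any, key: Option<i64>) -> Vec<bool> {
    let col = col
        .downcast_ref::<Int64Column>()
        .expect("unexpected column type");
    col.values
        .iter()
        .zip(col.validity.iter())
        .map(|(value, valid)| match key {
            Some(k) => *valid && *value == k,
            None => !*valid, // assumption: a null key matches null items
        })
        .collect()
}
```

Both functions return the same boolean mask for a given column and key; the difference is only in how much work is repeated per item. In the real code the batch variant would correspond to calling a typed comparison kernel once per group value rather than hand-writing the loop.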
