Dandandan commented on pull request #922: URL: https://github.com/apache/arrow-datafusion/pull/922#issuecomment-917684838
@novemberkilo thanks for checking this and opening this PR. Next to the remarks by @alamb, I think there is also some more principal change that would be needed to get the branch outside of the loop. Currently, the boolean check `if *$has_nulls` is still done per value inside checking value equality. Ideally, this should be done entirely outside of the loop, as is done in Arrow kernels too. Maybe there would be needed quite some invasive changes / experiments needed to accomplish this. A similar thing that could be tried here as I started (but didn't complete yet) in this PR for the hash join https://github.com/apache/arrow-datafusion/pull/864 is to vectorize the hash collision check: instead of checking the collision for each value individually, it's delayed to a later stage, so it can be done in a more simple loop, which can be specialized for non-null cases. The core idea is that the inner loop should include as little branches / downcasting / etc. as possible. Some other reasons might be that this currently is not a really hot part of the code. This could be checked by running some profiling tools (e.g. `perf` / `valgrind`) which can give hints where the hot code lives. A micro benchmark could help too to more accurately measure performance changes in this part of the code (instead of the performance of executing a full query). I think some other sources of overhead might live in that currently one of the two values is a `ScalarValue` (instead of living in a contiguous array), which is potentially boxed / doesn't have good locality and can not be specialized for having no nulls. Also the `Array` is downcasted in the inner loop based on the datatype. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
