Dandandan commented on pull request #922:
URL: https://github.com/apache/arrow-datafusion/pull/922#issuecomment-917684838


   @novemberkilo thanks for checking this and opening this PR.
   
   Next to the remarks by @alamb, I think there is also some more principal 
change that would be needed to get the branch outside of the loop.
   Currently, the boolean check `if *$has_nulls` is still done per value inside 
checking value equality. Ideally, this should be done entirely outside of the 
loop, as is done in Arrow kernels too. Maybe there would be needed quite some 
invasive changes / experiments needed to accomplish this.
   
   A similar thing that could be tried here as I started (but didn't complete 
yet) in this PR for the hash join
   https://github.com/apache/arrow-datafusion/pull/864 is to vectorize the hash 
collision check: instead of checking the collision for each value individually, 
it's delayed to a later stage, so it can be done in a more simple loop, which 
can be specialized for non-null cases.  The core idea is that the inner loop 
should include as little branches / downcasting / etc. as possible.
   
   Some other reasons might be that this currently is not a really hot part of 
the code. This could be checked by running some profiling tools (e.g. `perf` / 
`valgrind`) which can give hints where the hot code lives. A micro benchmark 
could help too to more accurately measure performance changes in this part  of 
the code (instead of the performance of executing a full query).
   
   I think some other sources of overhead might live in that currently one of 
the two values is a `ScalarValue` (instead of living in a contiguous array), 
which is potentially boxed / doesn't have good locality and can not be 
specialized for having no nulls. Also the `Array` is downcasted in the inner 
loop based on the datatype.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to