GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/10917
[SPARK-12888][SQL][follow-up] benchmark the new hash expression
Adds the benchmark results as comments.
The codegen version is slower than the interpreted version in the `simple`
case for three reasons:
1. The codegen version uses a more complex hash algorithm than the interpreted
version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and
addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. The codegen version writes the hash value to a row first and then reads
it back out. I tried creating a `GenerateHasher` that generates code to return
the hash value directly and got about a 60% speedup for the `simple` case. Is
it worth adding?
3. The row in the `simple` case has only one int field, so the cost of runtime
reflection may be hidden by branch prediction, which makes the
interpreted version faster.
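
To make reason 1 concrete, here is a minimal sketch of the per-int work in each approach. The Murmur3 routine follows the standard public x86_32 algorithm for a single 4-byte input; the class name, the method names, and the multiplier `37` in the simple hash are illustrative assumptions, not code copied from Spark.

```java
// Illustrative sketch only: names and the multiplier 37 are assumptions.
final class HashComparisonSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    // Standard Murmur3 x86_32 applied to one 4-byte int:
    // two multiplies and a rotate to mix the key, a rotate and
    // multiply-add to mix the state, then a five-step finalizer.
    public static int murmur3HashInt(int input, int seed) {
        int k1 = input * C1;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= C2;

        int h1 = seed ^ k1;
        h1 = Integer.rotateLeft(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;

        // Finalization; 4 is the input length in bytes.
        h1 ^= 4;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Multiply-and-add style update: one multiply and one add per field.
    public static int simpleHashStep(int result, int fieldValue) {
        return 37 * result + fieldValue;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(murmur3HashInt(0, 0)));
        System.out.println(simpleHashStep(37, 5));
    }
}
```

The Murmur3 path does several multiplies, rotates, and xor-shifts per int, while the simple update does a single multiply and add, so some slowdown is expected in exchange for better hash quality.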
The `array` case is also slow for similar reasons, e.g. the array elements are
all of the same type, so the interpreted version can probably avoid the cost of
runtime reflection through branch prediction.
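
As a rough illustration of the branch-prediction argument, the sketch below shows a hypothetical interpreted-style hasher that checks each value's runtime type before hashing it. The class, its methods, and the constant `37` are assumptions made for this example and are not Spark's actual implementation.

```java
// Hypothetical interpreted-style row hashing; not Spark's actual code.
final class InterpretedHashSketch {
    // Dispatch on the runtime type of each field. When every value has the
    // same type (one int field, or an array of ints), the same branch is
    // taken on every call, which a CPU branch predictor handles well.
    public static int hashField(Object value) {
        if (value == null) {
            return 0;
        }
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof Long) {
            long l = (Long) value;
            return (int) (l ^ (l >>> 32));
        }
        return value.hashCode();
    }

    // Fold the per-field hashes with a multiply-and-add update.
    public static int hashRow(Object[] fields) {
        int result = 37;
        for (Object field : fields) {
            result = 37 * result + hashField(field);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(hashRow(new Object[] {5}));
        System.out.println(hashRow(new Object[] {1, 2, 3, 4}));
    }
}
```

With mixed field types the `instanceof` chain would be less predictable, so a benchmark over rows with heterogeneous columns may show the interpreted path in a less favorable light.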
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark hash-benchmark
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10917.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10917
----
commit 8207dc109f21527438cbd80894e9b49d63159f12
Author: Wenchen Fan <[email protected]>
Date: 2016-01-26T02:24:38Z
add benchmark results
----