GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/11209
[SPARK-13325][SQL] Create a 64-bit hashcode expression [POC]
This PR is POC for a 64-bit hashcode expression. Such an expression is
especially usefull for HyperLogLog++ and other probabilistic datastructures.
I have implemented xxHash64 which is a 64-bit hashing algorithm created by
Yann Colet and Mathias Westerdahl. This is a high speed (C implementation runs
at memory bandwidth) and high quality hashcode. It exploits both Instruction
Level Parralellism (for speed) and the multiplication and rotation techniques
(for quality) like MurMurHash does.
The initial results are promising. I have added a CG'ed test to the
`HashBenchmark`, and this results in the following results (running from SBT):
Running benchmark: Hash For simple
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Hash For simple: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1011 / 1016 132.8
7.5 1.0X
codegen version 1864 / 1869 72.0
13.9 0.5X
codegen version 64-bit 1614 / 1644 83.2
12.0 0.6X
Running benchmark: Hash For normal
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Hash For normal: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 2467 / 2475 0.9
1176.1 1.0X
codegen version 2008 / 2115 1.0
957.5 1.2X
codegen version 64-bit 728 / 758 2.9
347.0 3.4X
Running benchmark: Hash For array
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Hash For array: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1544 / 1707 0.1
11779.6 1.0X
codegen version 2728 / 2745 0.0
20815.5 0.6X
codegen version 64-bit 2508 / 2549 0.1
19132.8 0.6X
Running benchmark: Hash For map
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
Hash For map: Best/Avg Time(ms) Rate(M/s)
Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1819 / 1826 0.0
444014.3 1.0X
codegen version 183 / 194 0.0
44642.9 9.9X
codegen version 64-bit 173 / 174 0.0
42120.9 10.5X
This shows that algorithm is consistently faster than MurMurHash32 in all
cases and up to 3x (!) in the normal case.
I have also hacked the Hashcode into HyperLogLog++ and it cuts the
processing time of the following code in half:
val df = sqlContext.range(1<<25).agg(approxCountDistinct("id"))
df.explain()
val t = System.nanoTime()
df.show()
val ns = System.nanoTime() - t
// Before
ns: Long = 5821524302
// After
ns: Long = 2836418963
This is a POC, it should pass test, however the expression code (copied)
and the HyperLogLog++ code are a bit hacky.
cc @cloud-fan (you have been working on hashcodes) / @rxin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark xxHash
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11209.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11209
----
commit c792094a3482efc032a03369780f99d3b4a86055
Author: Herman van Hovell <[email protected]>
Date: 2016-01-31T11:54:15Z
Add xxHash64 hasher.
commit 5987f8fcef9badcd2135e2b92d90fc21837bca2c
Author: Herman van Hovell <[email protected]>
Date: 2016-02-15T19:41:45Z
Merge remote-tracking branch 'apache-github/master' into xxHash
commit f14c50cabd14774cd39aa4e07b6233591f5187b4
Author: Herman van Hovell <[email protected]>
Date: 2016-02-15T21:26:34Z
Create POC
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]