GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/11209

    [SPARK-13325][SQL] Create a 64-bit hashcode expression [POC]

    This PR is POC for a 64-bit hashcode expression. Such an expression is 
especially usefull for HyperLogLog++ and other probabilistic datastructures.
    
    I have implemented xxHash64 which is a 64-bit hashing algorithm created by 
Yann Colet and Mathias Westerdahl. This is a high speed (C implementation runs 
at memory bandwidth) and high quality hashcode. It exploits both Instruction 
Level Parralellism (for speed) and the multiplication and rotation techniques 
(for quality) like MurMurHash does.
    
    The initial results are promising. I have added a CG'ed test to the 
`HashBenchmark`, and this results in the following results (running from SBT):
    
        Running benchmark: Hash For simple
          Running case: interpreted version
          Running case: codegen version
          Running case: codegen version 64-bit
        
        Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
        Hash For simple:                    Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
        
-------------------------------------------------------------------------------------------
        interpreted version                      1011 / 1016        132.8       
    7.5       1.0X
        codegen version                          1864 / 1869         72.0       
   13.9       0.5X
        codegen version 64-bit                   1614 / 1644         83.2       
   12.0       0.6X
        
        Running benchmark: Hash For normal
          Running case: interpreted version
          Running case: codegen version
          Running case: codegen version 64-bit
        
        Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
        Hash For normal:                    Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
        
-------------------------------------------------------------------------------------------
        interpreted version                      2467 / 2475          0.9       
 1176.1       1.0X
        codegen version                          2008 / 2115          1.0       
  957.5       1.2X
        codegen version 64-bit                    728 /  758          2.9       
  347.0       3.4X
        
        Running benchmark: Hash For array
          Running case: interpreted version
          Running case: codegen version
          Running case: codegen version 64-bit
        
        Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
        Hash For array:                     Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
        
-------------------------------------------------------------------------------------------
        interpreted version                      1544 / 1707          0.1       
11779.6       1.0X
        codegen version                          2728 / 2745          0.0       
20815.5       0.6X
        codegen version 64-bit                   2508 / 2549          0.1       
19132.8       0.6X
        
        Running benchmark: Hash For map
          Running case: interpreted version
          Running case: codegen version
          Running case: codegen version 64-bit
        
        Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
        Hash For map:                       Best/Avg Time(ms)    Rate(M/s)   
Per Row(ns)   Relative
        
-------------------------------------------------------------------------------------------
        interpreted version                      1819 / 1826          0.0      
444014.3       1.0X
        codegen version                           183 /  194          0.0       
44642.9       9.9X
        codegen version 64-bit                    173 /  174          0.0       
42120.9      10.5X
    
    This shows that algorithm is consistently faster than MurMurHash32 in all 
cases and up to 3x (!) in the normal case.
    
    I have also hacked the Hashcode into HyperLogLog++ and it cuts the 
processing time of the following code in half:
    
        val df = sqlContext.range(1<<25).agg(approxCountDistinct("id"))
        df.explain()
        val t = System.nanoTime()
        df.show()
        val ns = System.nanoTime() - t
    
        // Before
        ns: Long = 5821524302
    
        // After
        ns: Long = 2836418963
    
    This is a POC, it should pass test, however the expression code (copied) 
and the HyperLogLog++ code are a bit hacky.
    
    cc @cloud-fan (you have been working on hashcodes) / @rxin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark xxHash

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11209.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11209
    
----
commit c792094a3482efc032a03369780f99d3b4a86055
Author: Herman van Hovell <[email protected]>
Date:   2016-01-31T11:54:15Z

    Add xxHash64 hasher.

commit 5987f8fcef9badcd2135e2b92d90fc21837bca2c
Author: Herman van Hovell <[email protected]>
Date:   2016-02-15T19:41:45Z

    Merge remote-tracking branch 'apache-github/master' into xxHash

commit f14c50cabd14774cd39aa4e07b6233591f5187b4
Author: Herman van Hovell <[email protected]>
Date:   2016-02-15T21:26:34Z

    Create POC

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to