[ 
https://issues.apache.org/jira/browse/IGNITE-14769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450415#comment-17450415
 ] 

Konstantin Orlov commented on IGNITE-14769:
-------------------------------------------

I've checked different version of RowAssembler with different hash algorithms:

1) Current implementation where {{hash = 31 * prev_hash + 
hash_of_current_field}}
2) Implementation of RowAssembler with MurMur3 hash over the whole key chunk 
(copy-pasted [2])
3) Implementation of RowAssembler with FastHash hash over the whole key chunk 
(ported cpp code from [4])
4) Implementation of RowAssembler with xxHash hash over the whole key chunk 
({{{}org.lz4:lz4-java:1.8.0{}}} dependency were used)
 
The results are below:
h4. Count of collisions

Row count: 1 000 000; field count: 3
|| ||Current||MurMur3||FastHash||XXHash||
||Row(0..99, 0..99, 0..99)|901692|118|114|125|
||Row(rnd(10^6), rnd(10^6), rnd(10^6))|475|114|110|109|
||Row(0..99, 0..99, rnd(UUID))|107|133|128|147|
h4. Throughput of new RowAssembler().<appends>.build()

Both key and value have _fieldsCount_ integer fields.
||Benchmark||(fieldsCount)||Mode||Cnt||Score||Error||Units||
|RowAssemblerBenchmarkTest.rowAssembler|1|thrpt|10|14736.929|± 487.418|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|4|thrpt|10|8076.515|± 121.236|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|10|thrpt|10|4048.722|± 122.058|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|100|thrpt|10|500.524|± 15.781|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|1|thrpt|10|13200.977|± 
171.259|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|4|thrpt|10|7199.640|± 
137.080|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|10|thrpt|10|4106.971|± 
232.045|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|100|thrpt|10|464.724|± 
14.365|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|1|thrpt|10|12138.946|± 
477.850|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|4|thrpt|10|6420.024|± 
218.321|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|10|thrpt|10|3520.215|± 
240.430|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|100|thrpt|10|418.010|± 
65.750|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|1|thrpt|10|12429.161|± 
940.685|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|4|thrpt|10|6929.653|± 
704.260|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|10|thrpt|10|3552.725|± 
288.847|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|100|thrpt|10|501.660|± 
6.425|ops/ms|

Current implementation suffers from poor hash distribution in case of 
sequentially increasing fields of the key, whereas hashing over the whole chunk 
could offer a pretty good distribution for all cases at cost of slightly lower 
performance. So I would propose to use either fast_hash or murmur3 since it 
would be easier to bring them to our codebase. FastHash seems to me a bit 
better option.

[~amashenkov], WDYT?

> Key hash calculation.
> ---------------------
>
>                 Key: IGNITE-14769
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14769
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Andrey Mashenkov
>            Assignee: Konstantin Orlov
>            Priority: Major
>              Labels: iep-54, ignite-3
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> There are next possible ways for cache calculation.
>  # Update hash on every write method call as it works for now.
>  # Calculate for all key chunk (hash of byte[]) - all columns including a 
> null-map.
> Let's choose and implement the best way and along with a better hash function,
>  e.g. xxHash64 [1], Murmur3 [2]released in Apache Commons, CityHash (from 
> Google) [3], FastHash32 [4].
>  
> [1][https://github.com/Cyan4973/xxHash/]
> [2]https://commons.apache.org/proper/commons-codec/jacoco/org.apache.commons.codec.digest/MurmurHash3.java.html
> [3] [https://github.com/google/cityhash]
> [4] https://github.com/rurban/smhasher/blob/master/fasthash.cpp



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to