[
https://issues.apache.org/jira/browse/IGNITE-14769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450415#comment-17450415
]
Konstantin Orlov commented on IGNITE-14769:
-------------------------------------------
I've checked different version of RowAssembler with different hash algorithms:
1) Current implementation where {{hash = 31 * prev_hash +
hash_of_current_field}}
2) Implementation of RowAssembler with MurMur3 hash over the whole key chunk
(copy-pasted [2])
3) Implementation of RowAssembler with FastHash hash over the whole key chunk
(ported cpp code from [4])
4) Implementation of RowAssembler with xxHash hash over the whole key chunk
({{{}org.lz4:lz4-java:1.8.0{}}} dependency were used)
The results are below:
h4. Count of collisions
Row count: 1 000 000; field count: 3
|| ||Current||MurMur3||FastHash||XXHash||
||Row(0..99, 0..99, 0..99)|901692|118|114|125|
||Row(rnd(10^6), rnd(10^6), rnd(10^6))|475|114|110|109|
||Row(0..99, 0..99, rnd(UUID))|107|133|128|147|
h4. Throughput of new RowAssembler().<appends>.build()
Both key and value have _fieldsCount_ integer fields.
||Benchmark||(fieldsCount)||Mode||Cnt||Score||Error||Units||
|RowAssemblerBenchmarkTest.rowAssembler|1|thrpt|10|14736.929|± 487.418|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|4|thrpt|10|8076.515|± 121.236|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|10|thrpt|10|4048.722|± 122.058|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler|100|thrpt|10|500.524|± 15.781|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|1|thrpt|10|13200.977|±
171.259|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|4|thrpt|10|7199.640|±
137.080|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|10|thrpt|10|4106.971|±
232.045|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_fasthash|100|thrpt|10|464.724|±
14.365|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|1|thrpt|10|12138.946|±
477.850|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|4|thrpt|10|6420.024|±
218.321|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|10|thrpt|10|3520.215|±
240.430|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_murmur3|100|thrpt|10|418.010|±
65.750|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|1|thrpt|10|12429.161|±
940.685|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|4|thrpt|10|6929.653|±
704.260|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|10|thrpt|10|3552.725|±
288.847|ops/ms|
|RowAssemblerBenchmarkTest.rowAssembler_xxhash|100|thrpt|10|501.660|±
6.425|ops/ms|
Current implementation suffers from poor hash distribution in case of
sequentially increasing fields of the key, whereas hashing over the whole chunk
could offer a pretty good distribution for all cases at cost of slightly lower
performance. So I would propose to use either fast_hash or murmur3 since it
would be easier to bring them to our codebase. FastHash seems to me a bit
better option.
[~amashenkov], WDYT?
> Key hash calculation.
> ---------------------
>
> Key: IGNITE-14769
> URL: https://issues.apache.org/jira/browse/IGNITE-14769
> Project: Ignite
> Issue Type: Improvement
> Reporter: Andrey Mashenkov
> Assignee: Konstantin Orlov
> Priority: Major
> Labels: iep-54, ignite-3
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> There are next possible ways for cache calculation.
> # Update hash on every write method call as it works for now.
> # Calculate for all key chunk (hash of byte[]) - all columns including a
> null-map.
> Let's choose and implement the best way and along with a better hash function,
> e.g. xxHash64 [1], Murmur3 [2]released in Apache Commons, CityHash (from
> Google) [3], FastHash32 [4].
>
> [1][https://github.com/Cyan4973/xxHash/]
> [2]https://commons.apache.org/proper/commons-codec/jacoco/org.apache.commons.codec.digest/MurmurHash3.java.html
> [3] [https://github.com/google/cityhash]
> [4] https://github.com/rurban/smhasher/blob/master/fasthash.cpp
--
This message was sent by Atlassian Jira
(v8.20.1#820001)