This is an automated email from the ASF dual-hosted git repository.
zivanfi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 8f1783e PARUQET-1609: Update to use xxHash as hash strategy (#139)
8f1783e is described below
commit 8f1783ec0b273e89c884b46c0f527d0a48321826
Author: Chen, Junjie <[email protected]>
AuthorDate: Thu Jul 4 23:34:02 2019 +0800
PARUQET-1609: Update to use xxHash as hash strategy (#139)
---
BloomFilter.md | 12 ++++++------
src/main/thrift/parquet.thrift | 12 +++++++-----
2 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/BloomFilter.md b/BloomFilter.md
index be27aef..1be752a 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -87,11 +87,11 @@ void Mask(uint32_t key, uint32_t mask[8]) {
```
#### Hash Function
-The function used to hash values in the initial implementation is
MurmurHash3[4], using
-the least-significant 64 bits of the 128-bit version of the function on the
x86-64
-platform. Note that the function produces different values on different
architectures, so
-implementors must be careful to use the version specific to x86-64. That
function can be
-emulated on different platforms without difficulty.
+The function used to hash values in the initial implementation is xxHash[4],
using
+the least-significant 64 bits version of the function on the x86-64 platform.
Note that
+the function produces different values on different architectures, so
implementors must
+be careful to use the version specific to x86-64. That function can be
emulated on
+different platforms without difficulty.
#### Build a Bloom filter
The fact that exactly eight bits are checked during each lookup means that
these filters
@@ -116,5 +116,5 @@ hash function.
1. [Bloom filter introduction at
Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
2. [Cache-, Hash- and Space-Efficient Bloom
Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
3. [A Reliable Randomized Algorithm for the Closest-Pair
Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
-4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
+4. [xxHash](https://cyan4973.github.io/xxHash/)
5. [Network Applications of Bloom Filters: A
Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index d09bb76..231705f 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -569,17 +569,19 @@ union BloomFilterAlgorithm {
/** Block-based Bloom filter. **/
1: SplitBlockAlgorithm BLOCK;
}
-/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the
original SMHasher
- * repo by Austin Appleby.
+
+/** Hash strategy type annotation. xxHash is an extremely fast
non-cryptographic hash
+ * algorithm. It uses 64 bits version of xxHash.
**/
-struct Murmur3Hash {}
+struct XxHash {}
+
/**
* The hash function used in Bloom filter. This function takes the hash of a
column value
* using plain encoding.
**/
union BloomFilterHash {
- /** Murmur3 Hash Strategy. **/
- 1: Murmur3Hash MURMUR3;
+ /** xxHash Strategy. **/
+ 1: XxHash XXHASH;
}
/**
* Bloom filter header is stored at beginning of Bloom filter data of each
column