Tracking this with: https://issues.apache.org/jira/browse/COLLECTIONS-855
On Sat, 18 May 2024 at 08:26, Alex Herbert <alex.d.herb...@gmail.com> wrote:

> Thanks for highlighting this. I did not use the original paper and based
> the implementation on Wikipedia.
>
> I think the issue is that we use i in [0, k); we can correct this by using
> i in [1, k]. The order inside the loop would not change but we would have
> to decrement i to use in the assignment giving:
>
>     x, y := a(δ) MOD m, b(δ) MOD m
>     for i := 1 .. k
>         f[i - 1] := x
>         x := (x + y) MOD m
>         y := (y + i) MOD m
>
> This would fix the issue but has the inelegant extra computation at the
> end of the loop that is unused on the final execution. Note that the
> current implementation has this inelegant extra computation as well. We can
> update to remove this by reducing the loop size by 1 and using the first
> index outside the loop.
>
> When applying this change to the current repo we see 4 failures in the
> EnhancedDoubleHasherTest. These are corrected by changing the array of
> expected indices.
>
> Regards,
>
> Alex
>
> On Sat, 18 May 2024 at 07:23, Juan Manuel Gimeno Illa <jmgim...@gmail.com>
> wrote:
>
>> The code in the implementation of the EnhancedDoubleHasher is based on the
>> wikipedia article https://en.wikipedia.org/wiki/Double_hashing.
>>
>> The code that appears in the article is:
>>
>>     struct key; /// Opaque
>>     /// Use other data types when needed. (Must be unsigned for guaranteed
>>     /// wrapping.)
>>     extern unsigned int h1(struct key const *), h2(struct key const *);
>>
>>     /// Calculate k hash values from two underlying hash functions
>>     /// h1() and h2() using enhanced double hashing. On return,
>>     /// hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
>>     /// Takes advantage of automatic wrapping (modular reduction)
>>     /// of unsigned types in C.
>>     void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
>>     {
>>         unsigned int a = h1(x), b = h2(x), i;
>>
>>         for (i = 0; i < n; i++) {
>>             hashes[i] = a;
>>             a += b; // Add quadratic difference to get cubic
>>             b += i; // Add linear difference to get quadratic
>>                     // i++ adds constant difference to get linear
>>         }
>>     }
>>
>> And it is based on code from the paper "Bloom Filters in Probabilistic
>> Verification" by Peter C. Dillinger and Panagiotis Manolios
>> (https://www.khoury.northeastern.edu/~pete/pub/bloom-filters-verification.pdf).
>>
>> But the code that appears in that paper is:
>>
>>     x, y := a(δ) MOD m, b(δ) MOD m
>>     f[0] := x
>>     for i := 1 .. k-1
>>         x := (x + y) MOD m
>>         y := (y + i) MOD m
>>         f[i] := x
>>
>> which does the assignments to x, y (a, b in the wikipedia article) in a
>> different order.
>>
>> The same algorithm appears in the thesis "Adaptive Approximate State
>> Storage" (http://peterd.org/pcd-diss.pdf):
>>
>>     algorithm "Enhanced Double Hashing"
>>
>>     input x : "the element being added or queried"
>>     input m : "number of bits in bit vector"
>>     input k : "number of indices to compute"
>>     output g[0 .. k - 1] : "array of indices"
>>     uses h1, h2 : "hash functions"
>>
>>     begin
>>         a := h1(x)
>>         b := h2(x)
>>         g[0] := a
>>         for i := 1 .. k - 1
>>         begin
>>             a := (a + b) MOD m
>>             b := (b + i) MOD m
>>             g[i] := a
>>         end
>>     end
>>
>> It's easy to see that the enhanced double hash, which computes
>>
>>     hash[i] = h1(x) + i * h2(x) + (i^3 - i) / 6,
>>
>> generates, with a = h1(x) and b = h2(x), the sequence:
>>
>>     a, a + b, a + 2*b + 1, a + 3*b + 4, a + 4*b + 10, ...
>>
>> And both your implementation and the wikipedia article generate:
>>
>>     a, a + b, a + 2*b, a + 3*b + 1, a + 4*b + 4, ...
>>
>> So the last term in the sum lags one step behind.
>>
>> The solution, I think, is as simple as swapping the assignments of index
>> and inc in your implementation.
>>
>> I suppose the impact of this is nil, but I'm not enough of an expert to
>> judge that.
>>
>> Thanks,
>>
>> Juan Manuel Gimeno