Tracking this with:

https://issues.apache.org/jira/browse/COLLECTIONS-855

On Sat, 18 May 2024 at 08:26, Alex Herbert <alex.d.herb...@gmail.com> wrote:

> Thanks for highlighting this. I did not use the original paper and based
> the implementation on Wikipedia.
>
> I think the issue is that we use i in [0, k); we can correct this by using
> i in [1, k]. The order of operations inside the loop would not change, but
> we would have to decrement i in the array assignment, giving:
>
> x, y := a(δ) MOD m, b(δ) MOD m
> for i := 1 .. k
>     f[i - 1] := x
>     x := (x + y) MOD m
>     y := (y + i) MOD m
>
> This would fix the issue, but it leaves an inelegant extra computation at
> the end of the loop that is unused on the final iteration. Note that the
> current implementation has this extra computation as well. We can remove it
> by reducing the loop size by 1 and emitting the first index outside the
> loop.
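>
> For example, a minimal C sketch of that shape (my illustration in the
> style of the Wikipedia snippet quoted below, using unsigned wrap-around
> in place of MOD m; it is not the actual Commons Java code):
>
> /* Enhanced double hashing: the first index is emitted outside the loop,
>    then each iteration updates before storing, so no result is computed
>    and discarded. Assumes k >= 1. */
> void enh_dbl_hash(unsigned int a, unsigned int b,
>                   unsigned int hashes[], unsigned int k)
> {
>     unsigned int i;
>
>     hashes[0] = a;
>     for (i = 1; i < k; i++) {
>         a += b; /* add quadratic difference to get cubic */
>         b += i; /* add linear difference to get quadratic */
>         hashes[i] = a;
>     }
> }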
>
> When applying this change to the current repo we see 4 failures in the
> EnhancedDoubleHasherTest. These are corrected by changing the array of
> expected indices.
>
> Regards,
>
> Alex
>
>
>
>
> On Sat, 18 May 2024 at 07:23, Juan Manuel Gimeno Illa <jmgim...@gmail.com>
> wrote:
>
>> The implementation of the EnhancedDoubleHasher is based on the Wikipedia
>> article https://en.wikipedia.org/wiki/Double_hashing.
>>
>> The code that appears in the article is:
>>
>> struct key; /// Opaque
>> /// Use other data types when needed. (Must be unsigned for guaranteed wrapping.)
>> extern unsigned int h1(struct key const *), h2(struct key const *);
>>
>> /// Calculate k hash values from two underlying hash functions
>> /// h1() and h2() using enhanced double hashing.  On return,
>> ///     hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
>> /// Takes advantage of automatic wrapping (modular reduction)
>> /// of unsigned types in C.
>> void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
>> {
>>     unsigned int a = h1(x), b = h2(x), i;
>>
>>     for (i = 0; i < n; i++) {
>>         hashes[i] = a;
>>         a += b; // Add quadratic difference to get cubic
>>         b += i; // Add linear difference to get quadratic
>>                 // i++ adds constant difference to get linear
>>     }
>> }
>>
>> And it is based on code from the paper "Bloom Filters in Probabilistic
>> Verification" by Peter C. Dillinger and Panagiotis Manolios
>> (https://www.khoury.northeastern.edu/~pete/pub/bloom-filters-verification.pdf).
>>
>> But the code that appears in that paper is:
>>
>> x, y := a(δ) MOD m, b(δ) MOD m
>> f[0] := x
>> for i := 1 .. k-1
>>     x := (x + y) MOD m
>>     y := (y + i) MOD m
>>     f[i] := x
>>
>> which performs the assignments to x and y (a and b in the Wikipedia
>> article) in a different order.
>>
>> The same algorithm appears in the thesis "Adaptive Approximate State
>> Storage" (http://peterd.org/pcd-diss.pdf):
>>
>> algorithm " Enhanced DoubleHashing"
>>
>> input x : "the element being added or queried"
>> input m : "number of bits in bit vector"
>> input k : "number of indices to compute"
>> output g[0 .. k - 1] : "array of indices"
>> uses h1, h2 : "hash functions"
>>
>> begin
>>     a := h1(x)
>>     b := h2(x)
>>     g[0] := a
>>     for i := 1 .. k - 1
>>     begin
>>         a := (a + b) MOD m
>>         b := (b + i) MOD m
>>         g[i] := a
>>     end
>> end
>>
>> It's easy to see that the enhanced double hash, which computes
>>
>>     hash[i] = h1(x) + i * h2(x) + (i^3 - i) / 6
>>
>> generates, with a = h1(x) and b = h2(x), the sequence:
>>
>>     a, a + b, a + 2 * b + 1, a + 3 * b + 4, a + 4 * b + 10, ...
>>
>> Both your implementation and the Wikipedia article generate
>>
>>     a, a + b, a + 2 * b, a + 3 * b + 1, a + 4 * b + 4, ...
>>
>> So the last term in the sum lags one step behind.
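>>
>> To double-check, here is a small self-contained C program (mine, not
>> from either source) that prints the closed form next to both update
>> orderings:
>>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>>     unsigned int a = 7, b = 3;   /* arbitrary stand-ins for h1(x), h2(x) */
>>     unsigned int xp = a, yp = b; /* paper/thesis ordering */
>>     unsigned int xw = a, yw = b; /* wikipedia ordering */
>>     unsigned int i;
>>
>>     for (i = 0; i < 5; i++) {
>>         unsigned int closed = a + i * b + (i * i * i - i) / 6;
>>         printf("i=%u closed=%u paper=%u wiki=%u\n", i, closed, xp, xw);
>>         xw += yw; yw += i;       /* wikipedia: update y with i */
>>         xp += yp; yp += i + 1;   /* paper: update y with i + 1 */
>>     }
>>     return 0;
>> }
>>
>> The paper column matches the closed form everywhere, while the wiki
>> column adds the cubic contribution one step late (0, 0, 0, 1, 4
>> instead of 0, 0, 1, 4, 10).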
>>
>> The solution, I think, is as simple as swapping the order of the two
>> assignments to index and inc in your implementation, as sketched below.
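>>
>> Concretely, applying that swap to the Wikipedia loop quoted above
>> (where a and b play the roles of index and inc) gives:
>>
>>     for (i = 0; i < n; i++) {
>>         hashes[i] = a;
>>         b += i; /* add linear difference first */
>>         a += b; /* then add the quadratic difference */
>>     }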
>>
>> I suppose the practical impact of this is nil, but I'm not enough of an
>> expert to judge that.
>>
>> Thanks,
>>
>> Juan Manuel Gimeno
>>
>
