I think it probably still does its job; the hash value can just be
negative. It is likely to be very slightly biased, though. Because the
intent doesn't seem to be to allow the overflow, it's worth changing to
use Longs for the calculation.
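
For concreteness, here is a minimal, self-contained sketch (not an actual
patch) of the difference. HASH_PRIME is the value used by MinHashLSH;
hashInt and hashLong are hypothetical helpers for illustration:

object MinHashOverflowSketch {
  // Prime used by org.apache.spark.ml.feature.MinHashLSH, close to Int.MaxValue
  val HASH_PRIME: Int = 2038074743

  // Int-based form: (1 + elem) * a can exceed Int.MaxValue and wrap around,
  // so % can return a negative (and slightly biased) hash value.
  def hashInt(elem: Int, a: Int, b: Int): Int =
    ((1 + elem) * a + b) % HASH_PRIME

  // Long-based form: widening before the multiplication avoids the wrap.
  // Since HASH_PRIME < sqrt(2^63 - 1), the product fits comfortably in a Long.
  def hashLong(elem: Int, a: Int, b: Int): Long =
    ((1L + elem) * a + b) % HASH_PRIME

  def main(args: Array[String]): Unit = {
    val (elem, a, b) = (46340, 46341, 1) // plausible index and coefficients
    println(hashInt(elem, a, b))  // negative: the Int product wrapped around
    println(hashLong(elem, a, b)) // the intended value in [0, HASH_PRIME)
  }
}

Note that a single widening to Long is enough for this particular
expression: because HASH_PRIME is below sqrt(2^63 - 1), as pointed out
below, the one multiplication cannot overflow a Long. The extra mod
MinHashLSH.HASH_PRIME reductions would only become necessary if more
multiplications were chained.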

On Fri, Jul 6, 2018, 8:36 PM jiayuanm <jiayuanm...@gmail.com> wrote:

> Hi everyone,
>
> I was playing around with the LSH/MinHash module in Spark ML, and I
> noticed that the hash computation is done with Int (see
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
> ).
> Since "a" and "b" are drawn from a uniform distribution over [1,
> MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
> the multiplication is likely to cause Int overflow for a large sparse
> input vector.
>
> I wonder if this is a bug or intended. If it's a bug, one way to fix it is
> to compute the hashes with Long and insert a couple of mod
> MinHashLSH.HASH_PRIME reductions. Because MinHashLSH.HASH_PRIME is chosen
> to be smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer.
> Another option is to use BigInteger.
>
> Let me know what you think.
>
> Thanks,
> Jiayuan
