Re: Possible issue in MinHashMapper

Sean Owen Mon, 30 Jul 2012 05:03:24 -0700

Yes I know what you mean. In my understanding you typically apply minhash
to a large sparse vector that acts like a bit set, where the index is
really the set member. There you want to hash the index, and doing so by
considering all indices would be completely wrong.
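
As a rough sketch of that bit-set case (in Python for brevity, not Mahout's actual code): the dict-based vector, `make_hash`, `minhash_indices` and `minhash_all_indices` are hypothetical names made up for illustration, and the linear hash family is just one common choice.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def make_hash(seed):
    """Return one hypothetical hash function h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)  # a != 0, so distinct inputs in [0, PRIME) never collide
    b = rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME


def minhash_indices(sparse, hash_fns):
    """MinHash signature of a bit-set-like vector given as {index: value}.

    Only indices holding a non-zero value count as set members;
    the values themselves are ignored.
    """
    members = [i for i, v in sparse.items() if v != 0]
    if not members:
        raise ValueError("cannot minhash an empty set")
    return [min(h(i) for i in members) for h in hash_fns]


def minhash_all_indices(dimension, hash_fns):
    """The mistaken variant: hashes every index 0..dimension-1."""
    return [min(h(i) for i in range(dimension)) for h in hash_fns]
```

Here `{3: 1.0, 17: 1.0}` and `{3: 5.0, 17: 2.0}` get the same signature, since only the indices matter. `minhash_all_indices` depends on nothing but the dimension, so every vector of the same size would get an identical signature.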

Here I think the set elements are the values, and the vectors seem to be
treated as a list, really. So I'm not surprised they're treated as dense. I
still think it's a good idea to iterate over non-default items, since I'm
not clear whether the implementation is guaranteed to accept only dense
input vectors, where all dimensions have a value -- in which case it
doesn't matter and the current implementation is OK.
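
To illustrate the value-hashing case, here is a minimal sketch, again in Python rather than Mahout's Java. It assumes the vector is a list of non-negative integer ids (say, word ids) and that 0 is the default value; `minhash_values` and its `skip_default` flag are hypothetical, not part of MinHashMapper.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used as the hash modulus


def make_hash(seed):
    """Return one hypothetical hash function h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME


def minhash_values(vector, hash_fns, skip_default=True, default=0):
    """MinHash signature where the set elements are the vector's values.

    With skip_default=True only non-default entries are hashed, which
    mimics iterating over the non-default items of a sparse vector.
    """
    if skip_default:
        elements = [v for v in vector if v != default]
    else:
        elements = list(vector)  # hashes default entries too
    if not elements:
        raise ValueError("cannot minhash an empty set")
    return [min(h(v) for v in elements) for h in hash_fns]
```

For a fully dense input such as `[12, 7, 31]` the two modes agree, which matches the "doesn't matter" case. For an input with default entries, such as `[12, 0, 7, 0, 31]`, skipping defaults gives the same signature as the dense list, while hashing every entry adds 0 as an extra set element and may change the signature.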

Ankur, are you still around to answer? I think that's a good guess as to the
original intent.

On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <[email protected]> wrote:

> I agree about the performance effect of iterating over zeros. But the
> correctness effect comes from hashing the value of the element and not its
> index (at least in the documents and words example).
>
> Do you agree?
