I agree about the performance effect of iterating over zeros. But the
correctness effect comes from hashing the value of each element rather than
its index (at least in the documents-and-words example).
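
To make the point concrete, here is a minimal sketch in plain Java. It is
not the actual MinHashMapper code: the class name, the method names and the
toy hash function are all my own assumptions for illustration, and a plain
Map stands in for a sparse vector. It contrasts hashing the value of every
element (zeros included) with hashing the index of each non-zero element.

```java
import java.util.Map;

// Illustration only: none of these names come from Mahout.
class MinHashSketch {

    private static final long P = 2147483647L;   // prime modulus, 2^31 - 1
    private static final long A = 1103515245L;   // arbitrary multiplier

    // Toy hash (a * x + seed) mod p; assumed here, not the hash MinHashMapper uses.
    static int hash(long x, int seed) {
        return (int) Math.floorMod(A * x + seed, P);
    }

    // Hashes the VALUE of every element, zeros included.
    static int minHashOfValues(double[] dense, int seed) {
        int min = Integer.MAX_VALUE;
        for (double value : dense) {
            min = Math.min(min, hash((long) value, seed));
        }
        return min;
    }

    // Hashes the INDEX of each non-zero element only.
    static int minHashOfIndices(Map<Integer, Double> sparse, int seed) {
        int min = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            if (e.getValue() != 0.0) {
                min = Math.min(min, hash(e.getKey(), seed));
            }
        }
        return min;
    }

    // Expands a sparse index -> value map into a dense array of the given size.
    static double[] toDense(Map<Integer, Double> sparse, int cardinality) {
        double[] dense = new double[cardinality];
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }
}
```

Take two documents with no words in common, say words {1, 5} and words
{2, 7}, each word occurring once. With minHashOfValues both documents only
ever hash the values 0 and 1, so they get the same signature and would land
in the same cluster. With minHashOfIndices the signatures are computed from
the word ids themselves, so they differ.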

Do you agree?

On Mon, Jul 30, 2012 at 11:58 AM, Sean Owen <[email protected]> wrote:

> I think that's right, though I think the effect on correctness is quite
> small, but the effect on performance is large. This will always hash zero
> even if zero were not really present in the vector. That is not likely to
> produce the smallest hash value though.
>
> Hashing all those zeroes is wasteful and I'm not clear whether it is
> demanded by the semantics. I suppose I'd also like the author to confirm
> whether this can be changed to look at only non-default values.
>
> On Mon, Jul 30, 2012 at 9:35 AM, Elena Smirnova <[email protected]> wrote:
>
> > Hello,
> >
> > It seems to me that there is an issue in the MinHashMapper class. In the
> > map method, the loop goes over the elements of the vector. In many cases
> > the instance of the Vector abstract class is a SparseVector, and the
> > iteration is meant to be over non-zero values (e.g., documents as sparse
> > vectors of words). However, in the current implementation the iteration
> > goes over all the elements, including zero-valued ones (because it uses
> > the default vector iterator). This can produce meaningless clustering. In
> > addition, in this case I think we should hash the index of the element
> > rather than its value.
> >
> > Can somebody confirm or disprove this?
> >
> > Thanks,
> >
> > Best regards,
> > Elena Smirnova.
> >
>
