Hello,

It seems to me that there is an issue in the MinHashMapper class. In the map
method, the loop goes over the elements of the vector. In many cases the
Vector instance is a SparseVector, and the iteration is presumably meant to
cover only the non-zero values (e.g., a document represented as a sparse
vector of words). However, the current implementation uses the default
vector iterator, so it goes over all the elements, including the zero-valued
ones. This can produce meaningless clustering. In addition, in this case I
think we should hash the index of the element rather than its value.
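
To make the concern concrete, here is a minimal sketch of the two behaviours.
It does not use the Mahout API: a plain Map<Integer, Double> stands in for a
SparseVector, and the class name, method names and hash function are all
hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map<Integer, Double> (index -> value) stands in for a
// sparse vector. Nothing here is taken from the actual MinHashMapper code.
class MinHashSketch {

    // Hypothetical hash function; a stand-in for whatever hash family is used.
    static int hash(int x, int seed) {
        int h = x * 0x9E3779B1 + seed;
        h ^= h >>> 16;
        return h & Integer.MAX_VALUE; // keep the result non-negative
    }

    // Proposed behaviour: visit only non-zero entries and hash their indices.
    static int minHashOverNonZeroIndices(Map<Integer, Double> sparse, int seed) {
        int min = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            if (e.getValue() != 0.0) {
                min = Math.min(min, hash(e.getKey(), seed));
            }
        }
        return min;
    }

    // Behaviour described above: visit every element, zeros included, and
    // hash the element's value.
    static int minHashOverAllValues(Map<Integer, Double> sparse, int cardinality, int seed) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < cardinality; i++) {
            double v = sparse.getOrDefault(i, 0.0);
            min = Math.min(min, hash((int) v, seed));
        }
        return min;
    }

    public static void main(String[] args) {
        // Two documents with no words in common but the same counts.
        Map<Integer, Double> docA = new HashMap<>();
        docA.put(1, 2.0);
        docA.put(5, 1.0);
        Map<Integer, Double> docC = new HashMap<>();
        docC.put(2, 1.0);
        docC.put(9, 2.0);

        // Both see the value set {0, 1, 2}, so the value-based hashes collide.
        System.out.println(minHashOverAllValues(docA, 10, 42)
                == minHashOverAllValues(docC, 10, 42));
    }
}
```

In this sketch, two documents that share no words still get the same
value-based min-hash whenever they contain the same set of counts (including
the zeros), whereas the index-based variant depends only on which words are
present.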

Can somebody confirm or disprove this?

Thanks,

Best regards,
Elena Smirnova.
