I think that's right, though the effect on correctness is probably quite small, while the effect on performance is large. Iterating over every element means the value zero always gets hashed, even if no zero is really present in the vector. That is not likely to produce the smallest hash value, though.
Hashing all those zeroes is wasteful, and I'm not clear whether it is demanded by the semantics. I'd also like the author to confirm whether this can be changed to look at only non-default values.

On Mon, Jul 30, 2012 at 9:35 AM, Elena Smirnova <[email protected]> wrote:

> Hello,
>
> It seems to me that there is an issue in the MinHashMapper class. In the map
> method, the loop goes over the elements in the vector. In many cases the
> instance of the Vector abstract class is a SparseVector, and the iteration is
> meant to be over non-zero values (e.g., documents as a sparse vector of
> words). However, in the current implementation the iteration goes over all
> the elements, including zero-valued ones (as it uses the default vector
> iterator). This can produce meaningless clustering. In addition, in this
> case I think we should hash the index of the element rather than its
> value.
>
> Can somebody confirm or disprove this?
>
> Thanks,
>
> Best regards,
> Elena Smirnova.
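
To make the two behaviours in this thread concrete, here is a rough sketch in plain Java. It is not the MinHashMapper code and does not use the Mahout Vector API: the class name, the method names, the `Map`-based sparse representation, and the mixing function in `hash` are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a stand-in for the two iteration strategies discussed above.
final class MinHashSketch {

    // Hypothetical integer mixing function; NOT the hash used by Mahout.
    static int hash(int x, int seed) {
        int h = x * 0x9E3779B1 + seed;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h & 0x7fffffff; // keep the result non-negative
    }

    // Hashes the VALUE of every element, zeros included.
    // Cost grows with the full dimension of the vector.
    static int minHashOverAllValues(double[] dense, int seed) {
        int min = Integer.MAX_VALUE;
        for (double v : dense) {
            min = Math.min(min, hash((int) v, seed));
        }
        return min;
    }

    // Hashes the INDEX of each non-zero element only.
    // Cost grows with the number of non-zero entries.
    static int minHashOverNonZeroIndices(Map<Integer, Double> sparse, int seed) {
        int min = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            if (e.getValue() != 0.0) {
                min = Math.min(min, hash(e.getKey(), seed));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Two "documents" containing the same words (indices 3 and 7)
        // with different weights.
        Map<Integer, Double> a = new TreeMap<>();
        a.put(3, 1.0);
        a.put(7, 2.0);
        Map<Integer, Double> b = new TreeMap<>();
        b.put(3, 5.0);
        b.put(7, 9.0);

        System.out.println(minHashOverNonZeroIndices(a, 42)
                == minHashOverNonZeroIndices(b, 42)); // same index set, same min-hash

        // The dense walk can never return more than hash(0, seed)
        // once the vector contains a zero.
        double[] dense = {0, 0, 0, 1.0, 0, 0, 0, 2.0};
        System.out.println(minHashOverAllValues(dense, 42) <= hash(0, 42));
    }
}
```

In this sketch the index-based variant depends only on which entries are non-zero, so two documents with the same words but different weights get the same min-hash. The value-based variant does work proportional to the full dimension, and its result is capped by the hash of zero whenever a zero is present. Whether the MinHash semantics actually allow skipping default values is still the open question above.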
