If vectors are treated as dense, then we have to modify the example given for this class, which clearly talks about documents and words: https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
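To make the two readings concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the function names, the toy vectors, and the linear hash family are all assumptions for illustration only. It contrasts hashing the indices of non-zero entries (the set reading from the documents and words example) with hashing every stored value (the dense, list-like reading).

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the illustrative hash family


def make_hash_functions(num_hashes, seed=42):
    """Build (a, b) pairs for hashes of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def signature(items, hash_functions):
    """MinHash signature: the minimum hash of the items, per hash function.

    Raises ValueError if items is empty (e.g. an all-zero vector).
    """
    items = list(items)
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


def signature_from_indices(vector, hash_functions):
    """Set reading: the index is the set member, so hash non-zero indices only."""
    return signature((i for i, v in enumerate(vector) if v != 0), hash_functions)


def signature_from_values(vector, hash_functions):
    """Dense/list reading: hash every stored value, zeros included."""
    return signature((int(v) for v in vector), hash_functions)


hashes = make_hash_functions(8)

# Two hypothetical "documents" over a 6-word vocabulary that share no words.
doc_a = [1, 0, 1, 0, 0, 1]
doc_b = [0, 1, 0, 1, 1, 0]

idx_a = signature_from_indices(doc_a, hashes)
idx_b = signature_from_indices(doc_b, hashes)
val_a = signature_from_values(doc_a, hashes)
val_b = signature_from_values(doc_b, hashes)

print("index signatures match in",
      sum(x == y for x, y in zip(idx_a, idx_b)), "of", len(hashes), "positions")
print("value signatures identical:", val_a == val_b)
```

With these toy inputs the two documents have disjoint word sets, so the index-based signatures never agree, while the value-based signatures are identical because both vectors contain only the values 0 and 1. That is the correctness concern for the documents and words example if the input is a term-presence vector rather than a list of word ids.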
On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <[email protected]> wrote:

> Yes I know what you mean. In my understanding you typically apply minhash
> to a large sparse vector that acts like a bit set, where the index is
> really the set member. There you want to hash the index, and doing so by
> considering all indices would be completely wrong.
>
> Here I think the set elements are the values, and the vectors seem to be
> treated as a list, really. So I'm not surprised they're treated as dense. I
> still think it's a good idea to iterate over non-default items, since I'm
> not clear whether the implementation is guaranteed to accept only dense
> input vectors, where all dimensions have a value -- in which case it
> doesn't matter and the current implementation is OK.
>
> Ankur are you still around to answer? I think that's a good guess as to the
> original intent.
>
> On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <[email protected]> wrote:
>
> > I agree about performance effect of iterating over zeros. But the
> > correctness effect comes due to hashing values of the element and not its
> > index (at least in documents and words example).
> >
> > Do you agree?
