The input to MinHashDriver consists of tf-vectors, e.g.:
hadoop fs -text /..../tf-vectors/part-r-00000
10020926
{96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
10020927
{53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
10020928
{53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
10020929
{53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}

So the vectors are sparse and indexed by wordId. Clearly, this example doesn't fit the current implementation.

On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[email protected]> wrote:

> Hmm yeah what's the output format of that job? At first glance it looks
> like it makes a dictionary and fills in sequentially with term frequencies.
> If so, applying min-hash this way isn't wildly wrong but I still don't
> think it works.
>
> Can anyone who knows this code at all weigh in? It's not my area...
>
> On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:
>
> > If vectors are treated as dense, then we have to modify the example given
> > for this class, which clearly talks about documents and words:
> > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
> >
> > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <[email protected]> wrote:
> >
> > > Yes, I know what you mean. In my understanding you typically apply
> > > minhash to a large sparse vector that acts like a bit set, where the
> > > index is really the set member. There you want to hash the index, and
> > > doing so by considering all indices would be completely wrong.
> > >
> > > Here I think the set elements are the values, and the vectors seem to
> > > be treated as a list, really. So I'm not surprised they're treated as
> > > dense. I still think it's a good idea to iterate over non-default
> > > items, since I'm not clear whether the implementation is guaranteed to
> > > accept only dense input vectors, where all dimensions have a value --
> > > in which case it doesn't matter and the current implementation is OK.
> > >
> > > Ankur, are you still around to answer? I think that's a good guess as
> > > to the original intent.
> > >
> > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <[email protected]> wrote:
> > >
> > > > I agree about the performance effect of iterating over zeros. But the
> > > > correctness effect comes from hashing the value of an element rather
> > > > than its index (at least in the documents-and-words example).
> > > >
> > > > Do you agree?
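To make the index-vs-value distinction concrete: when tf-vectors like the ones above all carry values of 1.0, each document is effectively a set of word ids, and MinHash must hash the *indices* of the non-zero entries. The following is a minimal, self-contained sketch of that idea (it is not Mahout's actual MinHashDriver code; the class name, the hash family, and the word-id arrays taken from the first two vectors above are illustrative assumptions):

```java
import java.util.Arrays;
import java.util.Random;

public class MinHashSketch {

    // 2^31 - 1, a Mersenne prime, used as the modulus for the hash family.
    private static final long PRIME = 2147483647L;

    private final long[] a;
    private final long[] b;

    // Each (a[i], b[i]) pair defines one hash function h_i(x) = (a[i]*x + b[i]) mod PRIME.
    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // avoid a == 0
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Signature over the *indices* (word ids) of the non-zero entries:
    // each index is a set member, so only indices are hashed, never values.
    public long[] signature(int[] nonZeroIndices) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int idx : nonZeroIndices) {
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * idx + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // The fraction of matching signature slots is an unbiased estimate
    // of the Jaccard similarity of the underlying sets.
    public static double estimateJaccard(long[] s1, long[] s2) {
        int match = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                match++;
            }
        }
        return (double) match / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(256, 42L);
        // Word ids from the first two tf-vectors above (all values are 1.0,
        // so only the indices carry information).
        int[] doc1 = {96037, 114292, 119745, 64091, 84488, 116350, 135598, 53382};
        int[] doc2 = {53382, 1076, 116350, 36446, 135598, 96037, 84488};
        double est = estimateJaccard(mh.signature(doc1), mh.signature(doc2));
        // True Jaccard here is |intersection| / |union| = 5 / 10 = 0.5;
        // with 256 hashes the estimate lands near that.
        System.out.println(est);
    }
}
```

Hashing the values instead would collapse every document to the set {1.0} and make all signatures identical, which is the correctness problem being discussed; iterating only over non-zero entries (as with Mahout's sparse vector iterators) is then both the correct and the efficient choice.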
