OK. But I preferred to modify a couple of lines of MinHashMapper rather
than writing my own sparse-vector generator:
...................
for (int i = 0; i < numHashFunctions; i++) {
  Iterator<Vector.Element> it = featureVector.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element next = it.next();
    int value = next.index(); // hash the word id (the index), not the element's value
..................
It works fine.
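For anyone reading this thread later, here is a minimal standalone sketch (not Mahout's MinHashMapper) of the idea under discussion: min-hashing the non-zero *indices* of a sparse tf-vector, treating each word id as a set member. The class name, hash parameters, and the word-id sets are illustrative only; the hash family is a simple (a*x + b) mod p, not necessarily what Mahout uses.

```java
import java.util.Arrays;

// Hypothetical sketch: min-hash over the indices (word ids) of a sparse
// vector, treating the set of non-zero indices as the document's word set.
public class MinHashSketch {

    // Simple hash of a set member: h(x) = (a*x + b) mod p, p a Mersenne prime.
    static int hash(int x, int a, int b) {
        long p = 2147483647L; // 2^31 - 1
        return (int) ((a * (long) x + b) % p);
    }

    // One signature entry per (a[i], b[i]) pair: the minimum hash value
    // taken over all set members (here, the non-zero word ids).
    static int[] minHash(int[] wordIds, int[] a, int[] b) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int i = 0; i < a.length; i++) {
            for (int id : wordIds) {
                sig[i] = Math.min(sig[i], hash(id, a[i], b[i]));
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        // Word-id sets taken from two of the tf-vectors quoted below.
        int[] doc1 = {96037, 114292, 119745, 64091, 84488, 116350, 135598, 53382};
        int[] doc2 = {53382, 1076, 116350, 36446, 135598, 96037, 84488, 135598};
        int[] a = {1, 7, 31};
        int[] b = {0, 3, 11};
        // Documents sharing many word ids tend to agree on signature entries.
        System.out.println(Arrays.toString(minHash(doc1, a, b)));
        System.out.println(Arrays.toString(minHash(doc2, a, b)));
    }
}
```

The point the thread settles on is visible here: the hash is applied to the index (word id), so iterating only the non-zero elements is both correct and cheaper than walking every dimension.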
On Mon, Jul 30, 2012 at 2:27 PM, Sean Owen <[email protected]> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should
> just be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <[email protected]> wrote:
>
> > The input to MinHashDriver is tf-vectors, e.g.:
> > hdfs -text /..../tf-vectors/part-r-00000
> > 10020926 {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
> > 10020927 {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
> > 10020928 {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
> > 10020929 {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
> >
> > So the vectors are sparse and indexed by word id. Clearly, this example
> > doesn't fit the current implementation.
> >
> > On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[email protected]> wrote:
> >
> > > Hmm, yeah, what's the output format of that job? At first glance it looks
> > > like it makes a dictionary and fills in sequentially with term
> > > frequencies. If so, applying min-hash this way isn't wildly wrong, but I
> > > still don't think it works.
> > >
> > > Can anyone who knows this code at all weigh in? It's not my area...
> > >
> > > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:
> > >
> > > > If vectors are treated as dense, then we have to modify the example
> > > > given for this class, which clearly talks about documents and words:
> > > >
> > > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
> > > >
> > > > On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <[email protected]> wrote:
> > > >
> > > > > Yes, I know what you mean. In my understanding you typically apply
> > > > > minhash to a large sparse vector that acts like a bit set, where the
> > > > > index is really the set member. There you want to hash the index, and
> > > > > doing so by considering all indices would be completely wrong.
> > > > >
> > > > > Here I think the set elements are the values, and the vectors seem to
> > > > > be treated as a list, really. So I'm not surprised they're treated as
> > > > > dense. I still think it's a good idea to iterate over non-default
> > > > > items, since I'm not clear whether the implementation is guaranteed
> > > > > to accept only dense input vectors, where all dimensions have a value
> > > > > -- in which case it doesn't matter and the current implementation is
> > > > > OK.
> > > > >
> > > > > Ankur, are you still around to answer? I think that's a good guess as
> > > > > to the original intent.
> > > > >
> > > > > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <[email protected]> wrote:
> > > > >
> > > > > > I agree about the performance effect of iterating over zeros. But
> > > > > > the correctness problem comes from hashing the value of the element
> > > > > > rather than its index (at least in the documents-and-words example).
> > > > > >
> > > > > > Do you agree?
> > > > >
> > > >
> > >
> >
>