Hmm yeah, what's the output format of that job? At first glance it looks
like it builds a dictionary and fills the vector in sequentially with term
frequencies. If so, applying min-hash this way isn't wildly wrong, but I
still don't think it works.
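
To make the two readings concrete, here is a rough Python sketch. It is not
Mahout's code; every name in it (make_hashes, minhash_over_indices,
minhash_over_values) is made up for illustration, and the hash family
(a*x + b mod a prime) is just an assumption. It contrasts hashing the indices
of the non-zero entries with hashing the stored values:

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hashes(num_hashes, seed=0):
    """Build num_hashes functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    # Bind a and b as defaults so each lambda keeps its own parameters.
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def minhash_over_indices(sparse_vec, hashes):
    """Bit-set view: set members are the indices of the non-zero entries.

    sparse_vec maps index -> value; assumes at least one non-zero entry.
    """
    members = [i for i, v in sparse_vec.items() if v != 0]
    return [min(h(i) for i in members) for h in hashes]


def minhash_over_values(values, hashes):
    """List view: set members are the stored values themselves.

    Assumes a non-empty list of integer-valued entries.
    """
    return [min(h(int(v)) for v in values) for h in hashes]


hashes = make_hashes(8)

# Same words (indices 3, 7, 9) but different term frequencies.
doc_a = {3: 2.0, 7: 1.0, 9: 5.0}
doc_b = {3: 1.0, 7: 4.0, 9: 1.0}

sig_a = minhash_over_indices(doc_a, hashes)
sig_b = minhash_over_indices(doc_b, hashes)
print(sig_a == sig_b)  # same index set, so the signatures match

# Hashing the term frequencies instead compares {1, 2, 5} with {1, 4}.
print(minhash_over_values(list(doc_a.values()), hashes))
print(minhash_over_values(list(doc_b.values()), hashes))
```

With word-index vectors, the first function compares documents by which words
they contain, while the second compares them by which frequency values happen
to occur, which is the mismatch being discussed here.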

Can anyone who knows this code at all weigh in? It's not my area...

On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:

> If vectors are treated as dense, then we have to modify the example given
> for this class, which clearly talks about documents and words:
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
>
> On Mon, Jul 30, 2012 at 2:02 PM, Sean Owen <[email protected]> wrote:
>
> > Yes I know what you mean. In my understanding you typically apply minhash
> > to a large sparse vector that acts like a bit set, where the index is
> > really the set member. There you want to hash the index, and doing so by
> > considering all indices would be completely wrong.
> >
> > Here I think the set elements are the values, and the vectors seem to be
> > treated as a list, really. So I'm not surprised they're treated as dense.
> > I still think it's a good idea to iterate over non-default items, since
> > I'm not clear whether the implementation is guaranteed to accept only
> > dense input vectors, where all dimensions have a value -- in which case it
> > doesn't matter and the current implementation is OK.
> >
> > Ankur, are you still around to answer? I think that's a good guess as to
> > the original intent.
> >
> > On Mon, Jul 30, 2012 at 12:51 PM, Elena Smirnova <[email protected]> wrote:
> >
> > > I agree about the performance effect of iterating over zeros. But the
> > > correctness problem comes from hashing the value of each element and
> > > not its index (at least in the documents-and-words example).
> > >
> > > Do you agree?
> >
>
