I am also curious about the current MinHash implementation. At the moment, the vector's TF or TF-IDF weights are what get hashed, via Vector.Element.get(). Jeff Hansen pointed out in a previous thread on the mailing list that this is incorrect: the index should be hashed instead, because the index is what identifies an N-gram in the dictionary.
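To make the difference concrete, here is a minimal sketch, not Mahout's actual code. It assumes a sparse vector can be stood in for by a Map from dictionary index to weight. The class MinHashSketch, its method names, and the mix() hash function are all made up for illustration. It builds one signature from the indices and one from the weights for two documents that contain the same N-grams with different TF-IDF weights.

```java
import java.util.Map;

// Illustrative only: this is NOT Mahout's MinHash code. A sparse vector is
// stood in for by a Map from dictionary index to TF/TF-IDF weight.
class MinHashSketch {

    // Hypothetical seeded 64-bit mixer (SplitMix64-style finalizer). Each
    // seed plays the role of one hash function in the MinHash family.
    static long mix(long x, int seed) {
        long z = x + 0x9E3779B97F4A7C15L * (seed + 1);
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Signature built from the element indices, i.e. from WHICH terms occur.
    static long[] signatureFromIndices(Map<Integer, Double> vector, int numHashes) {
        long[] sig = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long min = Long.MAX_VALUE;
            for (int index : vector.keySet()) {
                min = Math.min(min, mix(index, i));
            }
            sig[i] = min;
        }
        return sig;
    }

    // Signature built from the element values, i.e. from the weights.
    static long[] signatureFromWeights(Map<Integer, Double> vector, int numHashes) {
        long[] sig = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long min = Long.MAX_VALUE;
            for (double weight : vector.values()) {
                min = Math.min(min, mix(Double.doubleToLongBits(weight), i));
            }
            sig[i] = min;
        }
        return sig;
    }

    // Fraction of signature positions that agree; this is the usual
    // MinHash estimate of the Jaccard similarity of the hashed sets.
    static double agreement(long[] a, long[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / a.length;
    }
}
```

For two vectors with the same indices {3, 17, 42} but weights {0.5, 1.2, 0.1} versus {2.0, 0.7, 0.9}, the index-based signatures agree in every position (agreement 1.0), while the weight-based signatures agree in none (agreement 0.0), even though both documents contain exactly the same N-grams. That is consistent with the argument for hashing the index.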
However, in this blog post, http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html, hashing is done directly on the N-gram itself. How is this algorithm supposed to work? Thoughts?

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone; even the
> regular MinHash clustering doesn't seem right when run on the Reuters
> dataset (using cluster-reuters.sh) and a few other data sets I had tried. I
> am playing with the keyGroups values to determine if that improves the
> quality of clustering.
>
>
>
> ________________________________
> From: Lance Norskog <goks...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> Minhash works better and better with the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash make more sense if it used a few hundred dimensions and then
> collapsed down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering
>> isn't right. Looking at the DisplayMinHash contributed issue further
>> cements this feeling, but I can't quite put my finger on what is wrong. I
>> don't think it is completely true to the Broder paper, but that doesn't
>> necessarily make it wrong. It's just that both the cluster-reuters output and
>> the DisplayMinHash output seem to be of pretty low quality. My gut says it
>> has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com