Lance, I don't think this problem is confined to DisplayMinHash. Even the regular MinHash clustering doesn't seem right when run on the Reuters dataset (using cluster-reuters.sh) and on a few other data sets I have tried. I am experimenting with the keyGroups values to see whether that improves the quality of the clustering.
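For anyone following along, here is a rough sketch of why the grouping setting could matter so much. It is not Mahout's code. It assumes that keyGroups means "the number of min-hash values joined into one cluster key", and it uses disjoint groups (standard LSH banding), which may differ from what the Mahout mapper actually does. All function names are made up for illustration.

```python
import hashlib
from collections import defaultdict


def hash_token(token, seed):
    # Deterministic seeded hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes):
    # One min-hash value per seed: the smallest hash over all tokens.
    return [min(hash_token(t, seed) for t in tokens) for seed in range(num_hashes)]


def cluster_keys(signature, key_groups):
    # ASSUMPTION: a cluster key is key_groups consecutive min-hash values.
    # Disjoint groups are used here; leftover values at the end are dropped.
    last_start = len(signature) - key_groups + 1
    return [tuple(signature[i:i + key_groups]) for i in range(0, last_start, key_groups)]


def minhash_cluster(docs, num_hashes=20, key_groups=2):
    # docs maps a document id to its set of tokens.
    clusters = defaultdict(set)
    for doc_id, tokens in docs.items():
        signature = minhash_signature(tokens, num_hashes)
        for band, key in enumerate(cluster_keys(signature, key_groups)):
            clusters[(band, key)].add(doc_id)
    return clusters


if __name__ == "__main__":
    docs = {
        "a": {"oil", "price", "opec"},
        "b": {"oil", "price", "opec"},
        "c": {"wheat", "harvest"},
    }
    for key, members in minhash_cluster(docs).items():
        if len(members) > 1:
            print(key[0], sorted(members))
```

Under that assumption, two documents with Jaccard similarity s agree on one key with probability s^keyGroups, so a small keyGroups lumps loosely related documents together, while a large one splits nearly everything into singletons. Either extreme would look like low-quality clustering.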
________________________________
From: Lance Norskog <goks...@gmail.com>
To: dev@mahout.apache.org
Sent: Monday, January 16, 2012 8:46 PM
Subject: Re: Minhash review

Minhash works better and better with the more dimensions you throw at it, right? All of the Display classes use two dimensions. Would MinHash make more sense if it used a few hundred dimensions and then collapsed down to two? Maybe with SVD?

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering
> isn't right. Looking at the DisplayMinHash contributed issue further cements
> this feeling, but I can't quite put my finger on what is wrong. I don't
> think it is completely true to the Broder paper, but that doesn't necessarily
> make it wrong. It's just that both the cluster-reuters output and the
> DisplayMinHash output seem to be of pretty low quality. My gut says it has
> to do with the group stuff whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.
>
>

--
Lance Norskog
goks...@gmail.com