MinHash works better the more dimensions you throw at it, right? All of the Display classes use two dimensions. Would MinHash make more sense if it used a few hundred dimensions and then collapsed them down to two, maybe with SVD?
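
For anyone who wants to poke at this, here is a minimal sketch of MinHash signatures over a larger feature set. It is not Mahout's MinHash code: the class name MinHashSketch, the linear hash family h(x) = (a*x + b) mod P, the seed, and the toy feature sets are all assumptions made for illustration. It only shows the basic idea that the fraction of matching signature slots estimates the Jaccard similarity of two feature sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class, not part of Mahout.
public class MinHashSketch {
    // Prime modulus (2^31 - 1) for the assumed hash family h(x) = (a*x + b) mod P.
    private static final long P = 2147483647L;

    private final long[] a;
    private final long[] b;

    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) P - 1); // 1 .. P-1, never zero
            b[i] = rnd.nextInt((int) P);         // 0 .. P-1
        }
    }

    // One signature slot per hash function: the minimum hash over all features.
    public long[] signature(Set<Integer> features) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int f : features) {
            long x = Math.floorMod((long) f, P);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % P;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of slots where the two signatures agree.
    public static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    // Exact Jaccard similarity, for comparison with the estimate.
    public static double jaccard(Set<Integer> x, Set<Integer> y) {
        Set<Integer> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<Integer> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Integer> x = new HashSet<>();
        Set<Integer> y = new HashSet<>();
        for (int i = 0; i < 100; i++) x.add(i);
        for (int i = 50; i < 150; i++) y.add(i);

        MinHashSketch mh = new MinHashSketch(200, 42L);
        System.out.println("exact    = " + jaccard(x, y));
        System.out.println("estimate = " + estimate(mh.signature(x), mh.signature(y)));
    }
}
```

With two-dimensional points there are very few distinct features to hash, so a sketch like this has little to work with; that is the reason I am asking about more dimensions.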

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering
> isn't right. Looking at the DisplayMinHash contributed issue further cements
> this feeling, but I can't quite put my finger on what is wrong. I don't
> think it is completely true to the Broder paper, but that doesn't necessarily
> make it wrong. It's just both the cluster-reuters output and the
> DisplayMinHash output seem to be of pretty low quality. My gut says it has
> to do with the group stuff whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.

--
Lance Norskog
goks...@gmail.com