I've had a sneaking suspicion for a while now that our minhash clustering isn't 
right.  Looking at the contributed DisplayMinHash issue further cements this 
feeling, but I can't quite put my finger on what is wrong.  I don't think it is 
completely true to the Broder paper, but that doesn't necessarily make it 
wrong.  It's just that both the cluster-reuters output and the DisplayMinHash 
output seem to be of pretty low quality.  My gut says it has to do with the 
grouping we do when we create the signatures.
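
For comparison, here is a minimal sketch of plain Broder-style minhash as I understand it: k hash functions, one minimum per function, and the fraction of matching signature positions as the Jaccard estimate. This is not what our code does and involves no grouping at all. The function names, the linear hash family, and the MD5-based token hash are all assumptions made for illustration.

```python
import hashlib
import random

# Large Mersenne prime used as the modulus for the (assumed) linear hash family.
PRIME = (1 << 61) - 1


def stable_hash(token):
    """Map a string token to an integer that is stable across processes."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_hash_funcs(k, seed=42):
    """Create k hash functions of the form h(v) = (a * v + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(tokens, hash_funcs):
    """Return one minimum per hash function; tokens must be non-empty."""
    values = [stable_hash(t) for t in set(tokens)]
    return [min((a * v + b) % PRIME for v in values) for a, b in hash_funcs]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Running two documents with a known overlap through this and comparing the estimate against the exact Jaccard similarity might be a cheap sanity check to hold our grouped signatures up against.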

I think before we do 0.6 it could use a few eyeballs.

