MinHash works better the more dimensions you throw at it, right? All
of the Display classes use two dimensions. Would MinHash make more
sense if it used a few hundred dimensions and the result were then
collapsed down to two? Maybe with SVD?
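
To make the question concrete, here is a rough sketch of MinHash over a few hundred dimensions. This is plain Python, not the Mahout code. The function names, the linear hash family, the 300-term vocabulary, and the two example documents are all assumptions made up for illustration. It only shows how signatures are built and compared against the exact Jaccard similarity; it does not do the collapse to two dimensions.

```python
import random

# Assumption: items are non-negative integer term ids, hashed with a
# simple linear family h(x) = (a*x + b) mod p. This is a common textbook
# choice, not necessarily what the Mahout implementation does.
PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1


def make_hash_funcs(num_hashes, seed=42):
    """Build num_hashes random (a, b) pairs for the linear hash family."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, hash_funcs):
    """One signature entry per hash function: the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in items) for (a, b) in hash_funcs]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical documents: sets of term ids drawn from a 300-term vocabulary.
doc_a = set(range(0, 200))
doc_b = set(range(100, 300))

hash_funcs = make_hash_funcs(num_hashes=200)
sig_a = minhash_signature(doc_a, hash_funcs)
sig_b = minhash_signature(doc_b, hash_funcs)

print("exact Jaccard:    ", exact_jaccard(doc_a, doc_b))
print("estimated Jaccard:", estimated_jaccard(sig_a, sig_b))
```

With only two dimensions each "document" has almost nothing to hash, so the signatures carry very little information. That is the reason for asking whether the display should start from a larger space.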

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering 
> isn't right.  Looking at the DisplayMinHash contributed issue further cements 
> this feeling, but I can't quite put my finger on what is wrong.  I don't 
> think it is completely true to the Broder paper, but that doesn't necessarily 
> make it wrong.  It's just both the cluster-reuters output and the 
> DisplayMinHash output seem to be of pretty low quality.  My gut says it has 
> to do with the group stuff whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.
>
>



-- 
Lance Norskog
goks...@gmail.com
