Lance,

I don't think this problem is confined to DisplayMinHash alone. Even the 
regular MinHash clustering doesn't seem right when run on the Reuters dataset 
(using cluster-reuters.sh) and on a few other data sets I have tried.  I am 
experimenting with the keyGroups value to see whether that improves the 
quality of the clustering.
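
To make the keyGroups trade-off concrete, here is a minimal sketch of MinHash with grouped keys. It is not Mahout's MinHashMapper code: the function names, the linear hash family, and the non-overlapping banding are all assumptions chosen for illustration. The idea is that key_groups consecutive minhash values are joined into one cluster key, so with ideal hash functions two items with Jaccard similarity J share a given key with probability of roughly J^key_groups. A larger value should give tighter but more fragmented clusters.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hash_functions(num_hashes, seed=42):
    """Return (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME.

    This simple linear family is an assumption for the sketch, not
    necessarily what Mahout uses.
    """
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, hash_functions):
    """Minimum hash value of a non-empty set of integer features,
    computed once per hash function."""
    return [min((a * x + b) % PRIME for x in features)
            for a, b in hash_functions]


def cluster_keys(signature, key_groups):
    """Join key_groups consecutive minhash values into one cluster key.

    Bands do not overlap, and the band index is part of the key so that
    values from different bands never match each other.
    """
    return {
        (i // key_groups, tuple(signature[i:i + key_groups]))
        for i in range(0, len(signature) - key_groups + 1, key_groups)
    }


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    hashes = make_hash_functions(num_hashes=20)
    doc_a = set(range(0, 100))
    doc_b = set(range(10, 110))
    for key_groups in (1, 2, 4):
        keys_a = cluster_keys(minhash_signature(doc_a, hashes), key_groups)
        keys_b = cluster_keys(minhash_signature(doc_b, hashes), key_groups)
        print(key_groups, jaccard(doc_a, doc_b), len(keys_a & keys_b))
```

Running the script prints, for each key_groups value, the exact Jaccard similarity and the number of cluster keys the two toy documents share, which is one way to see how the setting changes what ends up grouped together.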



________________________________
 From: Lance Norskog <goks...@gmail.com>
To: dev@mahout.apache.org 
Sent: Monday, January 16, 2012 8:46 PM
Subject: Re: Minhash review
 
Minhash works better and better the more dimensions you throw at
it, right? All of the Display classes use two dimensions. Would
MinHash make more sense if it used a few hundred dimensions and then
collapsed down to two? Maybe with SVD?

Are there other clustering algorithms that have this problem?

On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> I've had a sneaking suspicion for a while now that our minhash clustering 
> isn't right.  Looking at the DisplayMinHash contributed issue further cements 
> this feeling, but I can't quite put my finger on what is wrong.  I don't 
> think it is completely true to the Broder paper, but that doesn't necessarily 
> make it wrong.  It's just that both the cluster-reuters output and the 
> DisplayMinHash output seem to be of pretty low quality.  My gut says it has 
> to do with the key grouping whereby we create the signatures.
>
> I think before we do 0.6 it could use a few eyeballs.
>
>



-- 
Lance Norskog
goks...@gmail.com
