I am also curious about the current MinHash implementation. In the
current implementation, the vector's TF or TF-IDF weights are hashed via
Vector.Element.get(). Jeff Hansen pointed out in a previous thread on
the mailing list that this is incorrect and that the index should be
hashed instead, because the index identifies an N-gram in the dictionary.

However, in this blog post

http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html

hashing is done directly on the N-gram itself.

How is this algorithm supposed to work? Thoughts?
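
To make the question concrete, here is a minimal sketch of the textbook
MinHash construction as I understand it. It is Python for brevity, not
Mahout code, and every name in it (seeded_hash, minhash_signature,
estimated_jaccard, the sample vectors) is hypothetical. The assumption is
that a document is treated as a *set* of features, so the thing being
hashed is the feature's identity (the dictionary index or the N-gram
string) and the weight is never hashed at all.

```python
import hashlib


def seeded_hash(seed: int, token: str) -> int:
    """One member of a family of 64-bit hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features, num_hashes=128):
    """Signature of a feature set: the minimum hash value under each function.

    `features` holds feature identities (term indices or N-gram strings),
    not TF/TF-IDF weights.
    """
    tokens = {repr(f) for f in features}
    if not tokens:
        raise ValueError("cannot MinHash an empty feature set")
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Sparse TF-IDF vectors as {dictionary index: weight}; the values are made up.
doc_a = {3: 0.5, 7: 1.2, 11: 0.3, 20: 2.0}
doc_b = {3: 0.9, 7: 0.1, 11: 4.0, 42: 1.0}

# Hash the indices (the keys), not the weights (the values).
sig_a = minhash_signature(doc_a.keys())
sig_b = minhash_signature(doc_b.keys())
similarity = estimated_jaccard(sig_a, sig_b)  # estimates |A & B| / |A | B| = 3/5
```

If the dictionary maps each N-gram to exactly one index, hashing the index
and hashing the N-gram string should be interchangeable, since both identify
the same set element. Hashing the weight is different: two documents with
the same terms but different weights get unrelated signatures, and
unrelated terms that happen to share a weight collide.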

On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> Lance,
>
> I don't think this problem is confined to DisplayMinHash alone; even the
> regular MinHash clustering doesn't seem right when run on the Reuters
> dataset (using cluster-reuters.sh) and a few other data sets I have tried. I
> am playing with the keyGroups values to determine whether that improves the
> quality of the clustering.
>
>
>
> ________________________________
>  From: Lance Norskog <goks...@gmail.com>
> To: dev@mahout.apache.org
> Sent: Monday, January 16, 2012 8:46 PM
> Subject: Re: Minhash review
>
> MinHash works better and better the more dimensions you throw at
> it, right? All of the Display classes use two dimensions. Would
> MinHash make more sense if it used a few hundred dimensions and then
> collapsed down to two? Maybe with SVD?
>
> Are there other clustering algorithms that have this problem?
>
> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>> I've had a sneaking suspicion for a while now that our minhash clustering 
>> isn't right.  Looking at the DisplayMinHash contributed issue further 
>> cements this feeling, but I can't quite put my finger on what is wrong.  I 
>> don't think it is completely true to the Broder paper, but that doesn't 
>> necessarily make it wrong.  It's just that both the cluster-reuters output
>> and the DisplayMinHash output seem to be of pretty low quality.  My gut says
>> it has to do with the group stuff whereby we create the signatures.
>>
>> I think before we do 0.6 it could use a few eyeballs.
>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
