Hi Suneel, I was curious about the status of the updated MinHash implementation you were planning to work on. If you have something that's workable perhaps an idea to create a Jira ticket?
Cheers, Frank On Thu, Mar 8, 2012 at 1:44 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote: > That's correct. > > > > ________________________________ > From: Frank Scholten <fr...@frankscholten.nl> > To: dev@mahout.apache.org > Sent: Thursday, March 8, 2012 4:17 AM > Subject: Re: Minhash review > > I agree with Grant that it's good to first get a working > implementation that matches the paper. Later on we can work on other > approaches. > > So if I understand correctly the vectorization step can be skipped and > we can run SequenceFilesFromDirectory -> CollocDriver -> MinHashDriver > > Correct? > > On Thu, Mar 8, 2012 at 8:22 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote: >> Frank, >> >> I modified the present MinHash to hash on the index as opposed to the >> present tf-idf weights, but the change had no impact on the output and I >> still get bad clusters. >> >> I did read the blog posting you mention and that seems to be the right >> approach (and conforms to Broder's original paper on this subject - >> http://dl.acm.org/citation.cfm?id=736184). >> >> I can work on this. Do we modify the existing minhash code to be compliant >> with Broder's paper or do we implement a different MinHash based on Broder's >> paper? >> >> Regards. >> >> >> >> ________________________________ >> From: Frank Scholten <fr...@frankscholten.nl> >> To: dev@mahout.apache.org >> Sent: Monday, March 5, 2012 9:13 AM >> Subject: Re: Minhash review >> >> I am also curious about the current MinHash implementation. In the >> current implementation the vector TF or TF-IDF weights are hashed via >> Vector.Element.get(). Jeff Hansen pointed out in a previous thread on >> the mailinglist that this is incorrect and the index should be hashed >> because the index identifies an N-gram in the dictionary. >> >> However in this blog >> >> http://notskateboarding.blogspot.com/2011/01/minhashing-is-reaaally-cool.html >> >> hashing is done directly on the N-gram itself. >> >> How is this algorithm supposed to work? Thoughts? >> >> On Tue, Jan 17, 2012 at 2:51 AM, Suneel Marthi <suneel_mar...@yahoo.com> >> wrote: >>> Lance, >>> >>> I don't think this problem is confined to DisplayMinHash alone, even the >>> regular MinHash clustering doesn't seem right when run on the Reuter's >>> dataset (using cluster-reuters.sh) and a few other data sets I had tried. >>> I am playing with the the keyGroups values to determine if that improves >>> the quality of clustering. >>> >>> >>> >>> ________________________________ >>> From: Lance Norskog <goks...@gmail.com> >>> To: dev@mahout.apache.org >>> Sent: Monday, January 16, 2012 8:46 PM >>> Subject: Re: Minhash review >>> >>> Minhash works better and better with the more dimensions you throw at >>> it, right? All of the Display classes use two dimensions. Would >>> MinHash more sense if it uses a few hundred dimensions and then >>> collapse down to two? Maybe with SVD? >>> >>> Are there other clustering algorithms that have this problem? >>> >>> On Fri, Jan 13, 2012 at 5:53 AM, Grant Ingersoll <gsing...@apache.org> >>> wrote: >>>> I've had a sneaking suspicion for a while now that our minhash clustering >>>> isn't right. Looking at the DisplayMinHash contributed issue further >>>> cements this feeling, but I can't quite put my finger on what is wrong. I >>>> don't think it is completely true to the Broder paper, but that doesn't >>>> necessarily make it wrong. It's just both the cluster-reuters output and >>>> the DisplayMinHash output seem to be of pretty low quality. My gut says >>>> it has to do with the group stuff whereby we create the signatures. >>>> >>>> I think before we do 0.6 it could use a few eyeballs. >>>> >>>> >>> >>> >>> >>> -- >>> Lance Norskog >>> goks...@gmail.com