Thanks Suneel. I think Mahout users would benefit from having both use cases. In my experience (information retrieval), hashing the indexes is quite a popular use case.
For example, we could add a parameter to the MinHashDriver job that specifies whether to hash the values or the indexes of the vectors. That way we keep the existing unit tests functional and add a similar one for hashing indexes. We could also add the example back.

What do you think?

On Tue, Jul 31, 2012 at 12:46 AM, Suneel Marthi <[email protected]> wrote:

> Done.
>
> ________________________________
> From: Suneel Marthi <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, July 30, 2012 6:44 PM
> Subject: Re: Possible issue in MinHashMapper
>
> Sean,
>
> I'll take care of this, I added this sometime last year but was never
> convinced that it ever worked right.
>
> ________________________________
> From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Monday, July 30, 2012 5:18 PM
> Subject: Re: Possible issue in MinHashMapper
>
> (Does anyone have edit rights to delete this wiki please? It appears
> to be just wrong.)
>
> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
>
> On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <[email protected]> wrote:
> > From my reading of the code and the unit test, that is totally different
> > from the expected input of MinHashDriver. So I think the example should just
> > be deleted.
> >
> > You would have to convert this to something like:
> >
> > {0:96037,1:114292,2:119745,...}
> >
> > to make it work correctly.
> >
> > On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <[email protected]> wrote:
> >>
> >> The input to the MinHashDriver are tf-vectors, e.g.:
> >> hdfs -text /..../tf-vectors/part-r-00000
> >> 10020926
> >> {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
> >> 10020927
> >> {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
> >> 10020928
> >> {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
> >> 10020929
> >> {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
> >>
> >> So the vectors are sparse and indexed by wordId. Clearly, this example
> >> doesn't fit into current implementation.
> >>
> >> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[email protected]> wrote:
> >>
> >> > Hmm yeah what's the output format of that job? At first glance it looks
> >> > like it makes a dictionary and fills in sequentially with term
> >> > frequencies.
> >> > If so, applying min-hash this way isn't wildly wrong but I still don't
> >> > think it works.
> >> >
> >> > Can anyone who knows this code at all weigh in? it's not my area...
> >> >
> >> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:
> >> >
> >> > > If vectors are treated as dense, then we have to modify the example
> >> > > given for this class, which clearly talks about documents and words:
> >> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
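
To make the two modes discussed in this thread concrete (hashing the values of a vector versus hashing its indexes), here is a minimal Python sketch. It is not Mahout code and does not reflect how MinHashMapper is actually written: the function names, the `hash_indexes` flag, and the simple `(a * x + b) % PRIME` hash family are all assumptions made for illustration. The sample vector is the tf-vector for document 10020926 quoted above, and the positional form is my reading of the `{0:96037,1:114292,2:119745,...}` conversion suggested in the thread.

```python
import random

# Modulus for the toy hash family h(x) = (a * x + b) % PRIME.
# This hash family is an assumption of the sketch, not Mahout's.
PRIME = 2_147_483_647


def make_hash_functions(num_hashes, seed=0):
    """Return num_hashes random (a, b) pairs, seeded for repeatability."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(vector, hash_functions, hash_indexes):
    """Compute a min-hash signature of a sparse vector given as {index: value}.

    hash_indexes=True  hashes the indexes of the non-zero entries
                       (tf-vectors indexed by wordId).
    hash_indexes=False hashes the stored values
                       (vectors whose values are the item ids).
    """
    if hash_indexes:
        items = [int(index) for index, value in vector.items() if value != 0]
    else:
        items = [int(value) for value in vector.values()]
    if not items:
        raise ValueError("cannot min-hash an empty vector")
    # One signature entry per hash function: the minimum hash over all items.
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_functions]


# tf-vector for document 10020926, as quoted in the thread: {wordId: tf}.
tf_vector = {
    96037: 1.0, 114292: 1.0, 119745: 1.0, 64091: 1.0,
    84488: 1.0, 116350: 1.0, 135598: 1.0, 53382: 1.0,
}

# Positional form: {position: wordId}, built from the same word ids.
positional = {pos: word_id for pos, word_id in enumerate(tf_vector)}

hash_functions = make_hash_functions(10)
sig_from_indexes = minhash_signature(tf_vector, hash_functions, hash_indexes=True)
sig_from_values = minhash_signature(positional, hash_functions, hash_indexes=False)
print(sig_from_indexes == sig_from_values)
```

Both calls hash the same set of word ids, so the two signatures agree. Under these assumptions, a single flag would be enough to let the job accept either input layout without converting the tf-vectors first.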
