(Does anyone have edit rights to delete this wiki page, please? It appears to be just wrong.)
https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <[email protected]> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should
> just be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <[email protected]> wrote:
>>
>> The input to the MinHashDriver is tf-vectors, e.g.:
>>
>> hdfs -text /..../tf-vectors/part-r-00000
>> 10020926   {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
>> 10020927   {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
>> 10020928   {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
>> 10020929   {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>>
>> So the vectors are sparse and indexed by wordId. Clearly, this example
>> doesn't fit the current implementation.
>>
>> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[email protected]> wrote:
>>
>>> Hmm, yeah, what's the output format of that job? At first glance it
>>> looks like it makes a dictionary and fills it in sequentially with term
>>> frequencies. If so, applying min-hash this way isn't wildly wrong, but I
>>> still don't think it works.
>>>
>>> Can anyone who knows this code at all weigh in? It's not my area...
>>>
>>> On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:
>>>
>>>> If vectors are treated as dense, then we have to modify the example
>>>> given for this class, which clearly talks about documents and words:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
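
For reference, a minimal sketch of the conversion Sean describes: take a
sparse tf-vector whose nonzero indices are word IDs and build a vector whose
values are those word IDs, so that {96037:1.0,114292:1.0,...} becomes
{0:96037,1:114292,...}. This is not Mahout code; the class and method names
below are made up for illustration, and it assumes the Mahout 0.7-era Vector
API (DenseVector, iterateNonZero):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public final class TfVectorToMinHashInput {

  private TfVectorToMinHashInput() {
  }

  /**
   * Turns a sparse tf-vector whose nonzero indices are word IDs into a dense
   * vector whose values are those word IDs,
   * e.g. {96037:1.0,114292:1.0,...} -> {0:96037.0,1:114292.0,...}.
   */
  public static Vector wordIdsAsValues(Vector tfVector) {
    // Collect the nonzero indices (the word IDs). Iteration order over
    // nonzeros is not guaranteed to follow index order for every Vector
    // implementation, but that doesn't matter when min-hashing the values.
    List<Integer> wordIds = new ArrayList<Integer>();
    Iterator<Vector.Element> nonZeros = tfVector.iterateNonZero();
    while (nonZeros.hasNext()) {
      wordIds.add(nonZeros.next().index());
    }
    // Store the word IDs as the values of a dense vector.
    double[] values = new double[wordIds.size()];
    for (int i = 0; i < values.length; i++) {
      values[i] = wordIds.get(i);
    }
    return new DenseVector(values);
  }
}

The converted vectors could then be written back to a sequence file in place
of the original tf-vectors, assuming Sean's reading of the expected input is
correct.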
