(Does anyone have edit rights to delete this wiki page, please? It appears
to be simply wrong.)

https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering

On Mon, Jul 30, 2012 at 1:27 PM, Sean Owen <[email protected]> wrote:
> From my reading of the code and the unit test, that is totally different
> from the expected input of MinHashDriver. So I think the example should just
> be deleted.
>
> You would have to convert this to something like:
>
> {0:96037,1:114292,2:119745,...}
>
> to make it work correctly.
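>
> Roughly, a conversion along these lines should do it (just an untested
> sketch, assuming the tf-vectors are Mahout Vectors keyed by term id and
> that MinHashDriver wants the term ids as the vector's values; the class
> name here is made up):
>
> import java.util.Iterator;
> import org.apache.mahout.math.DenseVector;
> import org.apache.mahout.math.Vector;
>
> public class TfVectorToMinHashInput {
>   // Turn a tf-vector keyed by term id ({96037:1.0,114292:1.0,...}) into a
>   // vector whose values are the term ids themselves ({0:96037,1:114292,...}).
>   public static Vector convert(Vector tfVector) {
>     Vector out = new DenseVector(tfVector.getNumNondefaultElements());
>     int pos = 0;
>     Iterator<Vector.Element> it = tfVector.iterateNonZero();
>     while (it.hasNext()) {
>       out.setQuick(pos++, it.next().index()); // keep the term id, drop the tf value
>     }
>     return out;
>   }
> }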
>
>
> On Mon, Jul 30, 2012 at 1:25 PM, Elena Smirnova <[email protected]>
> wrote:
>>
>> The input to MinHashDriver consists of tf-vectors, e.g.:
>> hdfs -text /..../tf-vectors/part-r-00000
>> 10020926    {96037:1.0,114292:1.0,119745:1.0,64091:1.0,84488:1.0,116350:1.0,135598:1.0,53382:1.0}
>> 10020927    {53382:1.0,1076:1.0,116350:1.0,36446:1.0,135598:1.0,96037:1.0,84488:1.0}
>> 10020928    {53382:1.0,58740:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,136284:1.0}
>> 10020929    {53382:1.0,13358:1.0,116350:1.0,126929:1.0,135598:1.0,114292:1.0,76732:1.0}
>>
>> So the vectors are sparse and indexed by wordId. Clearly, this example
>> doesn't fit the current implementation.
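>>
>> For reference, reading those tf-vectors programmatically (assuming the
>> usual SequenceFile of Text keys and VectorWritable values that seq2sparse
>> writes) could look roughly like this untested sketch:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Text;
>> import org.apache.mahout.math.VectorWritable;
>>
>> public class DumpTfVectors {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     Path part = new Path(args[0]); // e.g. one tf-vectors/part-r-00000 file
>>     SequenceFile.Reader reader =
>>         new SequenceFile.Reader(FileSystem.get(conf), part, conf);
>>     Text docId = new Text();
>>     VectorWritable vec = new VectorWritable();
>>     while (reader.next(docId, vec)) {
>>       // key = document id, value = sparse vector keyed by term id
>>       System.out.println(docId + "\t" + vec.get());
>>     }
>>     reader.close();
>>   }
>> }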
>>
>> On Mon, Jul 30, 2012 at 2:19 PM, Sean Owen <[email protected]> wrote:
>>
>> > Hmm, yeah, what's the output format of that job? At first glance it looks
>> > like it makes a dictionary and fills in sequentially with term frequencies.
>> > If so, applying min-hash this way isn't wildly wrong, but I still don't
>> > think it works.
>> >
>> > Can anyone who knows this code at all weigh in? It's not my area...
>> >
>> > On Mon, Jul 30, 2012 at 1:11 PM, Elena Smirnova <[email protected]> wrote:
>> >
>> > > If vectors are treated as dense, then we have to modify the example
>> > > given
>> > > for this class, which clearly talks about documents and words:
>> > > https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering
