Hi,

 

I am trying to find similar documents using the Mahout rowsimilarity job.
I have 7 small documents in my test set. Documents 2 and 3 have no words
in common, but the following output reports them as exactly similar
(similarity 1.0).

 

 

0       elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

1       elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}

2       elts: {2:1.0, 3:1.0}

3       elts: {2:1.0, 3:1.0}

4       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

5       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}

6       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}
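
To make the expectation concrete, here is a minimal sketch in plain Python (not Mahout code). It assumes cosine similarity over raw term counts; the similarity measure and term weighting actually used in the job are not stated in this message, and the two sample documents are hypothetical, not the ones from the test set. Under those assumptions, two documents with no shared words score 0, which is why a value of 1.0 for such a pair looks wrong.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts, using raw term counts as vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both documents contribute to the dot product.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents with no words in common.
doc_x = "apple banana cherry"
doc_y = "dog elephant fox"
print(cosine_similarity(doc_x, doc_y))  # 0.0, since the dot product is zero
```
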

 

I executed the following commands to generate the above output. 

 

Step 1: bin/mahout seqdirectory - converted the documents to SequenceFile format

Step 2: mahout seq2sparse - converted the sequence files to vector format

Step 3: bin/mahout rowid - converted the vectors into matrix format

Step 4: bin/mahout rowsimilarity - computed the row similarity

Step 5: bin/mahout vectordump - converted the output to a readable format

 

Please help me fix this issue.

 

Thank you for your help in advance. 

 

Seby Paul

 
