Markus Paaso created MAHOUT-1629:
------------------------------------

             Summary: Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense 
when using s3 folder as --input
                 Key: MAHOUT-1629
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1629
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.9
         Environment: AWS EMR with AMI 3.2.3
            Reporter: Markus Paaso


When running the 'mahout cvb' command on AWS EMR with --input set to an S3 
location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input files 
in my case), the content of the doc-topic output is nonsense. It looks as if 
the docIds in the doc-topic output are shuffled. The topic model output 
(p(term|topic) for each topic) still looks fine.

The workaround is to first copy the input files from S3 to the cluster's HDFS 
with the command:
 {code:none}hadoop fs -cp s3://mybucket/input /input{code}
and then run mahout cvb with the option --input /input .
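
For reference, the two steps of the workaround could be scripted roughly as 
follows. This is only a sketch: the output paths under /cvb, the topic count 
and the iteration count are placeholder values that are not part of this 
report, and the run helper is a hypothetical wrapper that prints each command 
unless RUN=1 is set. A real job will likely need further cvb options (for 
example a dictionary) depending on the data.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input on HDFS, then run cvb on the copy.
S3_INPUT="s3://mybucket/input"   # S3 folder from the report
HDFS_INPUT="/input"              # HDFS target from the report

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Step 1: copy the input files from S3 to the cluster's HDFS.
run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT"

# Step 2: run cvb against the HDFS copy instead of the S3 location.
# All values below except --input are placeholders (assumptions).
run mahout cvb \
  --input "$HDFS_INPUT" \
  --output /cvb/topic-term \
  --doc_topic_output /cvb/doc-topic \
  --topic_model_temp_dir /cvb/tmp \
  --num_topics 20 \
  --maxIter 20
```

Running the script as-is only prints the two commands, so they can be checked 
before executing them on the cluster with RUN=1.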



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
