Markus Paaso created MAHOUT-1629:
------------------------------------
Summary: Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense
when using s3 folder as --input
Key: MAHOUT-1629
URL: https://issues.apache.org/jira/browse/MAHOUT-1629
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.9
Environment: AWS EMR with AMI 3.2.3
Reporter: Markus Paaso
When running the 'mahout cvb' command on AWS EMR with the --input option set to
an S3 location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input
files in my case), the content of the doc-topic output is nonsense. It seems
that the docIds in the doc-topic output are shuffled. The topic model output
(p(term|topic) for each topic) still looks fine, however.
The workaround is to first copy the input files from S3 to the cluster's HDFS
with the command:
{code:none}hadoop fs -cp s3://mybucket/input /input{code}
and then run mahout cvb with the option --input /input .
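The two workaround steps can be combined into a small wrapper. This is only a sketch, not part of Mahout or EMR: the names run and stage_and_run_cvb and the variables DRY_RUN, S3_INPUT and HDFS_INPUT are hypothetical, and the paths are the example values from this report. Any further cvb options are passed through as given.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input on HDFS, then run cvb on the copy.
# All function and variable names here are illustrative assumptions.

S3_INPUT="${S3_INPUT:-s3://mybucket/input}"   # S3 folder used in this report
HDFS_INPUT="${HDFS_INPUT:-/input}"            # HDFS target used in this report

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_and_run_cvb() {
    # Step 1: copy the input files from S3 to the cluster's HDFS.
    run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" || return 1
    # Step 2: run cvb against the HDFS copy; remaining cvb options
    # are forwarded unchanged from the caller.
    run mahout cvb --input "$HDFS_INPUT" "$@"
}
```

With DRY_RUN=1 set, calling stage_and_run_cvb only prints the two commands, which allows the paths to be checked before anything is copied.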
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)