[
https://issues.apache.org/jira/browse/MAHOUT-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy Lyubimov updated MAHOUT-1629:
-------------------------------------
Assignee: Suneel Marthi
> Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder
> as --input
> ----------------------------------------------------------------------------------------
>
> Key: MAHOUT-1629
> URL: https://issues.apache.org/jira/browse/MAHOUT-1629
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Environment: AWS EMR with AMI 3.2.3
> Reporter: Markus Paaso
> Assignee: Suneel Marthi
> Labels: legacy
>
> When running the 'mahout cvb' command on AWS EMR with --input set to a value
> like s3://mybucket/input/ or s3://mybucket/input/* (7 input files in my case),
> the content of the doc-topic output is nonsensical. It seems that the docIds
> in the doc-topic output are shuffled. The topic model output (p(term|topic)
> for each topic) still looks fine, however.
> The workaround is to first copy the input files from S3 to the cluster's HDFS
> with the command:
> {code:none}hadoop fs -cp s3://mybucket/input /input{code}
> and then run mahout cvb with the option --input /input .
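
The workaround described in the report could be wrapped in a small script along the following lines. This is only a sketch under stated assumptions: the paths are the example values from the report, while the variables S3_INPUT, HDFS_INPUT and DRY_RUN and the functions run and stage_and_run_cvb are hypothetical names introduced here for illustration. Apart from --input, no cvb options are assumed; whatever else the job needs is passed through unchanged as arguments.

```shell
#!/bin/sh
# Sketch of the workaround: stage the S3 input on HDFS first, then point
# 'mahout cvb' at the HDFS copy instead of the s3:// location.
# Paths are the example values from the report; all variable and function
# names below are hypothetical helpers, not part of Mahout or Hadoop.
S3_INPUT="${S3_INPUT:-s3://mybucket/input}"
HDFS_INPUT="${HDFS_INPUT:-/input}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the input to HDFS and, only if the copy succeeds, start cvb on it.
# Any remaining cvb options are taken from the function arguments.
stage_and_run_cvb() {
    run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" &&
    run mahout cvb --input "$HDFS_INPUT" "$@"
}

stage_and_run_cvb "$@"
```

With the default DRY_RUN=1 the script only prints the two commands it would run, which makes it easy to check the paths before setting DRY_RUN=0 on the cluster.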