[jira] [Updated] (MAHOUT-1629) Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder as --input

Andrew Palumbo (JIRA) Thu, 05 Mar 2015 18:26:15 -0800

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrew Palumbo updated MAHOUT-1629:
-----------------------------------
    Labels: legacy  (was: )

> Mahout cvb on AWS EMR: p(topic|docId) doesn't make sense when using s3 folder 
> as --input
> ----------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1629
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1629
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>         Environment: AWS EMR with AMI 3.2.3
>            Reporter: Markus Paaso
>              Labels: legacy
>
> When running the 'mahout cvb' command on AWS EMR with --input set to an S3 
> location such as s3://mybucket/input/ or s3://mybucket/input/* (7 input files 
> in my case), the content of the doc-topic output is nonsense. It seems that 
> the docIds in the doc-topic output are shuffled. The topic model output 
> (p(term|topic) for each topic) still looks fine.
> The workaround is to first copy the input files from S3 to the cluster's HDFS 
> with the command:
>  {code:none}hadoop fs -cp s3://mybucket/input /input{code}
> and then to run mahout cvb with the option --input /input .
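
The workaround described above can be sketched as a small wrapper script. This is only an illustration under stated assumptions: the two paths are the ones given in the report, while the DRY_RUN switch and the function names run and stage_and_run_cvb are hypothetical helpers introduced here, not part of Mahout or Hadoop. Any further cvb options are passed through unchanged rather than guessed at.

```shell
#!/bin/sh
# Sketch of the reported workaround: stage the S3 input on HDFS first,
# then point 'mahout cvb' at the HDFS copy instead of the S3 folder.
# Paths mirror the report; DRY_RUN is a hypothetical switch (default 1)
# so the commands can be inspected without a cluster.
S3_INPUT="${S3_INPUT:-s3://mybucket/input}"
HDFS_INPUT="${HDFS_INPUT:-/input}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_and_run_cvb() {
    # Step 1: copy the input files from S3 to the cluster's HDFS.
    run hadoop fs -cp "$S3_INPUT" "$HDFS_INPUT" || return 1
    # Step 2: run cvb against the HDFS path; extra arguments are
    # forwarded to cvb exactly as given by the caller.
    run mahout cvb --input "$HDFS_INPUT" "$@"
}

stage_and_run_cvb "$@"
```

By default the script only prints the two commands. Setting DRY_RUN=0 executes them, in which case the copy must succeed before cvb is started.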



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
