Cristian Galán created MAHOUT-1634:
--------------------------------------

             Summary: ALS doesn't work when new files are added to the Distributed Cache
                 Key: MAHOUT-1634
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1634
             Project: Mahout
          Issue Type: Bug
          Components: Collaborative Filtering
    Affects Versions: 0.9
         Environment: Cloudera 5.1 VM, eclipse, zookeeper
            Reporter: Cristian Galán
             Fix For: 1.0


The ALS algorithm uses the distributed cache for its temp files, but the 
distributed cache has other uses too, especially adding dependencies 
(http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/),
so when we add a dependency library (or any other file) to a Hadoop job, ALS 
fails because it reads ALL files in the Distributed Cache without distinction.

This occurs in my company's project because we need to add the Mahout 
dependencies (mahout, lucene, ...) to a Hadoop Configuration to run Mahout 
jobs; otherwise the jobs fail because they can't find the dependencies.

I propose two options (I think both are valid):
1) Filter out all .jar files from the result of HadoopUtil.getCacheFiles
2) Filter out all Path objects that don't match /part-*

I prefer the first because it's less aggressive, and I think this solution 
would resolve all the problems.
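To illustrate option 1, here is a minimal sketch of the intended filtering. It uses plain path strings instead of Hadoop Path objects so it's self-contained; the class and method names (CacheFileFilter, excludeJars) are hypothetical, and in Mahout the filter would be applied to the paths returned by HadoopUtil.getCacheFiles:

```java
import java.util.Arrays;

/** Sketch of option 1: drop .jar entries from the distributed-cache file list. */
public class CacheFileFilter {

  // Hypothetical helper: keep only cache entries that are not jar dependencies.
  static String[] excludeJars(String[] cacheFiles) {
    return Arrays.stream(cacheFiles)
        .filter(p -> !p.endsWith(".jar"))
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    // Mixed cache contents: ALS temp files plus an added dependency jar.
    String[] cached = {
        "/tmp/als/U--1/part-m-00000",
        "/user/lib/mahout-core-0.9.jar",
        "/tmp/als/M--1/part-m-00000"
    };
    // ALS would then iterate only over the data files.
    System.out.println(Arrays.toString(excludeJars(cached)));
  }
}
```

Option 2 would instead keep only paths matching /part-*, which is stricter but could break if temp file naming ever changes.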

PS: Sorry if my English is wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
