[
https://issues.apache.org/jira/browse/MAHOUT-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255832#comment-14255832
]
Cristian Galán commented on MAHOUT-1627:
----------------------------------------
Now I have the same problem in SSVDSolver, specifically in the setup method of
the mapper in BtJob.java.
Both ALS.class and BtJob.class (and possibly other algorithms that pass files
through the DistributedCache) break when they call HadoopUtil.getCache(), which
returns all files in the distributed cache... In my opinion (and this is what I
am going to try in my own project to prove it), it would be better to refactor
HadoopUtil.getCache() so it excludes the *.jar files that have been added to
the distributed cache, rather than changing every algorithm.
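A minimal sketch of the proposed filter, assuming we simply drop entries whose name ends in ".jar". The class and method names here are hypothetical, and plain strings stand in for the org.apache.hadoop.fs.Path[] that the real Mahout method would operate on:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CacheFileFilter {

  // Keep only distributed-cache entries that are not jar files.
  // In Mahout this would filter the Path[] returned by the cache
  // lookup rather than a list of strings.
  public static List<String> excludeJars(List<String> cachedFiles) {
    List<String> result = new ArrayList<>();
    for (String file : cachedFiles) {
      if (!file.endsWith(".jar")) {
        result.add(file);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative cache contents: oozie adds the program jar
    // alongside the sequence files the job actually needs.
    List<String> cache = Arrays.asList(
        "/cache/userRatings/part-r-00000",
        "/cache/ourprogram.jar",
        "/cache/itemFeatures/part-r-00001");
    System.out.println(excludeJars(cache));
    // prints [/cache/userRatings/part-r-00000, /cache/itemFeatures/part-r-00001]
  }
}
```

Doing this once inside HadoopUtil.getCache() would fix every caller listed below at the same time, instead of patching each algorithm separately.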
--------
I have investigated; HadoopUtil.getCache() is used in the following classes:
org.apache.mahout.cf.taste.example.email.EmailUtility
org.apache.mahout.cf.taste.hadoop.als.ALS <- This issue
org.apache.mahout.classifier.df.mapreduce.Builder
org.apache.mahout.classifier.df.mapreduce.Classifier
org.apache.mahout.clustering.spectral.VectorCache
org.apache.mahout.common.HadoopUtil
org.apache.mahout.math.hadoop.stochasticsvd.ABtDenseOutJob
org.apache.mahout.math.hadoop.stochasticsvd.ABtJob
org.apache.mahout.math.hadoop.stochasticsvd.BtJob
VectorCache, for example, only reads the first file in the distributed cache,
which is why spectral k-means does not break when I use it. The same applies to
the decision-forest Builder: loadConfiguration() loads the whole distributed
cache but returns a single path selected by index, and the callers always pass
index = 0...
> Problem with ALS Factorizer MapReduce version when working with oozie because
> of files in distributed cache. Error: Unable to read sequence file from cache.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1627
> URL: https://issues.apache.org/jira/browse/MAHOUT-1627
> Project: Mahout
> Issue Type: Bug
> Components: Collaborative Filtering
> Affects Versions: 1.0
> Environment: Hadoop
> Reporter: Srinivasarao Daruna
>
> There is a problem with the ALS Factorizer when working in a distributed
> environment with oozie.
> Steps:
> 1) Built the Mahout 1.0 jars and picked the mahout-mrlegacy jar.
> 2) Created a Java class that calls ParallelALSFactorizationJob with the
> respective inputs.
> 3) Submitted the job; a list of MapReduce jobs was submitted to perform the
> factorization.
> 4) The job failed at MultithreadedSharingMapper with the error "Unable to
> read sequence file '<ourprogram>.jar'", pointing at
> org.apache.mahout.cf.taste.hadoop.als.ALS and the
> readMatrixByRowsFromDistributedCache method.
> Cause: The ALS class picks up the input files, which are sequence files, from
> the distributed cache using the readMatrixByRowsFromDistributedCache method.
> However, in an oozie environment the program jar is also copied to the
> distributed cache along with the input files. Since the ALS class tries to
> read all the files in the distributed cache, it fails when it encounters the
> jar.
> The remedy would be to add a condition that picks only files that are not
> jars.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)