[
https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988558#action_12988558
]
steve commented on MAHOUT-371:
------------------------------
I'm trying to execute this on an Amazon EC2 cluster, but I receive this error
at the second job. Can someone help me?
Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from a null string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
        at org.apache.hadoop.fs.Path.<init>(Path.java:90)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:85)
        at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:252)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.cf.taste.hadoop.svd.DistributedSVDItemRecommenderJob.run(DistributedSVDItemRecommenderJob.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.mahout.cf.taste.hadoop.svd.DistributedSVDItemRecommenderJob.main(DistributedSVDItemRecommenderJob.java:134)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Proposal to implement Distributed SVD++ Recommender using Hadoop
> ----------------------------------------------------------------
>
> Key: MAHOUT-371
> URL: https://issues.apache.org/jira/browse/MAHOUT-371
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Richard Simon Just
> Assignee: Sean Owen
> Attachments: svd.zip
>
>
> Proposal Title: [MAHOUT-371] Proposal to implement Distributed SVD++
> Recommender using Hadoop
> Student Name: Richard Simon Just
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> During the Netflix Prize Challenge one of the most popular forms of
> recommender algorithm was matrix factorisation, in particular
> Singular Value Decomposition (SVD). As such, this proposal looks to implement
> a distributed version of one of the most successful SVD-based recommender
> algorithms from the Netflix competition: namely, the SVD++ algorithm.
> The SVD++ improves upon other basic SVD algorithms by incorporating implicit
> feedback[1]. That is to say, it is able to take into account not just
> explicit user preferences, but also feedback such as, in the case of a
> company like Netflix, whether a movie has been rented. Implicit feedback
> assumes that the fact that there is some correlation between the user and
> the item is more important than whether the correlation is positive or
> negative. Implicit feedback would account for an item having been rated, but
> not what the rating was.
> The implementation will include testing, in-depth documentation and a
> demo/tutorial. If there is time, I will also look to develop the algorithm
> into the timeSVD++ algorithm[2,3]. The timeSVD++ further improves the results
> of the SVD algorithm by taking into account temporal dynamics. Temporal
> dynamics addresses the way users' preferences for items, and the way they
> rate items, can change over time. According to [2], the gains in accuracy
> from implementing timeSVD++ are significantly bigger than the gains going
> from SVD to SVD++.
> The overall project will provide three deliverables:
> 1. The basic framework for a distributed SVD-based recommender
> 2. A distributed SVD++ implementation and demo
> 3. A distributed timeSVD++ implementation
> Detailed Description:
> The SVD++ algorithm uses the principle of categorising users and items into
> factors, combined with regularisation and implicit feedback, to predict how
> well a user is likely to match an item. Factors are abstract categories
> created from comparing the data presented. Factor values are grades
> indicating how strongly each user/item relates to that category. For example,
> with the Netflix data a factor could loosely correspond to a movie genre, a
> director or a story line. The more factors used, the more detailed the
> categories are likely to become, and thus the more accurate the predictions
> are likely to become.
> Implicit feedback is based on the theory that a user having any sort of
> relationship to an item is more important than whether they rated it, or what
> rating they gave. The assumption is that even if a user does not like an
> item, or has not rated it, the very fact that they chose to have some
> interaction with it indicates something about their preferences. In the
> Netflix case this would be represented by whether a user has rated a movie or
> not; it could also take into account the user's rental history.
> As well as the actual implementation of the code, the project has two other
> deliverable focuses: the readability, documentation and testing of the code;
> and a full tutorial and demo of the code. It is felt that without these
> things the implementation is not really complete or fully usable.
> The recommender consists of five main parts: the job class that runs the
> code, an input conversion section, a training section, a re-trainer section
> and a prediction section. A brief overview of these sections follows.
> The SVD++ Classes:
> The Recommender Job class:
> This class is the foundation of the recommender and allows it to run on
> Hadoop by implementing the Tool interface through AbstractJob. This class
> will parse any user arguments and set up the jobs that will run the algorithm
> on Map/Reduce, much in the same way as Mahout's other distributed
> recommenders, such as RecommenderJob, do.
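> A minimal sketch of how this job class might look, assuming Mahout's
> AbstractJob helpers; the class name and the numFactors option are
> illustrative only, and AbstractJob's helper signatures vary slightly
> across Mahout versions:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.util.ToolRunner;
> import org.apache.mahout.common.AbstractJob;
>
> public class SVDppRecommenderJob extends AbstractJob {
>
>   @Override
>   public int run(String[] args) throws Exception {
>     addInputOption();
>     addOutputOption();
>     // Illustrative option; real names would be agreed with the community.
>     addOption("numFactors", "k", "number of latent factors to train");
>
>     if (parseArguments(args) == null) {
>       return -1; // unparsable or missing arguments
>     }
>
>     // Chain the Map/Reduce phases here, as RecommenderJob does:
>     // input conversion, then training, then prediction.
>     return 0;
>   }
>
>   public static void main(String[] args) throws Exception {
>     ToolRunner.run(new Configuration(), new SVDppRecommenderJob(), args);
>   }
> }
> {code}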
> Input Mapper/Reducer Classes:
> The Mapper will take the input data and convert it to key-value pairs in the
> form of Hadoop Writables. The Reducer will take the Mapper's Writables and
> create sparse vectors. This is in keeping with Mahout's ToItemPrefsMapper and
> ToUserVectorReducer.
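> A sketch of the input-conversion Mapper, assuming comma-separated
> "userID,itemID,rating" input lines; the class name is hypothetical:
> {code}
> import java.io.IOException;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class PrefsMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
>
>   @Override
>   protected void map(LongWritable offset, Text line, Context ctx)
>       throws IOException, InterruptedException {
>     // Assumes "userID,itemID,rating" lines; real parsing needs to be more robust.
>     String[] tokens = line.toString().split(",");
>     long userID = Long.parseLong(tokens[0]);
>     // Key by user so the Reducer can build one sparse preference vector
>     // per user, as ToUserVectorReducer does.
>     ctx.write(new LongWritable(userID), new Text(tokens[1] + ':' + tokens[2]));
>   }
> }
> {code}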
> Training Mapper/Reducer Classes:
> This phase creates the factor matrices that will later be used to make the
> predictions. It does this by making a prediction of a known rating
> using the SVD, comparing it against the known rating, calculating the error
> and updating the factors accordingly. The Mapper will loop through the
> training of the factor matrices. The Reducer will collect the outputs of the
> Mapper to create the dense factor matrices.
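> For a single known rating, the basic SVD update is a standard stochastic
> gradient descent step; a sketch (SVD++ adds the bias and implicit-feedback
> terms on top of this, and the learning rate and regularisation constant
> here are illustrative):
> {code}
> /**
>  * One training step for a single known rating: predict with the current
>  * factors, measure the error, and move the user and item factor vectors
>  * against the error gradient, with regularisation to avoid overfitting.
>  */
> static void trainStep(double[] userFactors, double[] itemFactors,
>                       double rating, double learningRate, double lambda) {
>   double prediction = 0.0;
>   for (int f = 0; f < userFactors.length; f++) {
>     prediction += userFactors[f] * itemFactors[f];
>   }
>   double err = rating - prediction;
>   for (int f = 0; f < userFactors.length; f++) {
>     double pu = userFactors[f];
>     double qi = itemFactors[f];
>     userFactors[f] += learningRate * (err * qi - lambda * pu);
>     itemFactors[f] += learningRate * (err * pu - lambda * qi);
>   }
> }
> {code}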
> Re-trainer Mapper/Reducer Classes:
> This is a separate job that would be used to update the factor matrices
> taking into account any new user ratings. The Mapper/Reducer follow a similar
> pattern to the training section.
> Prediction Mapper/Reducer Classes:
> This job would take the factor matrices from the training phase and use them
> in the SVD to create the required predictions. Most likely this would be done
> for all user/item preferences, to create a dense matrix of all user/item
> combinations. However, potentially it could also be done for individual
> users. Just like in training, the Mapper would handle the predictions and the
> Reducer would collect them into a dense matrix.
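> For reference, the SVD++ prediction rule from [1] augments the user factor
> vector with an implicit-feedback term summed over the items N(u) the user
> has interacted with; a sketch (variable names are illustrative):
> {code}
> /**
>  * rHat(u,i) = mu + bu + bi + qi . (pu + |N(u)|^(-1/2) * sum_{j in N(u)} yj)
>  * where mu is the global rating mean, bu/bi are the user/item biases,
>  * pu/qi are the user/item factor vectors, and yj is the implicit factor
>  * vector of item j in N(u), the set of items the user has interacted with.
>  */
> static double predict(double mu, double bu, double bi,
>                       double[] pu, double[] qi, double[][] implicitFactors) {
>   int numFactors = pu.length;
>   double[] sum = new double[numFactors];
>   for (double[] yj : implicitFactors) {
>     for (int f = 0; f < numFactors; f++) {
>       sum[f] += yj[f];
>     }
>   }
>   double norm = implicitFactors.length > 0
>       ? 1.0 / Math.sqrt(implicitFactors.length) : 0.0;
>   double dot = 0.0;
>   for (int f = 0; f < numFactors; f++) {
>     dot += qi[f] * (pu[f] + norm * sum[f]);
>   }
>   return mu + bu + bi + dot;
> }
> {code}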
> Timeline:
> The Warm Up/Bonding Period (<=May 23rd):
> - familiarise myself further with Mahout and Hadoop's code base and
> documentation
> - discuss with community the proposal, design and implementation as well as
> related code tests, optimisations and documentation they would like to see
> incorporated into the project
> - build a more detailed design of algorithm implementation and tweak timeline
> based on feedback
> - familiarise myself more with unit testing
> - finish building a 3-4 node Hadoop cluster and play with all the examples
> Week 1 (May 24th-30th):
> - start writing the backbone of the code in the form of comments and
> skeleton code
> - implement SVDppRecommenderJob
> - start to integrate DistributedLanczosSolver
> Week 2 (May 31st - June 6th):
> - complete DistributedLanczosSolver integration
> - start implementing distributed training, prediction and regularisation
> Week 3 - 5 (June 7th - 27th):
> - complete implementation of distributed training, prediction and
> regularisation
> - work on testing, cleaning up code, and tying up any loose documentation ends
> - work on any documentation, tests and optimisation requested by community
> - Deliverable: basic framework for a distributed SVD-based recommender
> Week 6 - 7 (June 28th-July 11th):
> - start implementation of SVD++ (keeping documentation and tests up-to-date)
> - prepare demo
> Week 8 (July 12th - 18th): Mid-Term Report by the 16th
> - complete SVD++ and iron out bugs
> - implement and document demo
> - write wiki articles and tutorial for what has been implemented including
> the demo
> Week 9 (July 19th - 25th):
> - work on any documentation, tests and optimisation requested by community
> during project
> - work on testing, cleaning up code, and tying up any loose documentation ends
> - Deliverable: Distributed SVD++ Recommender (including Demo)
> Week 10 - 11 (July 26th - Aug 8th):
> - incorporate temporal dynamics
> - write temporal dynamics documentation, including wiki article
> Week 12 (Aug 9th - 15th): Suggested Pencils Down
> - last optimisation and tidy up of code, documentation, tests and demo
> - discuss with community what comes next, consider what JIRA issues to
> contribute to
> - Deliverable: Distributed SVD++ Recommender with temporal dynamics
> Final Evaluations Hand-in: Aug 16th-20th.
> Additional Information:
> I am currently a final year undergraduate at the University of Reading, UK,
> studying BSc Applied Computer Science. As part of this degree I did a 9 month
> industrial placement programming in Java for an international non-profit
> organisation. While in this final year of study I have been fortunate enough
> to take modules in Distributed Computing, Evolutionary Computation and
> Concurrent Systems, which have proven to be the best of my University career.
> Each module required me to design and implement algorithms of either a
> distributed or machine learning nature [A], two in Java, one in C. For my
> final year project I have implemented a Gaming Persistence Platform in
> ActionScript, PHP and MySQL. It allows the user to play computer games
> integrated with the platform from different devices and/or different software
> platforms while retaining game progress across all instances. I am looking
> forward to the day when mobile computing goes distributed.
> I am a passionate user of open source software and now I am looking to become
> involved as a developer. Having enjoyed my Distributed Computing and
> Evolutionary Computation modules so much, and after reading all the
> introductory pages about the ASF I realised Mahout would be a great place to
> start. After graduation (and GSoC) I hope to continue contributing to Mahout
> while working in a related field.
> Outside of uni, when I'm not coding, I like to push my limits through
> meditation, mountaineering and running. I also love to cook!
> After my last exam on May 21st my only commitment over the summer will be
> GSoC. As such GSoC will be treated like a full-time job.
> [1] - Y. Koren, "Factorization Meets the Neighborhood: a Multifaceted
> Collaborative Filtering Model", ACM Press, 2008,
> http://public.research.att.com/~volinsky/netflix/kdd08koren.pdf
> [2] - Y. Koren, "Collaborative Filtering with Temporal Dynamics", ACM Press,
> 2009, http://research.yahoo.com/files/kdd-fp074-koren.pdf
> [3] - R. M. Bell, Y. Koren, and C. Volinsky, "The BellKor 2008 Solution to
> the Netflix Prize",
> http://www.netflixprize.com/assets/ProgressPrize2008_BellKor.pdf
> [A] - Example of code I've implemented as part of University degree,
> http://www.richardsimonjust.com/code/