increasing concurrency of saveAsNewAPIHadoopFile?
I'm trying to write a JavaPairRDD to a downstream database using saveAsNewAPIHadoopFile with a custom OutputFormat, and the process is pretty slow. Is there a way to boost the concurrency of the save process? For example, something like splitting the RDD into multiple smaller RDDs and using Java threads to write the data out? That seems foreign to the way Spark works, so I'm not sure if there's a better way.
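One common approach (a sketch, not a definitive answer): saveAsNewAPIHadoopFile launches one write task per partition, so repartitioning the RDD before the save raises write concurrency without manual threading. The class name, partition count, and TextOutputFormat stand-in below are illustrative; you would substitute your own custom OutputFormat, and the downstream database has to be able to absorb the extra parallel connections.

```java
import java.util.Arrays;

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ConcurrentSave {
    // saveAsNewAPIHadoopFile runs one write task per partition, so
    // increasing the partition count increases write concurrency.
    public static <K, V> JavaPairRDD<K, V> withMorePartitions(JavaPairRDD<K, V> rdd, int n) {
        return rdd.repartition(n);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("concurrent-save");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, String> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("k1", "v1"),
                new Tuple2<>("k2", "v2")));

            // 32 is illustrative; tune to what the database can handle.
            // TextOutputFormat is a placeholder for your custom OutputFormat,
            // and the path is often ignored by database-backed OutputFormats.
            withMorePartitions(pairs, 32).saveAsNewAPIHadoopFile(
                args.length > 0 ? args[0] : "/tmp/concurrent-save-demo",
                String.class, String.class, TextOutputFormat.class);
        }
    }
}
```

Compared with hand-rolled threads over sub-RDDs, this keeps scheduling, retries, and locality in Spark's hands.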
getting started with mllib.recommendation.ALS
Question on the input and output for ALS.train() and MatrixFactorizationModel.predict(). My input is a list of Ratings(user_id, product_id, rating) and my ratings are on a scale of 1-5 (inclusive). When I compute predictions over the superset of all (user_id, product_id) pairs, the ratings produced are on a different scale. The question is this: do I need to normalize the data coming out of predict() to my own scale, or does the input need to be different? Thanks!
Re: getting started with mllib.recommendation.ALS
Thanks Sean. I realized that I was supplying train() with a very low rank, so I will retry with something higher and then play with lambda as needed.

On Tue, Jun 10, 2014 at 4:58 PM, Sean Owen so...@cloudera.com wrote:

> For trainImplicit(), the output is an approximation of a matrix of 0s and 1s, so the values are generally (not always) in [0,1]. But for train(), you should be predicting the original input matrix as-is, as I understand it. You should get output in about the same range as the input, but again not necessarily 1-5. If it's really different, you could be underfitting. Try less lambda, more features?
>
> On Tue, Jun 10, 2014 at 4:59 PM, Sandeep Parikh sand...@clusterbeep.org wrote:
>
>> Question on the input and output for ALS.train() and MatrixFactorizationModel.predict(). My input is a list of Ratings(user_id, product_id, rating) and my ratings are on a scale of 1-5 (inclusive). When I compute predictions over the superset of all (user_id, product_id) pairs, the ratings produced are on a different scale. The question is this: do I need to normalize the data coming out of predict() to my own scale, or does the input need to be different? Thanks!
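For reference, the retry described above might look like the following sketch of Spark's Java MLlib API. The dataset, rank, iteration count, and lambda values are all illustrative placeholders, not recommendations; the point is only where rank and lambda plug into ALS.train().

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsTuning {
    // Trains on a toy explicit 1-5 dataset and returns the predicted rating
    // for user 1, product 1. Values for rank/iterations/lambda are illustrative.
    public static double trainAndPredict(JavaSparkContext sc) {
        List<Rating> data = Arrays.asList(
            new Rating(1, 1, 5.0), new Rating(1, 2, 1.0),
            new Rating(2, 1, 4.0), new Rating(2, 3, 2.0),
            new Rating(3, 2, 3.0), new Rating(3, 3, 5.0));
        JavaRDD<Rating> ratings = sc.parallelize(data);

        int rank = 20;        // more latent features -> more model capacity
        int iterations = 10;
        double lambda = 0.01; // lower regularization -> less underfitting

        MatrixFactorizationModel model = ALS.train(ratings.rdd(), rank, iterations, lambda);

        // For explicit train(), predictions approximate the original 1-5 scale,
        // though individual values can fall outside it.
        return model.predict(1, 1);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("als-tuning");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("predicted rating for (1,1): " + trainAndPredict(sc));
        }
    }
}
```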
Re: Java RDD structure for Matrix predict?
Wisely, is mapToPair in Spark 0.9.1 or 1.0? I'm running the former and didn't see that method available.

I think the issue is that predict() is expecting an RDD containing a tuple of ints and not Integers. So if I use JavaPairRDD<Object, Object> with my original code snippet, things seem to at least compile for now.

On Tue, May 27, 2014 at 6:40 PM, giive chen thegi...@gmail.com wrote:

> Hi Sandeep
>
> I think you should use testRatings.mapToPair instead of testRatings.map. So the code should be:
>
>     JavaPairRDD<Integer, Integer> usersProducts = training.mapToPair(
>         new PairFunction<Rating, Integer, Integer>() {
>             public Tuple2<Integer, Integer> call(Rating r) throws Exception {
>                 return new Tuple2<Integer, Integer>(r.user(), r.product());
>             }
>         });
>
> It works on my side.
>
> Wisely Chen
>
> On Wed, May 28, 2014 at 6:27 AM, Sandeep Parikh sand...@clusterbeep.org wrote:
>
>> I've got a trained MatrixFactorizationModel via ALS.train(...) and now I'm trying to use it to predict some ratings like so:
>>
>>     JavaRDD<Rating> predictions = model.predict(usersProducts.rdd());
>>
>> Where usersProducts is built from an existing Ratings dataset like so:
>>
>>     JavaPairRDD<Integer, Integer> usersProducts = testRatings.map(
>>         new PairFunction<Rating, Integer, Integer>() {
>>             public Tuple2<Integer, Integer> call(Rating r) throws Exception {
>>                 return new Tuple2<Integer, Integer>(r.user(), r.product());
>>             }
>>         });
>>
>> The problem is that model.predict(...) doesn't like usersProducts, claiming that the method doesn't accept an RDD of type Tuple2; however, the docs show the method signature as follows:
>>
>>     def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]
>>
>> Am I missing something? The JavaRDD is just a list of Tuple2 elements, which would match the method signature, but the compiler is complaining. Thanks!
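On the open question above: as far as I recall, mapToPair was introduced when the Java API was reworked in Spark 1.0; in 0.9.x the equivalent is the map(PairFunction) overload on JavaRDD. Putting the two fixes together, a sketch of the end-to-end pattern (following the shape of Spark's later Java recommendation examples; the dataset and ALS parameters are placeholders) looks like this. The key point is declaring the pair RDD as <Object, Object>, since predict()'s Scala signature RDD[(Int, Int)] erases to RDD<Tuple2<Object, Object>> when seen from Java:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

import scala.Tuple2;

public class PredictFromJava {
    public static JavaRDD<Rating> predictAll(MatrixFactorizationModel model,
                                             JavaRDD<Rating> testRatings) {
        // Declare <Object, Object>, not <Integer, Integer>: from Java,
        // predict() accepts RDD<Tuple2<Object, Object>> due to Scala Int erasure.
        JavaPairRDD<Object, Object> usersProducts = testRatings.mapToPair(
            r -> new Tuple2<Object, Object>(r.user(), r.product()));
        return model.predict(JavaPairRDD.toRDD(usersProducts)).toJavaRDD();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("predict-demo");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
                new Rating(1, 1, 5.0), new Rating(1, 2, 1.0),
                new Rating(2, 1, 4.0), new Rating(2, 2, 2.0)));
            MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);
            System.out.println("predictions: " + predictAll(model, ratings).count());
        }
    }
}
```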
Java RDD structure for Matrix predict?
I've got a trained MatrixFactorizationModel via ALS.train(...) and now I'm trying to use it to predict some ratings like so:

    JavaRDD<Rating> predictions = model.predict(usersProducts.rdd());

Where usersProducts is built from an existing Ratings dataset like so:

    JavaPairRDD<Integer, Integer> usersProducts = testRatings.map(
        new PairFunction<Rating, Integer, Integer>() {
            public Tuple2<Integer, Integer> call(Rating r) throws Exception {
                return new Tuple2<Integer, Integer>(r.user(), r.product());
            }
        });

The problem is that model.predict(...) doesn't like usersProducts, claiming that the method doesn't accept an RDD of type Tuple2; however, the docs show the method signature as follows:

    def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]

Am I missing something? The JavaRDD is just a list of Tuple2 elements, which would match the method signature, but the compiler is complaining. Thanks!