Re: MatrixFactorizationModel predict(Int, Int) API
I tested 2 different implementations to generate the predicted ranked list...The first version uses a cartesian of user and product features and then generates a predicted value for each (user,product) key... The second version does a collect on the skinny matrix (most likely products) and then broadcasts it to every node which computes the predicted value... cartesian is slower than the broadcast version...but in the broadcast version also the shuffle time is significant..Bottleneck is the groupBy on (user,product) composite key followed by local sort to generate topK... The third version I thought of was to use topK predict API but this works only if topK is bounded by a small number...If topK is large (say 100K) it does not work since then it is bounded by master memory... The block-wise cross product idea will optimize the groupBy right ? we break user and feature matrices into blocks (re-use ALS partitioning) and then in place of using (user,product) as a key use (userBlock, productBlock) as key...Does this help improve in shuffle size ? On Thu, Nov 6, 2014 at 5:07 PM, Xiangrui Meng men...@gmail.com wrote: There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote: model.recommendProducts can only be called from the master then ? I have a set of 20% users on whom I am performing the test...the 20% users are in a RDD...if I have to collect them all to master node and then call model.recommendProducts, that's a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) = val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=x.product == product} require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem = if (allRatings.get(elem.user, elem.product) != elem.rating) fail(Prediction APIs don't match) } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb
Re: MatrixFactorizationModel predict(Int, Int) API
I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) = val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=x.product == product } require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem = if (allRatings.get(elem.user, elem.product) != elem.rating) fail(Prediction APIs don't match) } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb
Re: MatrixFactorizationModel predict(Int, Int) API
ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) = val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=x.product == product} require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem = if (allRatings.get(elem.user, elem.product) != elem.rating) fail(Prediction APIs don't match) } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: MatrixFactorizationModel predict(Int, Int) API
model.recommendProducts can only be called from the master then ? I have a set of 20% users on whom I am performing the test...the 20% users are in a RDD...if I have to collect them all to master node and then call model.recommendProducts, that's a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) = val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=x.product == product} require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem = if (allRatings.get(elem.user, elem.product) != elem.rating) fail(Prediction APIs don't match) } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb
Re: MatrixFactorizationModel predict(Int, Int) API
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote: model.recommendProducts can only be called from the master then ? I have a set of 20% users on whom I am performing the test...the 20% users are in a RDD...if I have to collect them all to master node and then call model.recommendProducts, that's a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{case(user,product) = val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{x=x.product == product} require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem = if (allRatings.get(elem.user, elem.product) != elem.rating) fail(Prediction APIs don't match) } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head to work ? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: MatrixFactorizationModel predict(Int, Int) API
Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org