Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-10 Thread Debasish Das
I tested 2 different implementations to generate the predicted ranked
list...The first version uses a cartesian of user and product features and
then generates a predicted value for each (user,product) key...

The second version does a collect on the skinny matrix (most likely
products) and then broadcasts it to every node which computes the predicted
value...

cartesian is slower than the broadcast version...but in the broadcast
version also the shuffle time is significant..Bottleneck is the groupBy on
(user,product) composite key followed by local sort to generate topK...

The third version I thought of was to use topK predict API but this works
only if topK is bounded by a small number...If topK is large (say 100K) it
does not work since then it is bounded by master memory...

The block-wise cross product idea will optimize the groupBy right ? we
break user and feature matrices into blocks (re-use ALS partitioning) and
then in place of using (user,product) as a key use (userBlock,
productBlock) as key...Does this help improve in shuffle size ?


On Thu, Nov 6, 2014 at 5:07 PM, Xiangrui Meng men...@gmail.com wrote:

 There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066

 The easiest case is when one side is small. If both sides are large,
 this is a super-expensive operation. We can do block-wise cross
 product and then find top-k for each user.

 Best,
 Xiangrui

 On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  model.recommendProducts can only be called from the master then ? I have
 a
  set of 20% users on whom I am performing the test...the 20% users are in
 a
  RDD...if I have to collect them all to master node and then call
  model.recommendProducts, that's a issue...
 
  Any idea how to optimize this so that we can calculate MAP statistics on
  large samples of data ?
 
 
  On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:
 
  ALS model contains RDDs. So you cannot put `model.recommendProducts`
  inside a RDD closure `userProductsRDD.map`. -Xiangrui
 
  On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com
  wrote:
   I reproduced the problem in mllib tests ALSSuite.scala using the
   following
   functions:
  
   val arrayPredict = userProductsRDD.map{case(user,product) =
  
val recommendedProducts = model.recommendProducts(user,
   products)
  
val productScore = recommendedProducts.find{x=x.product ==
   product}
  
 require(productScore != None)
  
 productScore.get
  
   }.collect
  
   arrayPredict.foreach { elem =
  
 if (allRatings.get(elem.user, elem.product) != elem.rating)
  
 fail(Prediction APIs don't match)
  
   }
  
   If the usage of model.recommendProducts is correct, the test fails
 with
   the
   same error I sent before...
  
   org.apache.spark.SparkException: Job aborted due to stage failure:
 Task
   0 in
   stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in
 stage
   316.0 (TID 79, localhost): scala.MatchError: null
  
  
 org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
  
  
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
  
   It is a blocker for me and I am debugging it. I will open up a JIRA if
   this
   is indeed a bug...
  
   Do I have to cache the models to make userFeatures.lookup(user).head
 to
   work
   ?
  
  
   On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com
 wrote:
  
   Was user presented in training? We can put a check there and return
   NaN if the user is not included in the model. -Xiangrui
  
   On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das 
 debasish.da...@gmail.com
   wrote:
Hi,
   
I am testing MatrixFactorizationModel.predict(user: Int, product:
Int)
but
the code fails on userFeatures.lookup(user).head
   
In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)])
 has
been
called and in all the test-cases that API has been used...
   
I can perhaps refactor my code to do the same but I was wondering
whether
people test the lookup(user) version of the code..
   
Do I need to cache the model to make it work ? I think right now
default
is
STORAGE_AND_DISK...
   
Thanks.
Deb
  
  
 
 



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
I reproduced the problem in mllib tests ALSSuite.scala using the following
functions:

val arrayPredict = userProductsRDD.map{case(user,product) =

 val recommendedProducts = model.recommendProducts(user, products)

 val productScore = recommendedProducts.find{x=x.product == product
}

  require(productScore != None)

  productScore.get

}.collect

arrayPredict.foreach { elem =

  if (allRatings.get(elem.user, elem.product) != elem.rating)

  fail(Prediction APIs don't match)

}

If the usage of model.recommendProducts is correct, the test fails with the
same error I sent before...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
316.0 (TID 79, localhost): scala.MatchError: null

org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

It is a blocker for me and I am debugging it. I will open up a JIRA if this
is indeed a bug...

Do I have to cache the models to make userFeatures.lookup(user).head to
work ?

On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:

 Was user presented in training? We can put a check there and return
 NaN if the user is not included in the model. -Xiangrui

 On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
 but
  the code fails on userFeatures.lookup(user).head
 
  In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
  called and in all the test-cases that API has been used...
 
  I can perhaps refactor my code to do the same but I was wondering whether
  people test the lookup(user) version of the code..
 
  Do I need to cache the model to make it work ? I think right now default
 is
  STORAGE_AND_DISK...
 
  Thanks.
  Deb



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
ALS model contains RDDs. So you cannot put `model.recommendProducts`
inside a RDD closure `userProductsRDD.map`. -Xiangrui

On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote:
 I reproduced the problem in mllib tests ALSSuite.scala using the following
 functions:

 val arrayPredict = userProductsRDD.map{case(user,product) =

  val recommendedProducts = model.recommendProducts(user, products)

  val productScore = recommendedProducts.find{x=x.product ==
 product}

   require(productScore != None)

   productScore.get

 }.collect

 arrayPredict.foreach { elem =

   if (allRatings.get(elem.user, elem.product) != elem.rating)

   fail(Prediction APIs don't match)

 }

 If the usage of model.recommendProducts is correct, the test fails with the
 same error I sent before...

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
 stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
 316.0 (TID 79, localhost): scala.MatchError: null

 org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)

 It is a blocker for me and I am debugging it. I will open up a JIRA if this
 is indeed a bug...

 Do I have to cache the models to make userFeatures.lookup(user).head to work
 ?


 On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:

 Was user presented in training? We can put a check there and return
 NaN if the user is not included in the model. -Xiangrui

 On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
  but
  the code fails on userFeatures.lookup(user).head
 
  In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
  been
  called and in all the test-cases that API has been used...
 
  I can perhaps refactor my code to do the same but I was wondering
  whether
  people test the lookup(user) version of the code..
 
  Do I need to cache the model to make it work ? I think right now default
  is
  STORAGE_AND_DISK...
 
  Thanks.
  Deb



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Debasish Das
model.recommendProducts can only be called from the master then ? I have a
set of 20% users on whom I am performing the test...the 20% users are in a
RDD...if I have to collect them all to master node and then call
model.recommendProducts, that's a issue...

Any idea how to optimize this so that we can calculate MAP statistics on
large samples of data ?


On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:

 ALS model contains RDDs. So you cannot put `model.recommendProducts`
 inside a RDD closure `userProductsRDD.map`. -Xiangrui

 On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  I reproduced the problem in mllib tests ALSSuite.scala using the
 following
  functions:
 
  val arrayPredict = userProductsRDD.map{case(user,product) =
 
   val recommendedProducts = model.recommendProducts(user,
 products)
 
   val productScore = recommendedProducts.find{x=x.product ==
  product}
 
require(productScore != None)
 
productScore.get
 
  }.collect
 
  arrayPredict.foreach { elem =
 
if (allRatings.get(elem.user, elem.product) != elem.rating)
 
fail(Prediction APIs don't match)
 
  }
 
  If the usage of model.recommendProducts is correct, the test fails with
 the
  same error I sent before...
 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task
 0 in
  stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
  316.0 (TID 79, localhost): scala.MatchError: null
 
  org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
 
  It is a blocker for me and I am debugging it. I will open up a JIRA if
 this
  is indeed a bug...
 
  Do I have to cache the models to make userFeatures.lookup(user).head to
 work
  ?
 
 
  On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Was user presented in training? We can put a check there and return
  NaN if the user is not included in the model. -Xiangrui
 
  On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
  wrote:
   Hi,
  
   I am testing MatrixFactorizationModel.predict(user: Int, product: Int)
   but
   the code fails on userFeatures.lookup(user).head
  
   In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
   been
   called and in all the test-cases that API has been used...
  
   I can perhaps refactor my code to do the same but I was wondering
   whether
   people test the lookup(user) version of the code..
  
   Do I need to cache the model to make it work ? I think right now
 default
   is
   STORAGE_AND_DISK...
  
   Thanks.
   Deb
 
 



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066

The easiest case is when one side is small. If both sides are large,
this is a super-expensive operation. We can do block-wise cross
product and then find top-k for each user.

Best,
Xiangrui

On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das debasish.da...@gmail.com wrote:
 model.recommendProducts can only be called from the master then ? I have a
 set of 20% users on whom I am performing the test...the 20% users are in a
 RDD...if I have to collect them all to master node and then call
 model.recommendProducts, that's a issue...

 Any idea how to optimize this so that we can calculate MAP statistics on
 large samples of data ?


 On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote:

 ALS model contains RDDs. So you cannot put `model.recommendProducts`
 inside a RDD closure `userProductsRDD.map`. -Xiangrui

 On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  I reproduced the problem in mllib tests ALSSuite.scala using the
  following
  functions:
 
  val arrayPredict = userProductsRDD.map{case(user,product) =
 
   val recommendedProducts = model.recommendProducts(user,
  products)
 
   val productScore = recommendedProducts.find{x=x.product ==
  product}
 
require(productScore != None)
 
productScore.get
 
  }.collect
 
  arrayPredict.foreach { elem =
 
if (allRatings.get(elem.user, elem.product) != elem.rating)
 
fail(Prediction APIs don't match)
 
  }
 
  If the usage of model.recommendProducts is correct, the test fails with
  the
  same error I sent before...
 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task
  0 in
  stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage
  316.0 (TID 79, localhost): scala.MatchError: null
 
  org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
 
  org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
 
  It is a blocker for me and I am debugging it. I will open up a JIRA if
  this
  is indeed a bug...
 
  Do I have to cache the models to make userFeatures.lookup(user).head to
  work
  ?
 
 
  On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Was user presented in training? We can put a check there and return
  NaN if the user is not included in the model. -Xiangrui
 
  On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com
  wrote:
   Hi,
  
   I am testing MatrixFactorizationModel.predict(user: Int, product:
   Int)
   but
   the code fails on userFeatures.lookup(user).head
  
   In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has
   been
   called and in all the test-cases that API has been used...
  
   I can perhaps refactor my code to do the same but I was wondering
   whether
   people test the lookup(user) version of the code..
  
   Do I need to cache the model to make it work ? I think right now
   default
   is
   STORAGE_AND_DISK...
  
   Thanks.
   Deb
 
 



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was user presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui

On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote:
 Hi,

 I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but
 the code fails on userFeatures.lookup(user).head

 In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been
 called and in all the test-cases that API has been used...

 I can perhaps refactor my code to do the same but I was wondering whether
 people test the lookup(user) version of the code..

 Do I need to cache the model to make it work ? I think right now default is
 STORAGE_AND_DISK...

 Thanks.
 Deb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org