I tested 2 different implementations to generate the predicted ranked
list...The first version uses a cartesian of user and product features and
then generates a predicted value for each (user,product) key...

The second version does a collect on the skinny matrix (most likely
products) and then broadcasts it to every node which computes the predicted
value...

cartesian is slower than the broadcast version...but in the broadcast
version also the shuffle time is significant..Bottleneck is the groupBy on
(user,product) composite key followed by local sort to generate topK...

The third version I thought of was to use topK predict API but this works
only if topK is bounded by a small number...If topK is large (say 100K) it
does not work since then it is bounded by master memory...

The block-wise cross product idea will optimize the groupBy right ? we
break user and feature matrices into blocks (re-use ALS partitioning) and
then in place of using (user,product) as a key use (userBlock,
productBlock) as key...Does this help improve in shuffle size ?


On Thu, Nov 6, 2014 at 5:07 PM, Xiangrui Meng <men...@gmail.com> wrote:

> There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066
>
> The easiest case is when one side is small. If both sides are large,
> this is a super-expensive operation. We can do block-wise cross
> product and then find top-k for each user.
>
> Best,
> Xiangrui
>
> On Thu, Nov 6, 2014 at 4:51 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
> > model.recommendProducts can only be called from the master then ? I have
> a
> > set of 20% users on whom I am performing the test...the 20% users are in
> a
> > RDD...if I have to collect them all to master node and then call
> > model.recommendProducts, that's a issue...
> >
> > Any idea how to optimize this so that we can calculate MAP statistics on
> > large samples of data ?
> >
> >
> > On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> ALS model contains RDDs. So you cannot put `model.recommendProducts`
> >> inside a RDD closure `userProductsRDD.map`. -Xiangrui
> >>
> >> On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das <debasish.da...@gmail.com>
> >> wrote:
> >> > I reproduced the problem in mllib tests ALSSuite.scala using the
> >> > following
> >> > functions:
> >> >
> >> >         val arrayPredict = userProductsRDD.map{case(user,product) =>
> >> >
> >> >          val recommendedProducts = model.recommendProducts(user,
> >> > products)
> >> >
> >> >          val productScore = recommendedProducts.find{x=>x.product ==
> >> > product}
> >> >
> >> >           require(productScore != None)
> >> >
> >> >           productScore.get
> >> >
> >> >         }.collect
> >> >
> >> >         arrayPredict.foreach { elem =>
> >> >
> >> >           if (allRatings.get(elem.user, elem.product) != elem.rating)
> >> >
> >> >           fail("Prediction APIs don't match")
> >> >
> >> >         }
> >> >
> >> > If the usage of model.recommendProducts is correct, the test fails
> with
> >> > the
> >> > same error I sent before...
> >> >
> >> > org.apache.spark.SparkException: Job aborted due to stage failure:
> Task
> >> > 0 in
> >> > stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in
> stage
> >> > 316.0 (TID 79, localhost): scala.MatchError: null
> >> >
> >> >
> org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825)
> >> >
> >> >
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81)
> >> >
> >> > It is a blocker for me and I am debugging it. I will open up a JIRA if
> >> > this
> >> > is indeed a bug...
> >> >
> >> > Do I have to cache the models to make userFeatures.lookup(user).head
> to
> >> > work
> >> > ?
> >> >
> >> >
> >> > On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng <men...@gmail.com>
> wrote:
> >> >>
> >> >> Was "user" presented in training? We can put a check there and return
> >> >> NaN if the user is not included in the model. -Xiangrui
> >> >>
> >> >> On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das <
> debasish.da...@gmail.com>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I am testing MatrixFactorizationModel.predict(user: Int, product:
> >> >> > Int)
> >> >> > but
> >> >> > the code fails on userFeatures.lookup(user).head
> >> >> >
> >> >> > In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)])
> has
> >> >> > been
> >> >> > called and in all the test-cases that API has been used...
> >> >> >
> >> >> > I can perhaps refactor my code to do the same but I was wondering
> >> >> > whether
> >> >> > people test the lookup(user) version of the code..
> >> >> >
> >> >> > Do I need to cache the model to make it work ? I think right now
> >> >> > default
> >> >> > is
> >> >> > STORAGE_AND_DISK...
> >> >> >
> >> >> > Thanks.
> >> >> > Deb
> >> >
> >> >
> >
> >
>

Reply via email to