[
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218667#comment-14218667
]
Debasish Das edited comment on SPARK-3066 at 11/19/14 10:59 PM:
----------------------------------------------------------------
[~mengxr] As we discussed, I added APIs for batch user and product
recommendation, plus MAP computation for recommending the topK products to
users. Note that I don't use reservoir sampling; instead I used your idea of
filtering out the test-set users for which no model has been built. I think
reservoir sampling should be part of a separate PR.
APIs added:
recommendProductsForUsers(num: Int): topK is fixed for all users
recommendProductsForUsers(userTopK: RDD[(Int, Int)]): variable topK for every user
recommendUsersForProducts(num: Int): topK is fixed for all products
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product
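To illustrate the difference between the fixed and variable topK variants, here
is a minimal non-Spark sketch in plain Python. The function name, the dict-based
factor layout, and the sample factors are hypothetical and only mirror the shape
of the APIs listed above; the actual patch works on RDDs with mllib BLAS.

```python
import heapq

def dot(a, b):
    """Inner product of two equal-length factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend_products_for_users(user_factors, product_factors, top_k):
    """Return {user: [(product, score), ...]} ranked by descending score.

    top_k is either an int (same k for every user) or a dict mapping
    user id -> k (variable k per user). Users missing from the dict get k = 0.
    """
    result = {}
    for user, uf in user_factors.items():
        k = top_k if isinstance(top_k, int) else top_k.get(user, 0)
        scored = ((dot(uf, pf), product) for product, pf in product_factors.items())
        result[user] = [(product, score) for score, product in heapq.nlargest(k, scored)]
    return result

# Hypothetical rank-2 factors, made up for illustration only.
user_factors = {1: [1.0, 0.0], 2: [0.0, 1.0]}
product_factors = {10: [0.9, 0.1], 20: [0.2, 0.8], 30: [0.5, 0.5]}

fixed = recommend_products_for_users(user_factors, product_factors, 2)
variable = recommend_products_for_users(user_factors, product_factors, {1: 1, 2: 3})
```

The variable form takes the per-user k as data, which is presumably why the
Scala signature accepts an RDD of (id, k) pairs rather than a single Int.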
I used mllib BLAS for all the computation and removed the DoubleMatrix code from
MatrixFactorizationModel. I have not used level 3 BLAS yet; I can add that as
well if the rest of the flow makes sense.
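As a rough picture of what a level 3 BLAS variant could look like, the sketch
below scores a whole block of users against all products with one matrix-matrix
product and then takes the top k per row. It is plain Python, not the patch:
gemm here is a hypothetical stand-in for a real BLAS call, and the block layout
and sample factors are assumptions.

```python
import heapq

def gemm(a, b):
    """Return a * b^T for row-major lists of rows.

    Stand-in for a level 3 BLAS matrix-matrix multiply; a real implementation
    would hand both blocks to the BLAS library in a single call.
    """
    return [[sum(x * y for x, y in zip(row, col)) for col in b] for row in a]

def top_k_by_block(user_ids, user_block, product_ids, product_matrix, k):
    """Score one block of users against all products, then rank each row."""
    scores = gemm(user_block, product_matrix)  # one multiply per user block
    ranked = {}
    for user, row in zip(user_ids, scores):
        best = heapq.nlargest(k, zip(row, product_ids))
        ranked[user] = [product for _, product in best]
    return ranked

# Hypothetical rank-2 factors, made up for illustration only.
block_result = top_k_by_block(
    [1, 2], [[1.0, 0.0], [0.0, 1.0]],
    [10, 20, 30], [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    2,
)
```

The per-pair dot products (level 1) and the blocked multiply produce the same
scores; the blocked form only changes how the work is batched.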
On examples.MovieLensALS, the user MAP calculation can be enabled with the
--validateRecommendation flag:
./bin/spark-submit --master spark://localhost:7077 --jars scopt_2.10-3.2.0.jar
--total-executor-cores 4 --executor-memory 4g --driver-memory 1g --class
org.apache.spark.examples.mllib.MovieLensALS
./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --kryo --lambda 0.065
--validateRecommendation hdfs://localhost:8020/sandbox/movielens/
Got 1000209 ratings from 6040 users on 3706 movies.
Training: 799617, test: 200592.
Test RMSE = 0.8495476608536306.
Test users 6032 MAP 0.03798337814233403
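For readers unfamiliar with the metric, here is a minimal plain-Python sketch of
mean average precision over ranked recommendations, including the step of
dropping test users that have no recommendations (no model). The function names
and sample data are hypothetical, and normalizing by the number of relevant
items is one common convention; the patch may normalize differently.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this hit
    return total / len(relevant)

def mean_average_precision(recommendations, test_items):
    """Return (MAP, number of test users actually evaluated).

    Test users without an entry in recommendations are filtered out,
    mirroring the idea of skipping users for which no model was built.
    """
    users = [u for u in test_items if u in recommendations]
    if not users:
        return 0.0, 0
    total = sum(average_precision(recommendations[u], test_items[u]) for u in users)
    return total / len(users), len(users)

# Hypothetical data: user 3 appears only in the test set and is skipped.
recommendations = {1: [10, 30, 20], 2: [20, 10, 30]}
test_items = {1: {10, 20}, 2: {30}, 3: {10}}
map_value, evaluated_users = mean_average_precision(recommendations, test_items)
```

The second return value plays the same role as the "Test users" count in the
output above: it reports how many test users remained after filtering.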
I will run the Netflix dataset and report the MAP measures for that.
On our internal datasets, I have tested topK users for each product with 1M
users, 10K products, 120 cores, and 240GB; that takes around 5 minutes. On
average I generate a ranked list of 6000 users for each product. Internally we
are basically using the batch API:
recommendUsersForProducts(productTopK: RDD[(Int, Int)]): variable topK for every product
> Support recommendAll in matrix factorization model
> --------------------------------------------------
>
> Key: SPARK-3066
> URL: https://issues.apache.org/jira/browse/SPARK-3066
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
>
> ALS returns a matrix factorization model, which we can use to predict ratings
> for individual queries as well as small batches. In practice, users may want
> to compute top-k recommendations offline for all users. It is very expensive
> but a common problem. We can do some optimization like
> 1) collect one side (either user or product) and broadcast it as a matrix
> 2) use level-3 BLAS to compute inner products
> 3) use Utils.takeOrdered to find top-k