[ https://issues.apache.org/jira/browse/SPARK-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengbing li updated SPARK-2257:
--------------------------------

Description:

When I test the ALS algorithm on the Netflix data, I cannot reproduce the accuracy reported in the paper. The best MSE I get is 0.9066300038109709 (RMSE 0.952), which is worse than the paper's result, and increasing the number of features or the number of iterations makes the result worse. After studying the paper and the source code, I found a bug in the updateBlock function of ALS.

The original code is:

    while (i < rank) {
      fullXtX.data(i * rank + i) += lambda
      i += 1
    }

This code does not take into account the number of products each user has rated. It should be modified to:

    while (i < rank) {
      // ratingsNum(index) is the number of products this user has rated
      fullXtX.data(i * rank + i) += lambda * ratingsNum(index)
      i += 1
    }

After this change the MSE decreases. Here is one test result, with these settings:

    val numIterations = 20
    val features = 30
    val model = ALS.train(trainRatings, features, numIterations, 0.06)

Result of the modified version:

    MSE: Double = 0.8472313396478773
    RMSE: 0.92045

Result of version 1.0:

    MSE: Double = 1.2680743123043832
    RMSE: 1.1261

In order to make the vector ratingsNum available, I want to change the InLinkBlock structure as follows:

    private[recommendation] case class InLinkBlock(
        elementIds: Array[Int],
        ratingsNum: Array[Int],
        ratingsForBlock: Array[Array[(Array[Int], Array[Double])]])

This lets me calculate the vector ratingsNum in the makeInLinkBlock function. This is the code I add in makeInLinkBlock:

    ...
    // added
    val ratingsNum = new Array[Int](numUsers)
    ratings.foreach(r => ratingsNum(userIdToPos(r.user)) += 1)
    // end of added
    InLinkBlock(userIds, ratingsNum, ratingsForBlock)
    ...

Is this solution reasonable?


> The algorithm of ALS in MLlib lacks a parameter
> -----------------------------------------------
>
>                 Key: SPARK-2257
>                 URL: https://issues.apache.org/jira/browse/SPARK-2257
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>         Environment: spark 1.0
>            Reporter: zhengbing li
>              Labels: patch
>             Fix For: 1.1.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
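The proposed change corresponds to the weighted-lambda regularization of the ALS paper (Zhou et al., "Large-Scale Parallel Collaborative Filtering for the Netflix Prize"), where the regularization term for each user is scaled by that user's rating count. A minimal standalone sketch of the two pieces involved, the diagonal update and the per-user rating count; the names (rank, lambda, ratingsNum) follow the snippets above, while the flat row-major array standing in for fullXtX.data and the helper names are assumptions for illustration:

```scala
object WeightedLambdaSketch {

  // Add lambda, scaled by the user's rating count, to the diagonal of a
  // rank x rank Gram matrix stored row-major in a flat array
  // (mirroring fullXtX.data in updateBlock).
  def regularize(xtx: Array[Double], rank: Int, lambda: Double, nRatings: Int): Unit = {
    var i = 0
    while (i < rank) {
      // weighted-lambda regularization: lambda * n_u instead of plain lambda
      xtx(i * rank + i) += lambda * nRatings
      i += 1
    }
  }

  // Count how many products each user rated, as proposed for makeInLinkBlock.
  // userPositions(k) is the block-local position of the user of rating k.
  def countRatings(userPositions: Seq[Int], numUsers: Int): Array[Int] = {
    val counts = new Array[Int](numUsers)
    userPositions.foreach(u => counts(u) += 1)
    counts
  }

  def main(args: Array[String]): Unit = {
    // user 0 rated 3 products, user 1 rated 1 product
    val counts = countRatings(Seq(0, 0, 1, 0), numUsers = 2)
    println(counts.mkString(","))       // 3,1

    val xtx = Array(1.0, 0.5, 0.5, 2.0) // 2x2 row-major Gram matrix
    regularize(xtx, rank = 2, lambda = 0.06, nRatings = counts(0))
    println(xtx(0))                     // 1.0 + 0.06 * 3 = 1.18
    println(xtx(3))                     // 2.0 + 0.06 * 3 = 2.18
  }
}
```

With nRatings fixed to 1 this reduces to the version-1.0 behavior, which is why the unweighted code effectively under-regularizes heavy raters and over-regularizes light ones as rank or iterations grow.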
--
This message was sent by Atlassian JIRA
(v6.2#6252)