[ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740836#action_12740836 ]
Prasen Mukherjee commented on MAHOUT-106:
-----------------------------------------

I totally agree with Julien's comment (1). I was too lazy to write the UDF Java code.

On (2): the main E/M code is in plsi_singleiteration.pig. Although, as you rightly pointed out, the computation of q* produces that many (s*z*u) results, I feel that the E/M Pig code never loads that much data into memory at any one time. I think that at any point we are accessing at most s*z or s*u entries. Even that can be eliminated by introducing an algebraic UDF, which is probably what happens at the 1st and 2nd M-steps in the following lines:

--compute sum(z,u) = sum_over_z(nq_zu) -- means group by u
--compute sum(s,z) = sum_over_s(nq_sz) -- means group by z

Having said that, I admit that I have not personally run it on very large datasets.

> PLSI/EM in pig based on hofmann's ACM 04 paper.
> ------------------------------------------------
>
> Key: MAHOUT-106
> URL: https://issues.apache.org/jira/browse/MAHOUT-106
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Environment: Pig/Hadoop
> Reporter: Prasen Mukherjee
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.2
>
> Attachments: plsi_pig.patch
>
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Based on the following paper by Hofmann: T. Hofmann, "Latent Semantic Models
> for Collaborative Filtering", ACM Transactions on Information Systems, 2004,
> vol. 22(1), pp. 89-115.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.