[ https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740629#action_12740629 ]
Julien Le Dem commented on MAHOUT-106:
--------------------------------------

Hi,

First of all, thanks a lot to Prasen for this PLSI implementation :)

Two comments:

1) As is, it only works in Pig local mode and depends on Python. I suggest removing the Python dependency and updating the scripts so that they also run in mapred mode. If you agree, I can propose an updated patch.

2) I've been looking at the complexity of the algorithm. The computation of Q* produces as many records as (number of users) * (number of stories) * (number of values of z), which quickly grows to a very large number. The article states that it was run on a dataset of 61265 * 1623 * 30 ~ 3E9 records for Q*. I'm looking at the record count rather than the number of operations because the records are what will cause IO and become a bottleneck in the processing.

Have you tried running it on larger datasets? What optimizations do you think could be applied to run on larger datasets?

> PLSI/EM in pig based on hofmann's ACM 04 paper.
> ------------------------------------------------
>
>                 Key: MAHOUT-106
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>         Environment: Pig/Hadoop
>            Reporter: Prasen Mukherjee
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.2
>
>         Attachments: plsi_pig.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Based on the following paper by hofmann:
> T. Hofmann, Latent Semantic Models for Collaborative Filtering. In ACM Transactions on Information Systems, 2004, vol 22(1), pp. 89-115.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
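
The record-count estimate for Q* in comment 2) can be checked with a few lines of Python. This is only a sketch of the arithmetic, not part of the patch: the function name `q_star_record_count` and its parameter names are hypothetical, and it assumes the three factors in the comment are given in the order users, stories, values of z.

```python
def q_star_record_count(num_users: int, num_stories: int, num_z: int) -> int:
    """Number of Q* records if one record is emitted per (user, story, z) triple.

    Assumption: every combination is materialized, as described in the comment.
    """
    return num_users * num_stories * num_z


# Figures quoted in the comment (assumed order: users, stories, values of z).
count = q_star_record_count(61265, 1623, 30)
print(count)           # 2982992850
print(f"{count:.1e}")  # 3.0e+09
```

The count is linear in each factor, so doubling the number of users or stories doubles the number of Q* records that have to be written and read.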