[ 
https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12740836#action_12740836
 ] 

Prasen Mukherjee commented on MAHOUT-106:
-----------------------------------------

I totally agree with Julien's comment (1). I was too lazy to write the UDF 
code in Java.

On (2): The main E/M code is in plsi_singleiteration.pig. Although (as you 
have rightly pointed out) the computation of q* produces that many (s*z*u) 
results, I feel that the E/M Pig code does not load that much data into 
memory at any one time. I think at any point in time we are accessing at most 
(s*z or s*u) entries. Even that can be eliminated by introducing an algebraic 
UDF, which is probably what already happens in the 1st and 2nd M-steps, in 
the following lines:
--compute sum(z,u) = sum_over_z(nq_zu) -- means group by u
--compute sum(s,z) = sum_over_s(nq_sz) -- means group by z
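
To make the "algebraic" point concrete, here is a minimal Python sketch (not 
the Pig code from the patch). It assumes the grouped sums above are plain 
sums of q values; the rows, the chunk size, and the function names are 
hypothetical. Because a sum can be computed from partial sums, each chunk is 
reduced on its own and only one running total per group is kept in memory, 
which is what a combiner does for an algebraic function.

```python
from collections import defaultdict
from itertools import islice


def combine(pairs):
    """Merge (key, partial_sum) pairs into one partial sum per key.

    Summation is associative, so this step can be applied to raw
    records and again to earlier partial results.
    """
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)


def grouped_sum(records, chunk_size=2):
    """Sum values per key while holding only one chunk at a time."""
    it = iter(records)
    merged = {}
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        partial = combine(chunk)  # "map-side" partial sums
        merged = combine(list(merged.items()) + list(partial.items()))
    return merged


# Hypothetical (u, z, q) rows standing in for the E-step output.
rows = [
    ("u1", "z1", 0.5),
    ("u1", "z2", 0.25),
    ("u2", "z1", 1.0),
    ("u2", "z2", 0.5),
    ("u1", "z1", 0.25),
]

# Group by u and sum q, streaming the rows through a generator.
sum_by_u = grouped_sum((u, q) for u, z, q in rows)
print(sum_by_u)
```

Memory use here grows with the number of distinct groups, not with the 
number of input rows.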


Having said that, I will admit that I have not personally run it on very 
large datasets.

> PLSI/EM in pig based on hofmann's ACM 04 paper. 
> ------------------------------------------------
>
>                 Key: MAHOUT-106
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>         Environment: Pig/Hadoop 
>            Reporter: Prasen Mukherjee
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.2
>
>         Attachments: plsi_pig.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Based on the following paper by Hofmann: T. Hofmann, "Latent Semantic Models 
> for Collaborative Filtering", ACM Transactions on Information Systems, 2004, 
> vol. 22(1), pp. 89-115.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
