[ 
https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872881#action_12872881
 ] 

Sebastian Schelter commented on MAHOUT-106:
-------------------------------------------

I converted the Pig code attached here to plain Java M/R code, hoping to create 
a PLSI implementation for Mahout. I got the code working, but now I feel kind of 
stuck, and I hope that someone can give me advice or join in on this.

The main flaw of this approach is (as Julien already stated above) that the 
computation of Q* produces as many records as number of users * number of 
stories * number of values of z. All of them need to be written to disk, which 
makes this code unusable for anything beyond small data sets.

I took a look at Hofmann's paper, and it says that the offline complexity of 
this algorithm is O(kN), with N being the number of observed ratings, so I don't 
understand why we would have to look at *all* possible user-item pairs, as is 
done in the Pig code.
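
To make the gap concrete, here is a rough back-of-the-envelope sketch. It 
assumes that k in O(kN) is the number of values of z; the class name, method 
names and the sizes in main are made up for illustration and are not taken 
from the patch or from any real data set.

```java
/** Hypothetical helper: compares the number of Q* records in both variants. */
class QStarRecordCount {

    /** Dense variant: one record per (user, story, z) triple. */
    static long denseRecords(long users, long stories, long k) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(Math.multiplyExact(users, stories), k);
    }

    /** Sparse variant: one record per (observed rating, z) pair, i.e. k * N. */
    static long observedRecords(long ratings, long k) {
        return Math.multiplyExact(ratings, k);
    }

    public static void main(String[] args) {
        // All sizes below are invented for illustration only.
        long users = 1_000_000L;
        long stories = 100_000L;
        long ratings = 10_000_000L;
        long k = 50L;

        System.out.println("dense records:    " + denseRecords(users, stories, k));
        System.out.println("observed records: " + observedRecords(ratings, k));
    }
}
```

With sparse rating data the second number is smaller by several orders of 
magnitude, which is the difference between the Pig code's behaviour and the 
O(kN) bound.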

One possible approach to solving this problem could be to compute Q* only for 
the observed ratings. I've already tried writing p(s|z)p(z|u) to disk only for 
the observed user-item pairs in the PszPzuReducer (by simply loading all 
ratings into memory, which would introduce a new constraint on this 
algorithm...). It seems to help, and it works with the sample data provided with 
the Pig code, yet I'm not sure whether it's mathematically correct to do this 
(so that part is commented out in the code).
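
As a minimal in-memory sketch of that idea (not the M/R code from the patch), 
the following computes Q*(z; u, s) only for observed pairs. It assumes the 
usual normalization Q*(z; u, s) = p(s|z)p(z|u) / sum over z' of 
p(s|z')p(z'|u). The class name, the int ids and the array layouts 
pSZ[z][story] and pZU[user][z] are assumptions of this sketch, and it does not 
settle whether the restriction is mathematically correct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: posterior over z restricted to observed pairs. */
class ObservedPairEStep {

    /** One observed rating; plain int ids are an assumption of this sketch. */
    static final class Pair {
        final int user;
        final int story;

        Pair(int user, int story) {
            this.user = user;
            this.story = story;
        }
    }

    /**
     * Q*(z; u, s) for a single pair. pSGivenZ[z] holds p(s|z) for the fixed
     * story s, pZGivenU[z] holds p(z|u) for the fixed user u.
     */
    static double[] posterior(double[] pSGivenZ, double[] pZGivenU) {
        int k = pSGivenZ.length;
        double[] q = new double[k];
        double norm = 0.0;
        for (int z = 0; z < k; z++) {
            q[z] = pSGivenZ[z] * pZGivenU[z];
            norm += q[z];
        }
        if (norm == 0.0) {
            // Degenerate case: fall back to a uniform posterior (a sketch choice).
            Arrays.fill(q, 1.0 / k);
            return q;
        }
        for (int z = 0; z < k; z++) {
            q[z] /= norm;
        }
        return q;
    }

    /** Produces k values per observed pair, so k * N values in total. */
    static List<double[]> eStep(List<Pair> observed, double[][] pSZ, double[][] pZU) {
        int k = pSZ.length;
        List<double[]> result = new ArrayList<>();
        for (Pair p : observed) {
            double[] pSGivenZ = new double[k];
            for (int z = 0; z < k; z++) {
                pSGivenZ[z] = pSZ[z][p.story];
            }
            result.add(posterior(pSGivenZ, pZU[p.user]));
        }
        return result;
    }
}
```

In a reducer the same posterior() call could be applied per observed pair, 
provided p(s|z) and p(z|u) for that pair arrive at the same place; how to 
arrange that join without loading all ratings into memory is left open here.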

I must also admit that I don't see exactly how closely this approach corresponds 
to the PLSI approach presented in "Google News Personalization: Scalable Online 
Collaborative Filtering" 
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf). 
Maybe that could be another source of ideas.

The patch is only a work in progress: it still uses the old Hadoop API, lacks 
proper documentation, and has only one unit test. It's more a proof of concept. 
If it turns out that this approach can work for larger data sets, I will invest 
more time to refactor and beautify the code, but currently I'm not sure whether 
it's really going to work.

> PLSI/EM in pig based on hofmann's ACM 04 paper. 
> ------------------------------------------------
>
>                 Key: MAHOUT-106
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-106
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>         Environment: Pig/Hadoop 
>            Reporter: Prasen Mukherjee
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.4
>
>         Attachments: plsi_pig.patch
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Based on the following paper by hofmann : T. Hofmann Latent Semantic Models 
> for Collaborative Filtering In ACM Transactions on Information Systems, 2004, 
>  vol 22(1), pp. 89-115.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
