[
https://issues.apache.org/jira/browse/MAHOUT-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872881#action_12872881
]
Sebastian Schelter commented on MAHOUT-106:
-------------------------------------------
I converted the Pig code attached here to plain Java M/R code, hoping to create
a PLSI implementation for Mahout. I got the code working, but now I feel kind of
stuck and I hope that someone can give me advice or join in on this.
The main flaw of this approach is (as Julien already stated above) that the
computation of Q* produces (number of users) * (number of stories) * (number of
values of z) records, all of which need to be written to disk, which makes this
code unusable in practice.
I took a look at Hofmann's paper, which says that the offline complexity of
this algorithm is O(kN), with N being the number of observed ratings, so I don't
understand why we would have to look at *all* possible user-item pairs as is
done in the Pig code.
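To make the gap concrete, here is a rough sketch that compares the two record
counts. It assumes that k in O(kN) is the number of values of z; the function
names and the sizes used are made up purely for illustration and are not taken
from the patch or from any real data set.

```python
def dense_record_count(num_users: int, num_stories: int, num_topics: int) -> int:
    """Records written when Q* is computed for every (user, story, z) triple."""
    return num_users * num_stories * num_topics


def sparse_record_count(num_observed_ratings: int, num_topics: int) -> int:
    """Records written when Q* is computed only for observed ratings (k * N)."""
    return num_observed_ratings * num_topics


# Hypothetical sizes, chosen only to illustrate the difference in growth.
users, stories, topics, ratings = 10_000, 5_000, 10, 200_000

dense = dense_record_count(users, stories, topics)
sparse = sparse_record_count(ratings, topics)
print("all pairs:", dense, "observed only:", sparse)
```

The first count grows with the product of users and stories, while the second
grows only with the number of ratings that actually exist.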
One possible approach to solving this problem could be to compute Q* only for
the observed ratings. I've already tried writing p(s|z)p(z|u) to disk only for
the observed user-item pairs in the PszPzuReducer (by simply loading all
ratings into memory, which would introduce a new constraint on this
algorithm...). It seems to help, and it works with the sample data provided with
the Pig code, yet I'm not sure whether it's mathematically correct to do this
(so that part is commented out in the code).
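For discussion, here is a minimal in-memory sketch of one EM iteration that
touches only the observed pairs. It is not the M/R code from the patch. It
assumes the usual co-occurrence updates, i.e. Q*(z;u,s) is p(s|z)p(z|u)
normalized over z, and the new p(s|z) and p(z|u) are normalized sums of Q*
over the observed pairs. The name em_step and the dict layout are invented
for this example.

```python
from collections import defaultdict


def em_step(observed, p_s_z, p_z_u, num_topics):
    """One EM iteration over observed (user, story) pairs only.

    observed: list of (user, story) pairs
    p_s_z:    dict mapping (story, z) to p(s|z)
    p_z_u:    dict mapping (z, user) to p(z|u)
    """
    sz_sum = defaultdict(float)  # sum of Q* per (story, z)
    z_sum = defaultdict(float)   # sum of Q* per z, normalizer for p(s|z)
    zu_sum = defaultdict(float)  # sum of Q* per (z, user)
    u_sum = defaultdict(float)   # sum of Q* per user, normalizer for p(z|u)

    for user, story in observed:
        # E-step: Q*(z; u, s) is proportional to p(s|z) * p(z|u).
        joint = [p_s_z[(story, z)] * p_z_u[(z, user)] for z in range(num_topics)]
        norm = sum(joint)
        if norm == 0.0:
            continue  # this pair has zero probability under the current model
        for z, value in enumerate(joint):
            q = value / norm
            sz_sum[(story, z)] += q
            z_sum[z] += q
            zu_sum[(z, user)] += q
            u_sum[user] += q

    # M-step: renormalize the accumulated Q* values.
    new_p_s_z = {(s, z): (v / z_sum[z] if z_sum[z] > 0.0 else 0.0)
                 for (s, z), v in sz_sum.items()}
    new_p_z_u = {(z, u): v / u_sum[u] for (z, u), v in zu_sum.items()}
    return new_p_s_z, new_p_z_u


# Tiny made-up example: two users, two stories, two values of z.
observed = [("u1", "s1"), ("u1", "s2"), ("u2", "s2")]
k = 2
p_s_z = {(s, z): 0.5 for s in ("s1", "s2") for z in range(k)}
p_z_u = {(z, u): 0.5 for u in ("u1", "u2") for z in range(k)}
p_s_z, p_z_u = em_step(observed, p_s_z, p_z_u, k)
```

Each observed pair contributes exactly k values of Q*, so the work per
iteration is proportional to k times the number of observed ratings. Whether
this matches what the Pig code intends is exactly the open question.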
I also must admit that I don't see exactly how closely this approach
corresponds to the PLSI approach presented in "Google News Personalization:
Scalable Online Collaborative Filtering"
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf).
Maybe that could be another source of ideas.
The patch is only a work in progress: it still uses the old Hadoop API, it lacks
proper documentation, and it has only one unit test. It's more a proof of concept.
If it turns out that this approach can work for larger data sets, I will invest
more time to refactor and beautify the code, but currently I'm not sure whether
it's really going to work.
> PLSI/EM in pig based on hofmann's ACM 04 paper.
> ------------------------------------------------
>
> Key: MAHOUT-106
> URL: https://issues.apache.org/jira/browse/MAHOUT-106
> Project: Mahout
> Issue Type: New Feature
> Components: Collaborative Filtering
> Environment: Pig/Hadoop
> Reporter: Prasen Mukherjee
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.4
>
> Attachments: plsi_pig.patch
>
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Based on the following paper by hofmann : T. Hofmann Latent Semantic Models
> for Collaborative Filtering In ACM Transactions on Information Systems, 2004,
> vol 22(1), pp. 89-115.