Hi all,
As Sebastian said, I've been looking into this, but being a Hadoop and Mahout
newbie I'm not quite sure I've fully understood everything.
Anyway, my observations are these (sorry for the formatting):
                   p(y|z) p(z|u)
Q*(z|u,y) = -------------------------
             sum_z p(y|z) p(z|u)
means that if u and y never co-occur, there is no value for the pair (u,y)
given z, i.e. it's undefined and should not be calculated at all.
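For what it's worth, the per-pair computation the formula describes could be sketched roughly as below. This is plain Java with illustrative names only (`QStarSketch`, `qStar`), not the actual Mahout code:

```java
import java.util.Arrays;

/** Rough sketch of the PLSI E-step for one (u,y) pair:
 *  Q*(z|u,y) = p(y|z) p(z|u) / sum_z' p(y|z') p(z'|u).
 *  Illustrative only -- names do not match Mahout's classes. */
public class QStarSketch {

  /** pYGivenZ[z] = p(y|z), pZGivenU[z] = p(z|u); returns Q*(z|u,y) per z. */
  static double[] qStar(double[] pYGivenZ, double[] pZGivenU) {
    int numZ = pYGivenZ.length;
    double[] q = new double[numZ];
    double norm = 0.0;
    for (int z = 0; z < numZ; z++) {
      q[z] = pYGivenZ[z] * pZGivenU[z];
      norm += q[z];
    }
    if (norm == 0.0) {
      return q; // no evidence for this (u,y) pair; leave as all zeros
    }
    for (int z = 0; z < numZ; z++) {
      q[z] /= norm; // normalize over the latent classes z
    }
    return q;
  }

  public static void main(String[] args) {
    double[] q = qStar(new double[] {0.2, 0.8}, new double[] {0.5, 0.5});
    System.out.println(Arrays.toString(q)); // normalized posterior over z
  }
}
```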
How this translates to MapReduce is something I'm not quite sure of; however,
having looked at the code, I think the QStarReducer could be extended to only
output values when there is a co-occurrence (i.e. only when f(u,y) = 1). I
guess this would require reading the co-occurrence data from another file?
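To make the idea concrete, here is a rough, non-Hadoop sketch of the filtering step I have in mind. Everything here is made up for illustration (`QStarFilterSketch`, `filterByCoOccurrence`, the `observedPairs` set standing in for the second input file), not the real QStarReducer API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only -- not Mahout's actual QStarReducer. */
public class QStarFilterSketch {

  /**
   * Keeps a Q*(z|u,y) value only if its (u,y) key actually co-occurs
   * in the training data, i.e. f(u,y) = 1. The observedPairs set
   * stands in for whatever second input (e.g. a side file of observed
   * pairs) would supply f(u,y) to the reducer.
   */
  static Map<String, Double> filterByCoOccurrence(
      Map<String, Double> qStarValues, Set<String> observedPairs) {
    Map<String, Double> out = new HashMap<>();
    for (Map.Entry<String, Double> e : qStarValues.entrySet()) {
      if (observedPairs.contains(e.getKey())) { // only f(u,y) = 1 pairs
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Double> q = Map.of("u1,y1", 0.7, "u1,y2", 0.3);
    Set<String> observed = Set.of("u1,y1"); // only (u1,y1) co-occurs
    System.out.println(filterByCoOccurrence(q, observed));
  }
}
```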
I'm happy to put quite a lot of effort into this; I would, however, need some
guidance, since right now I can't even get the code to pass the one test that
is bundled with it.
Alan
--
***************************************
M.Sc.(Eng.) Alan Said
Competence Center Information Retrieval & Machine Learning
Technische Universität Berlin / DAI-Lab
Sekr. TEL 14 Ernst-Reuter-Platz 7
10587 Berlin / Germany
Phone: 0049 - 30 - 314 74072
Fax: 0049 - 30 - 314 74003
E-mail: [email protected]
http://www.dai-labor.de
***************************************
From: Sebastian Schelter [mailto:[email protected]]
Sent: Thursday, November 25, 2010 11:08 AM
To: Alan Said
Subject: Fwd: MAHOUT-106
-------- Original Message --------
Subject: MAHOUT-106
Date: Thu, 25 Nov 2010 11:41:25 +0100
From: Sebastian Schelter <[email protected]>
Reply-To: [email protected]
To: Mahout Dev List <[email protected]>
I'm having an interesting Twitter conversation with Alan about MAHOUT-106 that
is better moved here.
Alan is currently looking at the port of the Pig code and asked why it's so bad
to write #users * #items * z values, which I guess refers to my JIRA comment at
https://issues.apache.org/jira/browse/MAHOUT-106?focusedCommentId=12872881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12872881
It's bad because in the Pig code (and the Java port of it) this is not done for
the known entries of the matrix only (thus exploiting its sparsity) but for *all*
possible entries. That won't scale and is IMHO an incorrect interpretation of
the algorithm, as Hofmann's paper states that the algorithm's complexity is O(zN),
with N being the number of observed ratings.
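To illustrate the difference, here's a toy sketch (all names made up) contrasting the O(z * N) work of an E-step over the observed ratings only with the O(z * #users * #items) work of one over every possible pair:

```java
import java.util.List;

/** Toy illustration of the complexity argument; names are made up. */
public class EStepCost {

  record Observation(int user, int item) {}

  /** Work for an E-step touching only the N observed ratings: z * N. */
  static long sparseOps(List<Observation> observed, int numZ) {
    return (long) observed.size() * numZ;
  }

  /** Work for an E-step over every possible (user, item) pair. */
  static long denseOps(int numUsers, int numItems, int numZ) {
    return (long) numUsers * numItems * numZ;
  }

  public static void main(String[] args) {
    List<Observation> obs =
        List.of(new Observation(0, 1), new Observation(2, 3));
    System.out.println(sparseOps(obs, 10));       // 2 observations * 10 classes = 20
    System.out.println(denseOps(1000, 1000, 10)); // 10000000 terms regardless of N
  }
}
```

With realistic user and item counts the dense variant is orders of magnitude more work than the number of observed ratings warrants, which is the scalability problem described above.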
Alan also asked for a more commented version of the code (there is none,
unfortunately), but I think a lot of the code was written following the
description of PLSI in "Google News Personalization: Scalable Online
Collaborative Filtering" (
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf
)
--sebastian