Hi all,
As Sebastian said, I've been looking into this, but being a Hadoop and Mahout
newbie I'm not quite sure I've fully understood everything.
Anyway, my observations are these (sorry for the formatting):
                   p(y|z) p(z|u)
Q*(z|u,y) = -------------------------
             sum_z p(y|z) p(z|u)
means that if u and y never co-occur, there is no value for the pair (u,y)
given z, i.e. it's undefined and should not be calculated at all.
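For what it's worth, the per-pair computation the formula describes could be sketched roughly as below. This is plain Java with illustrative names only (`QStarSketch`, `qStar`), not the actual Mahout code:

```java
import java.util.Arrays;

/** Rough sketch of the PLSI E-step for one (u,y) pair:
 *  Q*(z|u,y) = p(y|z) p(z|u) / sum_z' p(y|z') p(z'|u).
 *  Illustrative only -- names do not match Mahout's classes. */
public class QStarSketch {

  /** pYGivenZ[z] = p(y|z), pZGivenU[z] = p(z|u); returns Q*(z|u,y) per z. */
  static double[] qStar(double[] pYGivenZ, double[] pZGivenU) {
    int numZ = pYGivenZ.length;
    double[] q = new double[numZ];
    double norm = 0.0;
    for (int z = 0; z < numZ; z++) {
      q[z] = pYGivenZ[z] * pZGivenU[z];
      norm += q[z];
    }
    if (norm == 0.0) {
      return q; // no evidence for this (u,y) pair; leave as all zeros
    }
    for (int z = 0; z < numZ; z++) {
      q[z] /= norm; // normalize over the latent classes z
    }
    return q;
  }

  public static void main(String[] args) {
    double[] q = qStar(new double[] {0.2, 0.8}, new double[] {0.5, 0.5});
    System.out.println(Arrays.toString(q)); // normalized posterior over z
  }
}
```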
How this translates to MapReduce is something I'm not quite sure of; however,
having looked at the code, I think the QStarReducer could be extended to only
output values when there is a co-occurrence (i.e. only when f(u,y) = 1). I
guess this would require reading the co-occurrence data from another file?
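To make the idea concrete, here is a rough, non-Hadoop sketch of the filtering step I have in mind. Everything here is made up for illustration (`QStarFilterSketch`, `filterByCoOccurrence`, the `observedPairs` set standing in for the second input file), not the real QStarReducer API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only -- not Mahout's actual QStarReducer. */
public class QStarFilterSketch {

  /**
   * Keeps a Q*(z|u,y) value only if its (u,y) key actually co-occurs
   * in the training data, i.e. f(u,y) = 1. The observedPairs set
   * stands in for whatever second input (e.g. a side file of observed
   * pairs) would supply f(u,y) to the reducer.
   */
  static Map<String, Double> filterByCoOccurrence(
      Map<String, Double> qStarValues, Set<String> observedPairs) {
    Map<String, Double> out = new HashMap<>();
    for (Map.Entry<String, Double> e : qStarValues.entrySet()) {
      if (observedPairs.contains(e.getKey())) { // only f(u,y) = 1 pairs
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Double> q = Map.of("u1,y1", 0.7, "u1,y2", 0.3);
    Set<String> observed = Set.of("u1,y1"); // only (u1,y1) co-occurs
    System.out.println(filterByCoOccurrence(q, observed));
  }
}
```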
I'm happy to put quite a lot of effort into this; I would, however, need some
guidance, since right now I can't even get the code to pass the one test that
is bundled with it.
Alan
--
***************************************
M.Sc.(Eng.) Alan Said
Competence Center Information Retrieval & Machine Learning
Technische Universität Berlin / DAI-Lab
Sekr. TEL 14 Ernst-Reuter-Platz 7
10587 Berlin / Germany
Phone: 0049 - 30 - 314 74072
Fax: 0049 - 30 - 314 74003
E-mail: [email protected]
http://www.dai-labor.de
***************************************
From: Sebastian Schelter [mailto:[email protected]]
Sent: Thursday, November 25, 2010 11:08 AM
To: Alan Said
Subject: Fwd: MAHOUT-106
-------- Original Message --------
Subject: MAHOUT-106
Date: Thu, 25 Nov 2010 11:41:25 +0100
From: Sebastian Schelter <[email protected]>
Reply-To: [email protected]
To: Mahout Dev List <[email protected]>
I'm having an interesting Twitter conversation with Alan about MAHOUT-106 that
is better moved here.
Alan is currently looking at the port of the Pig code and asked why it's so bad
to write #users * #items * z values, which I guess refers to my JIRA comment at
https://issues.apache.org/jira/browse/MAHOUT-106?focusedCommentId=12872881&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12872881
It's bad because in the Pig code (and the Java port of it) this is not done for
the known entries of the matrix only (thus exploiting its sparsity) but for *all*
possible entries. That won't scale and is IMHO an incorrect interpretation of
the algorithm, as Hofmann's paper states that the algorithm's complexity is O(zN),
with N being the number of observed ratings.
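To illustrate the difference, here's a toy sketch (all names made up) contrasting the O(z * N) work of an E-step over the observed ratings only with the O(z * #users * #items) work of one over every possible pair:

```java
import java.util.List;

/** Toy illustration of the complexity argument; names are made up. */
public class EStepCost {

  record Observation(int user, int item) {}

  /** Work for an E-step touching only the N observed ratings: z * N. */
  static long sparseOps(List<Observation> observed, int numZ) {
    return (long) observed.size() * numZ;
  }

  /** Work for an E-step over every possible (user, item) pair. */
  static long denseOps(int numUsers, int numItems, int numZ) {
    return (long) numUsers * numItems * numZ;
  }

  public static void main(String[] args) {
    List<Observation> obs =
        List.of(new Observation(0, 1), new Observation(2, 3));
    System.out.println(sparseOps(obs, 10));       // 2 observations * 10 classes = 20
    System.out.println(denseOps(1000, 1000, 10)); // 10000000 terms regardless of N
  }
}
```

With realistic user and item counts the dense variant is orders of magnitude more work than the number of observed ratings warrants, which is the scalability problem described above.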
Alan also asked for a more commented version of the code (there is none,
unfortunately), but I think a lot of the code was written following the
description of PLSI in "Google News Personalization: Scalable Online
Collaborative Filtering" (
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf
)
--sebastian