[
https://issues.apache.org/jira/browse/MAHOUT-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884574#action_12884574
]
Sebastian Schelter commented on MAHOUT-420:
-------------------------------------------
The latest patch should work with the head now.
{quote}
Are there any chances to reduce the number of unique writable objects we're
creating? There is some need to specialize and create custom objects for
performance though I do see there are starting to be lots of objects that hold
one or two primitives and I'm keen to reuse classes if reasonable
{quote}
This is certainly desirable, yet it seems very difficult to me, especially when
we use features like secondary sort, which requires very specialized objects.
If you see a good starting point for which objects to generalize, I'd be ready
to put some work into that.
{quote}
findDeclaredField() and setField() - yeah I see what you're doing though it
seems a little fragile to dig inside an object and change its fields. They are
just tests, so maybe it's OK, but are there alternatives? Even for tests, if
it's private, I think it's not testable myself.
{quote}
I can see your point here. In an ideal world you would want to write unit tests
that know nothing about the actual implementation of the class under test and
treat it like a "black box", giving it some input and then checking the output.
In a lot of cases, however, you have to get the class into a certain state
before it can be tested. From my experience the best way to achieve this is to
make the class ready for dependency injection so it can be configured from the
outside, yet I don't think this really fits well for MapReduce code. So I
thought the easiest way to get control over a class's state for testing was to
directly set the private fields. This is very unobtrusive because the code does
not need to be changed just for testing purposes, yet it has the drawback of
binding the tests directly to the implementation of the classes under test. If
we want to avoid that, we'd have to refactor the code a bit to be more
testable, I think.
A rather complex example of this is the AggregateAndRecommendReducer, which
fills an OpenIntLongHashMap from a SequenceFile during setup. In the test for
that class I did not want to create a SequenceFile on disk and have the reducer
read it, because that would make the test code unreadable, and instead of only
testing one method (the reduce() method) I would also implicitly be testing the
setup method. So I thought the easiest and most understandable approach would
be to create an OpenIntLongHashMap and directly assign it to the private field.
Another solution might be to introduce a package-private setter method that the
test code could call.
I can try to refactor the code to avoid the setField() calls, if you want.
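For context, a setField() helper of the kind discussed here is essentially a
thin reflection wrapper. A minimal sketch (the class name and exact behavior
are illustrative, not the actual Mahout test utility):

```java
import java.lang.reflect.Field;

// Sketch of a reflection-based helper for assigning private fields in
// tests; walks up the class hierarchy until the field is found.
public final class FieldSetter {

  private FieldSetter() {}

  public static void setField(Object target, String fieldName, Object value)
      throws NoSuchFieldException, IllegalAccessException {
    Class<?> clazz = target.getClass();
    while (clazz != null) {
      try {
        Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        field.set(target, value);
        return;
      } catch (NoSuchFieldException e) {
        clazz = clazz.getSuperclass(); // keep searching in superclasses
      }
    }
    throw new NoSuchFieldException(fieldName);
  }
}
```

The trade-off is exactly the one described above: the test stays short and the
production code untouched, but the test now depends on a private field name.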
{quote}
Likewise I don't mind adding more utility classes per se but I prefer to avoid
utils/helper classes if the methods can be reasonably attached to another
implementation. I haven't looked hard at it, maybe these are necessary, just
noting one concern.
{quote}
I'm not too fond of utility classes either, especially because they are usually
called statically. Yet what I dislike more is code duplication, and I have
tried to move only those methods into utility classes that are called from at
least two different classes. One example of why this is crucial is the parsing
of lines from preference text files: I saw that in some places only a comma is
allowed as the delimiter, while other classes also allow a tab. That's a
typical example of what happens when code is duplicated and new functionality
is introduced, so I thought it would be far better to move this functionality
into a utility class so there is exactly one place in the code where the
delimiter is specified.
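As a sketch of the kind of consolidation meant here (class and method names
are made up for illustration, not taken from the patch), a single shared
parser would fix the delimiter in one place:

```java
import java.util.regex.Pattern;

// Illustrative sketch of a shared parser for preference lines of the
// form "userID,itemID,preference", where the delimiter may be either a
// comma or a tab -- specified in exactly one place.
public final class PreferenceLineParser {

  private static final Pattern DELIMITER = Pattern.compile("[,\t]");

  private PreferenceLineParser() {}

  public static String[] parse(String line) {
    String[] tokens = DELIMITER.split(line);
    if (tokens.length < 2) {
      throw new IllegalArgumentException("Badly formed line: " + line);
    }
    return tokens;
  }
}
```

Every mapper that reads preference files would then call this one method
instead of splitting the line itself.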
{quote}
Does this change behavior of the recommender job or is it the same initial
input and final output?
{quote}
Everything stays the same; the only difference is that you have to tell the job
which similarity measure to use.
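For reference, the normalized prediction formula from the issue description
below can be sketched in a few lines (illustrative code, not taken from the
patch; the two arrays hold similarity(i,n) and rating(u,n) for the items n
similar to i):

```java
// Sketch of the item-based prediction from Sarwar et al.:
// Prediction(u,i) = sum_n similarity(i,n) * rating(u,n)
//                   / sum_n |similarity(i,n)|
public final class ItemBasedPrediction {

  private ItemBasedPrediction() {}

  public static double predict(double[] similarities, double[] ratings) {
    double numerator = 0.0;
    double denominator = 0.0;
    for (int n = 0; n < similarities.length; n++) {
      numerator += similarities[n] * ratings[n];
      denominator += Math.abs(similarities[n]);
    }
    return numerator / denominator;
  }
}
```

With cooccurrence counts as the "similarities" and no denominator, this
reduces to what the distributed recommender computes today.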
> Improving the distributed item-based recommender
> ------------------------------------------------
>
> Key: MAHOUT-420
> URL: https://issues.apache.org/jira/browse/MAHOUT-420
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Attachments: MAHOUT-420-2.patch, MAHOUT-420.patch
>
>
> A summary of the discussion on the mailing list:
> Extend the distributed item-based recommender from using only simple
> cooccurrence counts to using the standard computations of an item-based
> recommender as defined in Sarwar et al "Item-Based Collaborative Filtering
> Recommendation Algorithms"
> (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9927&rep=rep1&type=pdf).
> What the distributed recommender generally does is compute prediction values
> for all users for all items those users have not rated yet. The computation
> is done in the following way:
> u = a user
> i = an item not yet rated by u
> N = all items cooccurring with i
> Prediction(u,i) = sum(all n from N: cooccurrences(i,n) * rating(u,n))
> The formula used in the paper which is used by
> GenericItemBasedRecommender.doEstimatePreference(...) too, looks very similar
> to the one above:
> u = a user
> i = an item not yet rated by u
> N = all items similar to i (where similarity is usually computed by a
> pairwise comparison of the item vectors of the user-item matrix)
> Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) / sum(all
> n from N: abs(similarity(i,n)))
> There are only 2 differences:
> a) instead of the cooccurrence count, certain similarity measures like
> pearson or cosine can be used
> b) the resulting value is normalized by the sum of the similarities
> To overcome difference a) we would only need to replace the part that
> computes the cooccurrence matrix with the code from ItemSimilarityJob or the
> code introduced in MAHOUT-418, then we could compute arbitrary similarity
> matrices and use them in the same way the cooccurrence matrix is currently
> used. We just need to separate the steps up to creating the co-occurrence
> matrix from the rest, which is simple since they're already different MR jobs.
> Regarding difference b) from a first look at the implementation I think it
> should be possible to transfer the necessary similarity matrix entries from
> the PartialMultiplyMapper to the AggregateAndRecommendReducer to be able to
> compute the normalization value in the denominator of the formula. This will
> take a little work, yes, but is still straightforward. It can be in the
> "common" part of the process, done after the similarity matrix is generated.
> I think work on this issue should wait until MAHOUT-418 is resolved as the
> implementation here depends on how the pairwise similarities will be computed
> in the future.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.