[
https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100265#comment-13100265
]
Sebastian Schelter commented on MAHOUT-767:
-------------------------------------------
A summary of my work so far (a new patch is coming):
We should only support algebraic similarity measures, which allows us to use a
combiner in the most crucial phase. Furthermore, we will use the stripes pattern
for in-mapper combination of cooccurrences to avoid emitting lots of
cooccurrence pair objects.
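
To illustrate the idea, here is a minimal sketch in Python rather than Hadoop
code. It is not taken from the patch. It assumes the input arrives column-wise
(one dict of row id to value per column) and uses cosine as the example of an
algebraic measure. The function names are hypothetical. Each mapper
accumulates one "stripe" of partial dot products per row instead of emitting a
pair object per cooccurrence, and because the partial sums simply add up, the
same merge step can serve as combiner and reducer.

```python
from collections import defaultdict
from math import sqrt


def map_stripes(columns):
    """In-mapper combining: build one stripe per row, keyed by the other row,
    holding the partial dot product seen by this mapper."""
    stripes = defaultdict(lambda: defaultdict(float))
    for column in columns:                      # column: {row_id: value}
        entries = sorted(column.items())
        for i, (row_a, val_a) in enumerate(entries):
            for row_b, val_b in entries[i + 1:]:
                stripes[row_a][row_b] += val_a * val_b
    return {row: dict(stripe) for row, stripe in stripes.items()}


def merge_stripes(left, right):
    """Combiner/reducer step: element-wise sum of two stripes of the same row."""
    merged = dict(left)
    for other, partial in right.items():
        merged[other] = merged.get(other, 0.0) + partial
    return merged


def squared_norms(columns):
    """Sum of squared values per row; also a plain sum, so it combines too."""
    norms = defaultdict(float)
    for column in columns:
        for row, value in column.items():
            norms[row] += value * value
    return dict(norms)


def cosine_from_stripe(row, stripe, sq_norms):
    """Final step: turn the summed dot products into cosine similarities."""
    return {
        other: dot / (sqrt(sq_norms[row]) * sqrt(sq_norms[other]))
        for other, dot in stripe.items()
    }


if __name__ == "__main__":
    # Two "mappers", each seeing a different split of the columns.
    split_a = [{"a": 1.0, "b": 2.0}]
    split_b = [{"a": 2.0, "b": 1.0, "c": 1.0}, {"b": 1.0, "c": 1.0}]
    stripes_a, stripes_b = map_stripes(split_a), map_stripes(split_b)
    stripe = merge_stripes(stripes_a.get("a", {}), stripes_b.get("a", {}))
    print(cosine_from_stripe("a", stripe, squared_norms(split_a + split_b)))
```

The point of the sketch is only that nothing similarity-specific has to wait
for the reducer except the final division. A measure that cannot be expressed
through such summable partial results would not fit this scheme.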
This issue also touches ItemSimilarityJob and RecommenderJob as they use
RowSimilarityJob internally. We will introduce a new job responsible for
preparing the input data for these jobs.
As the numbers of ratings per user and ratings per item usually follow
power-law distributions, appropriate downsampling is crucial for the
performance of these jobs: their runtime is dominated by the user with the
largest number of interactions. We should remove the old
"maxCooccurrencesPerItem" heuristic, as its result depends on the number of
mappers that are run and on the ordering of the input data. A simple random
downsampling of users having a number of ratings above a threshold should work
better.
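
A rough sketch of what such a downsampling could look like, again in Python
and not taken from the patch. It reads "downsampling of users" as capping the
number of ratings kept for each heavy user, which is an assumption; the names
downsample_users, max_per_user and pair_count are hypothetical. The pair count
is included because a user with n interactions contributes n * (n - 1) / 2
cooccurrence pairs, which is why a few heavy users dominate the runtime.

```python
import random


def pair_count(n):
    """Number of cooccurrence pairs produced by a user with n interactions."""
    return n * (n - 1) // 2


def downsample_users(interactions, max_per_user, seed=None):
    """Keep at most max_per_user randomly chosen items per user.

    interactions: {user_id: list of item ids}
    Users at or below the threshold are passed through unchanged.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sampled = {}
    for user, items in interactions.items():
        if len(items) > max_per_user:
            sampled[user] = rng.sample(items, max_per_user)
        else:
            sampled[user] = list(items)
    return sampled


if __name__ == "__main__":
    data = {"heavy": list(range(1000)), "light": [1, 2, 3]}
    capped = downsample_users(data, max_per_user=50, seed=42)
    for user in data:
        print(user, pair_count(len(data[user])), pair_count(len(capped[user])))
```

Unlike a per-mapper cap, the outcome here depends only on the threshold and
the random seed, not on how the input is split or ordered.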
> Improve RowSimilarityJob performance
> ------------------------------------
>
> Key: MAHOUT-767
> URL: https://issues.apache.org/jira/browse/MAHOUT-767
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.5
> Reporter: Grant Ingersoll
> Assignee: Sebastian Schelter
> Fix For: 0.6
>
> Attachments: MAHOUT-767.patch
>
>
> (See
> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
> for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity
> metric until the reduce phase, while emitting many Cooccurrence objects. For
> similarity metrics that are algebraic
> (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be
> able to do much of the computation during the Mapper part of this phase and
> also take advantage of a Combiner.
> We should use a marker interface to know whether a similarity metric is
> algebraic and then make use of an appropriate Mapper implementation,
> otherwise we can fall back on our existing implementation.