[
https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897799#action_12897799
]
Sebastian Schelter commented on MAHOUT-467:
-------------------------------------------
There is no point in the algorithm before the SimilarityReducer is invoked at
which all cooccurrences for a pair of rows are seen together, so loading them
into a list or array there would also require iterating over them.
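
To illustrate the point, here is a minimal sketch. The class and helper names
are hypothetical and not part of Mahout; it only assumes the values arrive as a
single-pass iterator, as they do in a reducer. It shows that copying the
values into a list pulls every element once, just as counting them does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not Mahout code.
public class CountVersusMaterialize {

    /** Iterator wrapper that records how many elements were pulled from it. */
    static final class CountingIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        int steps = 0;

        CountingIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public T next() {
            steps++;
            return delegate.next();
        }
    }

    /** Counts elements by walking the iterator, like countElements in the issue. */
    static int countElements(Iterator<?> iterator) {
        int count = 0;
        while (iterator.hasNext()) {
            iterator.next();
            count++;
        }
        return count;
    }

    /** Copies a single-pass iterator into a list; this also visits every element. */
    static <T> List<T> materialize(Iterator<T> iterator) {
        List<T> list = new ArrayList<>();
        while (iterator.hasNext()) {
            list.add(iterator.next());
        }
        return list;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5);

        CountingIterator<Integer> forCount = new CountingIterator<>(source.iterator());
        int count = countElements(forCount);

        CountingIterator<Integer> forList = new CountingIterator<>(source.iterator());
        int size = materialize(forList).size();

        // Both approaches pull every element from the iterator exactly once.
        System.out.println(count + " elements, " + forCount.steps + " steps when counting");
        System.out.println(size + " elements, " + forList.steps + " steps when copying to a list");
    }
}
```

Calling size() on the resulting list is cheap, but only after the full pass
needed to build it, and the list additionally holds every element in memory.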
> Change Iterable<Cooccurrence> in
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer
> to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-467
> URL: https://issues.apache.org/jira/browse/MAHOUT-467
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Hui Wen Han
> Fix For: 0.4
>
>
> In class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called repeatedly, but it performs poorly.
> If the iterator has a million elements, we have to iterate a million times
> just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity
>   public class DistributedCooccurrenceVectorSimilarity
>       extends AbstractDistributedVectorSimilarity {
>     @Override
>     protected double doComputeResult(int rowA, int rowB,
>         Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>         double weightOfVectorB, int numberOfColumns) {
>       return countElements(cooccurrences);
>     }
>   }
> One item may be liked by many people; in our system, one item may be liked
> by a hundred thousand users.
> Here doComputeResult just returns the number of elements in cooccurrences,
> but it has to iterate a hundred thousand times to do so.
> If we used a List or array type, we could get the result in one call, because
> the size is already known once the system has constructed the List or array.
> 2) DistributedLoglikelihoodVectorSimilarity
> 3) DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity;
> it took 4.5 hours to run
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer