[jira] Commented: (MAHOUT-467) Change Iterable in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance

Sebastian Schelter (JIRA) Thu, 12 Aug 2010 08:35:45 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897799#action_12897799
 ]


Sebastian Schelter commented on MAHOUT-467:
-------------------------------------------

There is no point in the algorithm before the SimilarityReducer is invoked 
at which all cooccurrences for a pair of rows are seen together, so loading 
them into memory there would also require iterating over them.
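
To illustrate the point above, here is a minimal sketch in plain Java, not 
Mahout code. The names IterableCountSketch, StreamedValues and materialize 
are hypothetical; StreamedValues only stands in for reducer input that 
arrives one value at a time and records how many values were pulled. Under 
that assumption, copying the values into a List so that size() can be 
called costs the same full pass as counting them directly.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class IterableCountSketch {

  /** Counts elements by walking the iterator once, like countElements in the issue. */
  public static int countElements(Iterator<?> iterator) {
    int count = 0;
    while (iterator.hasNext()) {
      iterator.next();
      count++;
    }
    return count;
  }

  /** Hypothetical alternative: copy everything into a List, then use size(). */
  public static <T> List<T> materialize(Iterable<T> values) {
    List<T> list = new ArrayList<>();
    for (T value : values) { // building the list is itself a full pass
      list.add(value);
    }
    return list;
  }

  /** Hypothetical stand-in for streamed input: yields 0..n-1 and counts pulls. */
  public static final class StreamedValues implements Iterable<Integer> {
    private final int n;
    public int pulled = 0;

    public StreamedValues(int n) {
      this.n = n;
    }

    @Override
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        private int next = 0;

        @Override
        public boolean hasNext() {
          return next < n;
        }

        @Override
        public Integer next() {
          if (next >= n) {
            throw new NoSuchElementException();
          }
          pulled++;
          return next++;
        }
      };
    }
  }

  public static void main(String[] args) {
    StreamedValues direct = new StreamedValues(1000);
    int count = countElements(direct.iterator());

    StreamedValues copied = new StreamedValues(1000);
    int size = materialize(copied).size();

    // Both approaches pull every element exactly once.
    System.out.println(count + " counted, " + direct.pulled + " pulled");
    System.out.println(size + " in list, " + copied.pulled + " pulled");
  }
}
```

In this sketch the list variant does not save any iteration; it only adds 
the memory needed to hold all values at once.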


> Change Iterable<Cooccurrence> in  
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer  
> to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-467
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Hui Wen Han
>             Fix For: 0.4
>
>
> In class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called repeatedly, but it performs poorly.
> If the iterator has a million elements, we have to iterate a million times 
> just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity
> public class DistributedCooccurrenceVectorSimilarity
>     extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB,
>       Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>       double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system, one item may be liked 
> by a hundred thousand persons.
> Here doComputeResult just returns the count of elements in 
> cooccurrences, but it has to iterate a hundred thousand times.
> If we use a List or array type, we can get the result in one call, because 
> the size is already set when the system constructs the List or array.
> 2)  DistributedLoglikelihoodVectorSimilarity
> 3)  DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity; 
> it took 4.5 hours to run 
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer

