[ https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897771#action_12897771 ]

Sebastian Schelter commented on MAHOUT-467:
-------------------------------------------

For the millions of cooccurrences to be modeled as a list or an array, we 
would have to load them all into memory simultaneously.
We can't do that, because the scalability of the whole job would then be limited 
by the amount of RAM available on the worker machines. 
IIRC Mahout's goal is that its distributed jobs run in O(n) time with respect to 
the input data while needing only O(1) memory.
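
As a rough illustration (a minimal sketch in plain Java, not the actual Mahout 
code), the difference between the two approaches looks like this:

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  public class CountingSketch {

    // Streaming count, as countElements() does today: O(n) time, O(1) memory,
    // because only one element is held at a time.
    static int countStreaming(Iterator<?> iterator) {
      int count = 0;
      while (iterator.hasNext()) {
        iterator.next();
        count++;
      }
      return count;
    }

    // Materialized count: size() itself is O(1), but building the list needs
    // O(n) memory, so a reducer would have to hold every cooccurrence of a row
    // in RAM at once.
    static int countMaterialized(Iterator<?> iterator) {
      List<Object> all = new ArrayList<Object>();
      while (iterator.hasNext()) {
        all.add(iterator.next());
      }
      return all.size();
    }
  }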


> Change Iterable<Cooccurrence> in  
> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer  
> to list or array to improve the performance
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-467
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Hui Wen Han
>             Fix For: 0.4
>
>
> In the class AbstractDistributedVectorSimilarity:
>   protected int countElements(Iterator<?> iterator) {
>     int count = 0;
>     while (iterator.hasNext()) {
>       count++;
>       iterator.next();
>     }
>     return count;
>   }
> The method countElements is called over and over, but it performs badly:
> if the iterator has a million elements, we have to iterate a million times
> just to get the count.
> This method is used in many places:
> 1) DistributedCooccurrenceVectorSimilarity
> public class DistributedCooccurrenceVectorSimilarity extends AbstractDistributedVectorSimilarity {
>   @Override
>   protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences,
>       double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
>     return countElements(cooccurrences);
>   }
> }
> One item may be liked by many people; in our system a single item may be liked
> by hundreds of thousands of persons.
> Here doComputeResult just returns the count of elements in cooccurrences, but it
> has to iterate hundreds of thousands of times to do so.
> If we used a List or an array, we could get the count in one call, because the
> size is already known when the List or array is constructed.
> 2)  DistributedLoglikelihoodVectorSimilarity
> 3)  DistributedTanimotoCoefficientVectorSimilarity
> I ran a test using DistributedCooccurrenceVectorSimilarity: it took 4.5 hours
> to run RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.
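
For reference, the change proposed in the quoted issue would amount to roughly the 
following sketch (the class name is hypothetical and the types are simplified; this 
is not the actual Mahout signature). As explained above, the catch is that the list 
would first have to be materialized in memory:

  import java.util.List;

  // Hypothetical sketch of the proposed List-based signature: the count becomes a
  // single size() call, but all cooccurrences of a row must be held in RAM at once.
  public class ListBasedSimilaritySketch {

    protected double doComputeResult(int rowA, int rowB, List<?> cooccurrences,
        double weightOfVectorA, double weightOfVectorB, int numberOfColumns) {
      return cooccurrences.size();
    }
  }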

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
