Re: [jira] Commented: (MAHOUT-467) Change Iterable in org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer to list or array to improve the performance

Sebastian Schelter Thu, 12 Aug 2010 21:54:03 -0700

+1 on closing this

On 13.08.2010 04:44, Ted Dunning (JIRA) wrote:
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898051#action_12898051
>  ] 
>
> Ted Dunning commented on MAHOUT-467:
> ------------------------------------
>
>
> That comment from Owen is essentially the same as what I and others have 
> said.  If you need to count, use integers and combiners.  Don't wait until 
> the reducer.
>
> In any case, it isn't a MAHOUT bug that map-reduce inherently doesn't allow 
> you to see all the data being reduced in one place.
>
> Anybody mind if I close this as NOT-A-BUG?
>
>
>
>   
>> Change Iterable<Cooccurrence> in  
>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer  
>> to list or array to improve the performance
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>                 Key: MAHOUT-467
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-467
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Collaborative Filtering
>>    Affects Versions: 0.4
>>            Reporter: Hui Wen Han
>>             Fix For: 0.4
>>
>>
>> In Class AbstractDistributedVectorSimilarity
>>   protected int countElements(Iterator<?> iterator) {
>>     int count = 0;
>>     while (iterator.hasNext()) {
>>       count++;
>>       iterator.next();
>>     }
>>     return count;
>>   }
>> The method countElements is called repeatedly, but it performs poorly.
>> If the iterator has a million elements, we have to iterate a million times
>> just to get its count.
>> This method is used in many places:
>> 1) DistributedCooccurrenceVectorSimilarity 
>> public class DistributedCooccurrenceVectorSimilarity extends 
>> AbstractDistributedVectorSimilarity {
>>   @Override
>>   protected double doComputeResult(int rowA, int rowB, 
>> Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>>       double weightOfVectorB, int numberOfColumns) {
>>     return countElements(cooccurrences);
>>   }
>> }
>> One item may be liked by many people; in our system, one item may be liked
>> by a hundred thousand people.
>> Here doComputeResult just returns the number of elements in
>> cooccurrences, but it has to iterate a hundred thousand times.
>> If we use a List or array type, we can get the result in one call, because
>> the size is already set when the system constructs the List or array.
>> 2)  DistributedLoglikelihoodVectorSimilarity
>> 3)  DistributedTanimotoCoefficientVectorSimilarity
>> I ran a test using DistributedCooccurrenceVectorSimilarity;
>> it took 4.5 hours to run
>> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.
>>     
>
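
For readers of the archive, the suggestion quoted above (count with integers and combiners rather than waiting for the reducer) can be sketched in plain Java. This is only an illustration of the idea, not Mahout or Hadoop code: the class name CombinerCountSketch, the methods combine and reduce, and the sample data are all hypothetical, and no Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, framework-free sketch of counting with partial sums.
public class CombinerCountSketch {

    // Combiner step: collapse the values emitted by one map task
    // (here, a run of 1s for a single key) into one partial count.
    public static long combine(Iterable<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Reducer step: the same summation, but it now sees only one
    // partial count per map task instead of one value per element.
    public static long reduce(Iterable<Long> partialCounts) {
        return combine(partialCounts);
    }

    public static void main(String[] args) {
        // Assumed sample data: three map tasks emitted 4, 2 and 3 ones
        // for the same key.
        List<List<Long>> mapOutputs = Arrays.asList(
                Arrays.asList(1L, 1L, 1L, 1L),
                Arrays.asList(1L, 1L),
                Arrays.asList(1L, 1L, 1L));

        List<Long> partials = new ArrayList<>();
        for (List<Long> output : mapOutputs) {
            partials.add(combine(output));
        }

        // The reducer iterates over 3 partial counts rather than 9 elements.
        System.out.println(reduce(partials)); // prints 9
    }
}
```

Because addition is associative and commutative, the same summation can serve as both the combiner and the reducer, so the reducer never needs to see every element in one place.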
