+1 on closing this
On 13.08.2010 04:44, Ted Dunning (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/MAHOUT-467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898051#action_12898051
> ]
>
> Ted Dunning commented on MAHOUT-467:
> ------------------------------------
>
>
> That comment from Owen is essentially the same as what I and others have
> said. If you need to count, use integers and combiners. Don't wait until
> the reducer.
>
> In any case, it isn't a MAHOUT bug that map-reduce inherently doesn't allow
> you to see all the data being reduced in one place.
>
> Anybody mind if I close this as NOT-A-BUG?
>
>
>
>
>> Change Iterable<Cooccurrence> in
>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob.SimilarityReducer
>> to list or array to improve the performance
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Key: MAHOUT-467
>> URL: https://issues.apache.org/jira/browse/MAHOUT-467
>> Project: Mahout
>> Issue Type: Improvement
>> Components: Collaborative Filtering
>> Affects Versions: 0.4
>> Reporter: Hui Wen Han
>> Fix For: 0.4
>>
>>
>> In class AbstractDistributedVectorSimilarity:
>>   protected int countElements(Iterator<?> iterator) {
>>     int count = 0;
>>     while (iterator.hasNext()) {
>>       count++;
>>       iterator.next();
>>     }
>>     return count;
>>   }
>> The method countElements is called repeatedly, but it performs poorly.
>> If the iterator has a million elements, we have to iterate a million
>> times just to get the element count.
>> This method is used in many places:
>> 1) DistributedCooccurrenceVectorSimilarity
>>   public class DistributedCooccurrenceVectorSimilarity extends
>>       AbstractDistributedVectorSimilarity {
>>     @Override
>>     protected double doComputeResult(int rowA, int rowB,
>>         Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
>>         double weightOfVectorB, int numberOfColumns) {
>>       return countElements(cooccurrences);
>>     }
>>   }
>> One item may be liked by many people; in our system, one item may be
>> liked by a hundred thousand people.
>> Here doComputeResult just returns the number of elements in
>> cooccurrences, but it has to iterate a hundred thousand times to get it.
>> If we used a List or array type, we could get the result in one call,
>> because the size is already set when the system constructs the List or
>> array.
>> 2) DistributedLoglikelihoodVectorSimilarity
>> 3) DistributedTanimotoCoefficientVectorSimilarity
>> I ran a test using DistributedCooccurrenceVectorSimilarity;
>> it took 4.5 hours to run
>> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer
>>
>
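
For the archives, here is a minimal sketch of the "use integers and combiners" suggestion quoted above. It is plain Java that simulates the data flow in memory, not Hadoop code: the class name CombinerCountSketch, the methods combine and reduce, and the "rowA:rowB" string keys are all hypothetical and are not taken from RowSimilarityJob. The idea is that each map task collapses its own output into one partial integer count per key, so the reducer only adds a few integers instead of walking every record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; names and key format are assumptions.
public class CombinerCountSketch {

    // Combiner step: collapse the keys emitted by ONE map task
    // into a partial count per key.
    static Map<String, Integer> combine(List<String> keysFromOneMapTask) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : keysFromOneMapTask) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer step: sum the partial counts. The reducer sees at most one
    // integer per key per map task, never the individual records.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two hypothetical map tasks, each emitting one key per cooccurrence.
        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(combine(Arrays.asList("1:2", "1:2", "1:3")));
        partials.add(combine(Arrays.asList("1:2", "1:3", "1:3")));

        Map<String, Integer> totals = reduce(partials);
        System.out.println("1:2 -> " + totals.get("1:2"));
        System.out.println("1:3 -> " + totals.get("1:3"));
    }
}
```

In a real job the combine step would be a combiner class registered on the job, and the keys would be whatever the mapper already emits; the sketch only shows why the reducer no longer needs to iterate over every cooccurrence to obtain a count.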