[ 
https://issues.apache.org/jira/browse/MAHOUT-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897934#action_12897934
 ] 

Sean Owen commented on MAHOUT-468:
----------------------------------

Yes, #4 has only a positive effect and should not be changed. There is 
no "allocation" of stack variables at runtime in Java. The alternative, saving 
the result of next() and then calling get() twice, is definitely slower.
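
As a rough illustration of the point about #4 (this is not code from Mahout), 
the sketch below computes the same Euclidean length two ways: once with the 
local declared inside the loop and once with it hoisted out. LoopLocalDemo and 
both method names are made up for this example, and a plain double[] stands in 
for Mahout's Vector. In both versions the slot for the local belongs to the 
method's stack frame, whose size is fixed at compile time, so the loop does no 
per-iteration allocation for it.

```java
public class LoopLocalDemo {

  // Same shape as the method quoted in the issue: the local 'value'
  // is declared inside the loop body.
  static double weightLocalInLoop(double[] values) {
    double length = 0.0;
    for (int i = 0; i < values.length; i++) {
      double value = values[i];
      length += value * value;
    }
    return Math.sqrt(length);
  }

  // Hypothetical "optimized" variant: the declaration is hoisted out of
  // the loop. The frame still reserves exactly one slot pair for 'value'.
  static double weightLocalHoisted(double[] values) {
    double length = 0.0;
    double value;
    for (int i = 0; i < values.length; i++) {
      value = values[i];
      length += value * value;
    }
    return Math.sqrt(length);
  }

  public static void main(String[] args) {
    double[] v = {3.0, 4.0};
    System.out.println(weightLocalInLoop(v));   // prints 5.0
    System.out.println(weightLocalHoisted(v));  // prints 5.0
  }
}
```

Compiling this and running javap -c on the class should show the two loop 
bodies using the same instructions, differing at most in which local slot 
they read and write.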

What's left as the concrete issue here? MAHOUT-467 is separate. #1 should be 
OK. The redundant info in #2 doesn't seem like a big sin.

> Performance of RowSimilarityJob is not good
> -------------------------------------------
>
>                 Key: MAHOUT-468
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-468
>             Project: Mahout
>          Issue Type: Test
>          Components: Collaborative Filtering
>    Affects Versions: 0.4
>            Reporter: Hui Wen Han
>             Fix For: 0.4
>
>
> I ran a test with the following data:
> Preference records: 680,194
> Distinct users: 23,246
> Distinct items: 437,569
> SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE
> maybePruneItemUserMatrixPath: 16.50M
> weights: 13.80M
> pairwiseSimilarity: 18.81G
> Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 
> 32 sec
> Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours
> I think the reasons may be the following:
> 1) We use SequenceFileOutputFormat, which means the job can only run n 
> mappers or reducers concurrently (n = number of Hadoop nodes).
> 2) We store redundant info.
> For example, the output of CooccurrencesMapper: 
> (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)
> 3) Some frequently used code; see 
> https://issues.apache.org/jira/browse/MAHOUT-467
> 4) Many local variables are allocated inside loops (needs confirmation).
> In class DistributedUncenteredZeroAssumingCosineVectorSimilarity:
>   @Override
>   public double weight(Vector v) {
>     double length = 0.0;
>     Iterator<Element> elemIterator = v.iterateNonZero();
>     while (elemIterator.hasNext()) {
>       double value = elemIterator.next().get();  //this one
>       length += value * value;
>     }
>     return Math.sqrt(length);
>   }
> 5) Maybe we need to control the size of the cooccurrences.
