Performance of RowSimilarityJob is not good
-------------------------------------------

                 Key: MAHOUT-468
                 URL: https://issues.apache.org/jira/browse/MAHOUT-468
             Project: Mahout
          Issue Type: Test
          Components: Collaborative Filtering
    Affects Versions: 0.4
            Reporter: Hui Wen Han
             Fix For: 0.4


I ran a test with the following data:

Preference records: 680,194
Distinct users: 23,246
Distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE

maybePruneItemUserMatrixPath: 16.50M
weights: 13.80M
pairwiseSimilarity: 18.81G
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: 32 sec
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: 4.30 hours
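
To put the second job's figures in perspective, here is a rough throughput calculation. It is only a sketch under two assumptions that the numbers above do not confirm: that "G" means binary gigabytes (1024 MB each), and that 4.30 hours is the wall-clock time of the whole job, so the result is an aggregate rate for the cluster. The class and method names are made up for illustration.

```java
final class ThroughputEstimate {

    // Converts a data volume in gigabytes and a duration in hours into MB/s.
    // Assumes binary units: 1 GB = 1024 MB.
    static double megabytesPerSecond(double gigabytes, double hours) {
        double megabytes = gigabytes * 1024.0;
        double seconds = hours * 3600.0;
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        // Figures taken from the test above: pairwiseSimilarity size and
        // CooccurrencesMapper-SimilarityReducer run time.
        double rate = megabytesPerSecond(18.81, 4.30);
        System.out.printf("%.2f MB/s%n", rate);
    }
}
```

Under those assumptions the job produced pairwiseSimilarity data at roughly 1.24 MB/s in total.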


I think the reasons may be the following:
1) We use SequenceFileOutputFormat, which means the job can only run n mappers
or reducers concurrently (n = number of Hadoop nodes).
2) We store redundant information.

For example, the output of CooccurrencesMapper:

(ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)

3) Some frequently used code could be optimized; see
https://issues.apache.org/jira/browse/MAHOUT-467

4) Many local variables are allocated inside loops (needs confirmation).

For example, in class DistributedUncenteredZeroAssumingCosineVectorSimilarity:

  @Override
  public double weight(Vector v) {
    double length = 0.0;
    Iterator<Element> elemIterator = v.iterateNonZero();
    while (elemIterator.hasNext()) {
      double value = elemIterator.next().get();  // this one: declared inside the loop
      length += value * value;
    }
    return Math.sqrt(length);
  }
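
One way to confirm point 4 is to time both forms side by side. The following is a minimal sketch, not Mahout code: it replaces Vector with a plain double[] to stay self-contained (an assumption), and the class and method names are hypothetical. Note that a double declared in a loop body is a primitive local, so it lives in a stack slot or register and is not a heap allocation. Any per-element allocation would have to come from elsewhere, for instance the iterator, which this sketch does not measure. The timing is crude and only meant as a first check.

```java
import java.util.Random;

final class LoopLocalCheck {

    // Same shape as weight(Vector) above: the temporary is declared inside the loop.
    static double weightDeclaredInside(double[] values) {
        double length = 0.0;
        for (int i = 0; i < values.length; i++) {
            double value = values[i];
            length += value * value;
        }
        return Math.sqrt(length);
    }

    // Variant with the temporary hoisted out of the loop.
    static double weightDeclaredOutside(double[] values) {
        double length = 0.0;
        double value;
        for (int i = 0; i < values.length; i++) {
            value = values[i];
            length += value * value;
        }
        return Math.sqrt(length);
    }

    public static void main(String[] args) {
        // Hypothetical input: one million pseudo-random values with a fixed seed.
        double[] values = new double[1_000_000];
        Random random = new Random(42L);
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextDouble();
        }

        // Repeat a few rounds so the JIT has a chance to compile both methods.
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            double inside = weightDeclaredInside(values);
            long t1 = System.nanoTime();
            double outside = weightDeclaredOutside(values);
            long t2 = System.nanoTime();
            System.out.printf("round %d: inside=%d us, outside=%d us, same result=%b%n",
                    round, (t1 - t0) / 1000, (t2 - t1) / 1000, inside == outside);
        }
    }
}
```

If the two timings are indistinguishable after the first rounds, the declaration inside the loop is probably not the bottleneck, and the other points are better candidates.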


