[jira] [Work logged] (MAHOUT-468) Performance of RowSimilarityJob is not good

ASF GitHub Bot (Jira) Thu, 14 Nov 2024 14:12:20 -0800


     [ 
https://issues.apache.org/jira/browse/MAHOUT-468?focusedWorklogId=943847&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-943847
 ]


ASF GitHub Bot logged work on MAHOUT-468:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Nov/24 22:10
            Start Date: 14/Nov/24 22:10
    Worklog Time Spent: 10m 
      Work Description: andrewmusselman merged PR #472:
URL: https://github.com/apache/mahout/pull/472




Issue Time Tracking
-------------------

    Worklog Id:     (was: 943847)
    Time Spent: 1.5h  (was: 1h 20m)

> Performance of RowSimilarityJob is not good
> -------------------------------------------
>
>                 Key: MAHOUT-468
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-468
>             Project: Mahout
>          Issue Type: Test
>    Affects Versions: 0.4
>            Reporter: Han Hui Wen 
>            Priority: Major
>         Attachments: 
> RowSimilarityJob-CooccurrencesMapper-SimilarityReducer.jpg
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I ran a test with the following data:
> Preference records: 680,194
> Distinct users: 23,246
> Distinct items: 437,569
> SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE
> maybePruneItemUserMatrixPath: 16.50M
> weights: 13.80M
> pairwiseSimilarity: 18.81G
> Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 sec
> Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours
> I think the reasons may be the following:
> 1) We use SequenceFileOutputFormat, which means the job can only be run by
> n mappers or reducers concurrently (n = number of Hadoop nodes).
> 2) We store redundant information. For example, the output of
> CooccurrencesMapper is:
> (ItemIndexA,similarity),(ItemIndexA,ItemIndexB,similarity)
> 3) Some frequently used code; see
> https://issues.apache.org/jira/browse/MAHOUT-467
> 4) Many local variables are allocated inside loops (needs confirmation),
> for example in class DistributedUncenteredZeroAssumingCosineVectorSimilarity:
>   @Override
>   public double weight(Vector v) {
>     double length = 0.0;
>     Iterator<Element> elemIterator = v.iterateNonZero();
>     while (elemIterator.hasNext()) {
>       double value = elemIterator.next().get();  // local variable declared inside the loop
>       length += value * value;
>     }
>     return Math.sqrt(length);
>   }
> 5) Maybe we need to control the size of the cooccurrences.
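
As an illustration of point 5, here is a minimal Java sketch of why the pairwise output can grow so large and how a cap would bound it. It assumes the cooccurrence step emits one record per pair of rows that share a column, so a column with n non-zero entries yields n(n-1)/2 pairs. The class CooccurrenceCapSketch and its methods pairCount and capEntries are hypothetical names used only for this example; they are not part of Mahout, and random sampling is just one possible way to limit a column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: shows how capping the number of
// entries per column bounds the number of emitted cooccurrence pairs.
class CooccurrenceCapSketch {

  // Number of unordered pairs produced by one column with n non-zero entries.
  static long pairCount(long n) {
    return n * (n - 1) / 2;
  }

  // Keeps at most maxEntries row indexes, chosen uniformly at random with a
  // partial Fisher-Yates shuffle. Returns the input list itself if it is
  // already within the cap.
  static List<Integer> capEntries(List<Integer> rowIndexes, int maxEntries, Random random) {
    if (rowIndexes.size() <= maxEntries) {
      return rowIndexes;
    }
    List<Integer> copy = new ArrayList<>(rowIndexes);
    for (int i = 0; i < maxEntries; i++) {
      // Swap position i with a random position in [i, size).
      int j = i + random.nextInt(copy.size() - i);
      Integer tmp = copy.get(i);
      copy.set(i, copy.get(j));
      copy.set(j, tmp);
    }
    return new ArrayList<>(copy.subList(0, maxEntries));
  }

  public static void main(String[] args) {
    // Assumed example: one column with 10,000 entries, capped at 500.
    List<Integer> column = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      column.add(i);
    }
    List<Integer> capped = capEntries(column, 500, new Random(42));
    System.out.println("pairs without cap: " + pairCount(column.size()));
    System.out.println("pairs with cap:    " + pairCount(capped.size()));
  }
}
```

In this made-up example the single column goes from 49,995,000 pairs to 124,750 pairs. Whether such a cap is acceptable depends on how much similarity quality is lost by sampling, which this sketch does not measure.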



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
