Bhaskar Devireddy created MAHOUT-1007:
-----------------------------------------

             Summary: Performance improvement in recommenditembased by splitting long records
                 Key: MAHOUT-1007
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1007
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
    Affects Versions: 0.6
            Reporter: Bhaskar Devireddy
            Assignee: Sean Owen
            Priority: Minor
             Fix For: 0.7


While running recommendations on the ASFEMail dataset using the example 
script provided with Mahout, we noticed that one of the map tasks in the 
unsymmetrify mapper job takes much longer to execute than the others.  
Profiling suggests that the cause is the number of elements in each record.  
The attached patch addresses this by splitting longer records into smaller 
ones, so that the data is distributed evenly among the unsymmetrify map tasks.

A new command line option, maxSimilarityReducerVectorSize, is introduced 
for RecommenderJob.  Tested with maxSimilarityReducerVectorSize=5000, the 
patch keeps the same functionality while speeding up the unsymmetrify mapper 
job by several times on x86 architectures and increasing CPU utilization.  By 
default the records are not split; setting maxSimilarityReducerVectorSize to a 
value greater than 0 enables splitting, which improves performance.
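
To illustrate the idea, here is a minimal sketch of record splitting in plain 
Java.  It is not the code from the attached patch: the class RecordSplitter 
and the method split are hypothetical names, and a record is modelled as a 
simple List rather than a Mahout vector.  The only behavior taken from the 
description above is that a limit of 0 or less leaves the record unsplit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: shows how one long record can be
// cut into bounded pieces so that no single map task receives all of it.
public class RecordSplitter {

    // Splits a record into consecutive chunks of at most maxSize elements.
    // maxSize <= 0 means "do not split", mirroring the default described above.
    public static <T> List<List<T>> split(List<T> record, int maxSize) {
        List<List<T>> chunks = new ArrayList<>();
        if (maxSize <= 0 || record.size() <= maxSize) {
            // Short records (or splitting disabled) pass through as one chunk.
            chunks.add(new ArrayList<>(record));
            return chunks;
        }
        for (int start = 0; start < record.size(); start += maxSize) {
            int end = Math.min(start + maxSize, record.size());
            chunks.add(new ArrayList<>(record.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> record = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            record.add(i);
        }
        // 12 elements with a limit of 5 give chunks of 5, 5 and 2 elements.
        System.out.println(split(record, 5));
        // A limit of 0 disables splitting and returns the record as one chunk.
        System.out.println(split(record, 0));
    }
}
```

In a MapReduce setting each chunk would be emitted as its own record, so the 
work for one oversized record can be spread over several map tasks instead of 
being handled by a single straggler.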


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
