[
https://issues.apache.org/jira/browse/MAHOUT-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898939#action_12898939
]
Han Hui Wen commented on MAHOUT-473:
-------------------------------------
1) First, mapred.tasktracker.map.tasks.maximum in mapred-site.xml is different
from -Dmapred.reduce.tasks on the command line (or mapred.reduce.tasks in
mapred-site.xml).
The maximum number of map tasks that will run on a tasktracker at one time is
controlled by the mapred.tasktracker.map.tasks.maximum property, which defaults
to two tasks. There is a corresponding property for reduce tasks,
mapred.tasktracker.reduce.tasks.maximum, which also defaults to two tasks.
mapred.reduce.tasks, by contrast, is a per-job setting configured in
mapred-site.xml; its default value is 1.
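For reference, here is how the three properties mentioned above would look in
mapred-site.xml (the values shown are the defaults described above; this is an
illustrative fragment, not a copy of any particular cluster's file):

```xml
<!-- mapred-site.xml: cluster-side defaults -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value> <!-- concurrent map task slots per tasktracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- concurrent reduce task slots per tasktracker -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value> <!-- number of reduce tasks per job -->
  </property>
</configuration>
```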
Options specified with -D take priority over properties from the configuration
files. This is very useful: you can put defaults into configuration files and
then override them with the -D option as needed. A common example is setting
the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n, which
overrides the number of reducers set on the cluster or in any client-side
configuration files.
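The precedence described above can be sketched with a minimal pure-JDK example.
This is not Hadoop's actual Configuration class, just an illustration of how
-D options layered over file-based defaults resolve:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Hadoop's property precedence: site-file defaults
// first, then -D command-line options layered on top.
public class PropertyPrecedence {

  // Start from mapred-site.xml-style defaults, then apply -D options.
  public static Map<String, String> resolve(Map<String, String> siteDefaults,
                                            Map<String, String> dashDOptions) {
    Map<String, String> effective = new HashMap<>(siteDefaults);
    effective.putAll(dashDOptions); // -D wins over the configuration file
    return effective;
  }

  public static void main(String[] args) {
    Map<String, String> site = new HashMap<>();
    site.put("mapred.reduce.tasks", "1"); // cluster-wide default

    Map<String, String> cli = new HashMap<>();
    cli.put("mapred.reduce.tasks", "6"); // -Dmapred.reduce.tasks=6

    System.out.println(resolve(site, cli).get("mapred.reduce.tasks")); // prints 6
  }
}
```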
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into
an internal queue from where the job scheduler will pick it up and initialize
it. Initialization involves creating an object to represent the job being run,
which encapsulates its tasks, and bookkeeping information to keep track of the
tasks' status and progress.
To create the list of tasks to run, the job scheduler first retrieves the input
splits computed by the JobClient from the shared filesystem. It then creates
one map task for each split. The number of reduce tasks to create is determined
by the mapred.reduce.tasks property in the JobConf, which is set by the
setNumReduceTasks() method, and the scheduler simply creates this number of
reduce tasks to be run. Tasks are given IDs at this point.
mapred.reduce.tasks can be configured in the file mapred-site.xml, but a value
set there applies to the reducers of all jobs. We run all kinds of jobs, with
different data sizes and different priorities, and we cannot keep changing the
mapred-site.xml file continually. So we need the -Dmapred.reduce.tasks
parameter on the command line to override the configuration file.
2) A parameter like -Dmapred.reduce.tasks passed to RecommenderJob cannot be
passed on to RowSimilarityJob, because RowSimilarityJob is started as a new
AbstractJob. RowSimilarityJob runs in a new context with its own configuration,
so parameters passed to RecommenderJob are not forwarded to it.
If we pass -Dmapred.reduce.tasks=6 to RecommenderJob, all of its own sub-jobs
(itemIDIndex, toUserVector, countUsers, itemUserMatrix,
maybePruneItemUserMatrix, prePartialMultiply1, prePartialMultiply2,
partialMultiply, aggregateAndRecommend) work correctly, but RowSimilarityJob
starts in a new context, and the -Dmapred.reduce.tasks=6 parameter is lost when
RowSimilarityJob is called. So all three sub-jobs of RowSimilarityJob (weights,
pairwiseSimilarity, asMatrix) can only run using one reducer.
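Why the override is lost can be sketched in pure JDK terms. These are not
Mahout's actual classes; runViaMain() mimics calling RowSimilarityJob.main(),
which builds a brand-new configuration from the site defaults, while
runWithParentConf() mimics handing the parent job's configuration object down
(as, e.g., ToolRunner.run(conf, tool, args) would), which preserves the
command-line override:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of configuration propagation between a parent
// job and a sub-job; not Mahout/Hadoop code.
public class ConfPropagation {

  static final Map<String, String> SITE_DEFAULTS = new HashMap<>();
  static {
    SITE_DEFAULTS.put("mapred.reduce.tasks", "1"); // mapred-site.xml default
  }

  // Mimics RowSimilarityJob.main(...): starts from a fresh configuration,
  // so any -D override held by the parent is gone.
  public static String runViaMain() {
    Map<String, String> conf = new HashMap<>(SITE_DEFAULTS);
    return conf.get("mapred.reduce.tasks");
  }

  // Mimics passing the parent's configuration down: the override survives.
  public static String runWithParentConf(Map<String, String> parentConf) {
    return parentConf.get("mapred.reduce.tasks");
  }

  public static void main(String[] args) {
    Map<String, String> parentConf = new HashMap<>(SITE_DEFAULTS);
    parentConf.put("mapred.reduce.tasks", "6"); // -Dmapred.reduce.tasks=6

    System.out.println(runViaMain());                  // prints 1: override lost
    System.out.println(runWithParentConf(parentConf)); // prints 6: override kept
  }
}
```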
> add parameter -Dmapred.reduce.tasks when call job RowSimilarityJob in
> RecommenderJob
> ------------------------------------------------------------------------------------
>
> Key: MAHOUT-473
> URL: https://issues.apache.org/jira/browse/MAHOUT-473
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Han Hui Wen
> Assignee: Sean Owen
> Attachments: screenshot-1.jpg
>
>
> In RecommenderJob
> {code:title=RecommenderJob.java|borderStyle=solid}
>   int numberOfUsers = TasteHadoopUtils.readIntFromFile(getConf(), countUsersPath);
>   if (shouldRunNextPhase(parsedArgs, currentPhase)) {
>     /* Once DistributedRowMatrix uses the hadoop 0.20 API, we should refactor
>      * this call to something like new DistributedRowMatrix(...).rowSimilarity(...) */
>     try {
>       RowSimilarityJob.main(new String[] {
>           "-Dmapred.input.dir=" + maybePruneItemUserMatrixPath.toString(),
>           "-Dmapred.output.dir=" + similarityMatrixPath.toString(),
>           "--numberOfColumns", String.valueOf(numberOfUsers),
>           "--similarityClassname", similarityClassname,
>           "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItemConsidered + 1),
>           "--tempDir", tempDirPath.toString() });
>     } catch (Exception e) {
>       throw new IllegalStateException("item-item-similarity computation failed", e);
>     }
>   }
> {code}
> We do not pass the parameter -Dmapred.reduce.tasks when calling RowSimilarityJob.
> This causes all three RowSimilarityJob sub-jobs to run using 1 map and 1 reduce,
> so those sub-jobs cannot scale.