[
https://issues.apache.org/jira/browse/MAHOUT-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898939#action_12898939
]
Han Hui Wen commented on MAHOUT-473:
-------------------------------------
1) First, mapred.tasktracker.map.tasks.maximum in mapred-site.xml is different
from -Dmapred.reduce.tasks on the command line (or mapred.reduce.tasks in
mapred-site.xml).
The maximum number of map tasks that will run on a tasktracker at one time is
controlled by the mapred.tasktracker.map.tasks.maximum property, which defaults
to two tasks. There is a corresponding property for reduce tasks,
mapred.tasktracker.reduce.tasks.maximum, which also defaults to two tasks.
mapred.reduce.tasks, by contrast, is a per-job setting configured in
mapred-site.xml; its default value is 1.
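For reference, here is how the three properties mentioned above would look in
mapred-site.xml (the values shown are the defaults described above; this is an
illustrative fragment, not a copy of any particular cluster's file):

```xml
<!-- mapred-site.xml: cluster-side defaults -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value> <!-- concurrent map task slots per tasktracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- concurrent reduce task slots per tasktracker -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value> <!-- number of reduce tasks per job -->
  </property>
</configuration>
```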
Options specified with -D take priority over properties from the configuration
files. This is very useful: you can put defaults into configuration files and
then override them with the -D option as needed. A common example is setting
the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n, which
overrides the number of reducers set on the cluster or in any client-side
configuration files.
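The precedence described above can be sketched with a minimal pure-JDK example.
This is not Hadoop's actual Configuration class, just an illustration of how
-D options layered over file-based defaults resolve:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of Hadoop's property precedence: site-file defaults
// first, then -D command-line options layered on top.
public class PropertyPrecedence {

  // Start from mapred-site.xml-style defaults, then apply -D options.
  public static Map<String, String> resolve(Map<String, String> siteDefaults,
                                            Map<String, String> dashDOptions) {
    Map<String, String> effective = new HashMap<>(siteDefaults);
    effective.putAll(dashDOptions); // -D wins over the configuration file
    return effective;
  }

  public static void main(String[] args) {
    Map<String, String> site = new HashMap<>();
    site.put("mapred.reduce.tasks", "1"); // cluster-wide default

    Map<String, String> cli = new HashMap<>();
    cli.put("mapred.reduce.tasks", "6"); // -Dmapred.reduce.tasks=6

    System.out.println(resolve(site, cli).get("mapred.reduce.tasks")); // prints 6
  }
}
```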
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into
an internal queue from where the job scheduler will pick it up and initialize
it. Initialization involves creating an object to represent the job being run,
which encapsulates its tasks, and bookkeeping information to keep track of the
tasks' status and progress.
To create the list of tasks to run, the job scheduler first retrieves the input
splits computed by the JobClient from the shared filesystem. It then creates
one map task for each split. The number of reduce tasks to create is determined
by the mapred.reduce.tasks property in the JobConf, which is set by the
setNumReduceTasks() method, and the scheduler simply creates this number of
reduce tasks to be run. Tasks are given IDs at this point.
mapred.reduce.tasks can be configured in the file mapred-site.xml, but a value
set there applies to the reducers of all jobs. We run all kinds of jobs, with
different data sizes and different priorities, and we cannot keep changing the
mapred-site.xml file continually. So we need the -Dmapred.reduce.tasks
parameter on the command line to override the configuration file.
2) A parameter like -Dmapred.reduce.tasks passed to RecommenderJob cannot be
passed on to RowSimilarityJob, because RowSimilarityJob is started as a new
AbstractJob. RowSimilarityJob runs in a new context with its own configuration,
so parameters passed to RecommenderJob are not forwarded to it.
If we pass -Dmapred.reduce.tasks=6 to RecommenderJob, all of its own sub-jobs
(itemIDIndex, toUserVector, countUsers, itemUserMatrix,
maybePruneItemUserMatrix, prePartialMultiply1, prePartialMultiply2,
partialMultiply, aggregateAndRecommend) work correctly, but RowSimilarityJob
starts in a new context, and the -Dmapred.reduce.tasks=6 parameter is lost when
RowSimilarityJob is called. So all three sub-jobs of RowSimilarityJob (weights,
pairwiseSimilarity, asMatrix) can only run using one reducer.
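Why the override is lost can be sketched in pure JDK terms. These are not
Mahout's actual classes; runViaMain() mimics calling RowSimilarityJob.main(),
which builds a brand-new configuration from the site defaults, while
runWithParentConf() mimics handing the parent job's configuration object down
(as, e.g., ToolRunner.run(conf, tool, args) would), which preserves the
command-line override:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of configuration propagation between a parent
// job and a sub-job; not Mahout/Hadoop code.
public class ConfPropagation {

  static final Map<String, String> SITE_DEFAULTS = new HashMap<>();
  static {
    SITE_DEFAULTS.put("mapred.reduce.tasks", "1"); // mapred-site.xml default
  }

  // Mimics RowSimilarityJob.main(...): starts from a fresh configuration,
  // so any -D override held by the parent is gone.
  public static String runViaMain() {
    Map<String, String> conf = new HashMap<>(SITE_DEFAULTS);
    return conf.get("mapred.reduce.tasks");
  }

  // Mimics passing the parent's configuration down: the override survives.
  public static String runWithParentConf(Map<String, String> parentConf) {
    return parentConf.get("mapred.reduce.tasks");
  }

  public static void main(String[] args) {
    Map<String, String> parentConf = new HashMap<>(SITE_DEFAULTS);
    parentConf.put("mapred.reduce.tasks", "6"); // -Dmapred.reduce.tasks=6

    System.out.println(runViaMain());                  // prints 1: override lost
    System.out.println(runWithParentConf(parentConf)); // prints 6: override kept
  }
}
```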
> add parameter -Dmapred.reduce.tasks when call job RowSimilarityJob in
> RecommenderJob
> ------------------------------------------------------------------------------------
>
> Key: MAHOUT-473
> URL: https://issues.apache.org/jira/browse/MAHOUT-473
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.4
> Reporter: Han Hui Wen
> Assignee: Sean Owen
> Attachments: screenshot-1.jpg
>
>
> In RecommenderJob
> {code:title=RecommenderJob.java|borderStyle=solid}
>   int numberOfUsers = TasteHadoopUtils.readIntFromFile(getConf(), countUsersPath);
>   if (shouldRunNextPhase(parsedArgs, currentPhase)) {
>     /* Once DistributedRowMatrix uses the hadoop 0.20 API, we should refactor
>      * this call to something like new DistributedRowMatrix(...).rowSimilarity(...) */
>     try {
>       RowSimilarityJob.main(new String[] {
>           "-Dmapred.input.dir=" + maybePruneItemUserMatrixPath.toString(),
>           "-Dmapred.output.dir=" + similarityMatrixPath.toString(),
>           "--numberOfColumns", String.valueOf(numberOfUsers),
>           "--similarityClassname", similarityClassname,
>           "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItemConsidered + 1),
>           "--tempDir", tempDirPath.toString() });
>     } catch (Exception e) {
>       throw new IllegalStateException("item-item-similarity computation failed", e);
>     }
>   }
> {code}
> We do not pass the parameter -Dmapred.reduce.tasks when calling RowSimilarityJob.
> This causes all three RowSimilarityJob sub-jobs to run using 1 map and 1 reduce,
> so those sub-jobs cannot scale.