[ https://issues.apache.org/jira/browse/NUTCH-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611789#comment-16611789 ]
ASF GitHub Bot commented on NUTCH-2637:
---------------------------------------
sebastian-nagel closed pull request #381: NUTCH-2637 fix number of reducers to run
URL: https://github.com/apache/nutch/pull/381
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/src/java/org/apache/nutch/fetcher/FetcherJob.java b/src/java/org/apache/nutch/fetcher/FetcherJob.java
index f4b97cbbb..b7eb717a4 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherJob.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherJob.java
@@ -215,7 +215,7 @@ public FetcherJob(Configuration conf) {
StorageUtils.initReducerJob(currentJob, FetcherReducer.class);
if (numTasks == null || numTasks < 1) {
currentJob.setNumReduceTasks(currentJob.getConfiguration().getInt(
- "mapreduce.job.maps", currentJob.getNumReduceTasks()));
+ "mapreduce.job.reduces", currentJob.getNumReduceTasks()));
} else {
currentJob.setNumReduceTasks(numTasks);
}
@@ -248,7 +248,7 @@ public FetcherJob(Configuration conf) {
* @param shouldResume
* @param numTasks
* number of fetching tasks (reducers). If set to < 1 then use the
- * default, which is mapreduce.job.maps.
+ * default, which is mapreduce.job.reduces.
* @return 0 on success
* @throws Exception
*/
@@ -268,7 +268,7 @@ public int fetch(String batchId, int threads, boolean shouldResume,
* @param shouldResume
* @param numTasks
* number of fetching tasks (reducers). If set to < 1 then use the
- * default, which is mapreduce.job.maps.
+ * default, which is mapreduce.job.reduces.
* @param stmDetect
* If set true, sitemap detection is run.
* @param sitemap
@@ -327,7 +327,7 @@ public int run(String[] args) throws Exception {
+ " -crawlId <id> - the id to prefix the schemas to operate on, \n \t \t (default: storage.crawl.id)\n"
+ " -threads N - number of fetching threads per task\n"
+ " -resume - resume interrupted job\n"
- + " -numTasks N - if N > 0 then use this many reduce tasks for fetching \n \t \t (default: mapreduce.job.maps)"
+ + " -numTasks N - if N > 0 then use this many reduce tasks for fetching \n \t \t (default: mapreduce.job.reduces)"
+ " -sitemap - only sitemap files are fetched, defaults to false"
+ " -stmDetect - sitemap files are detected from robot.txt file";
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Number of fetcher reducers is misconfigured when the arg not passed
> -------------------------------------------------------------------
>
> Key: NUTCH-2637
> URL: https://issues.apache.org/jira/browse/NUTCH-2637
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 2.3, 2.3.1
> Reporter: Fumio Nakajima
> Priority: Minor
> Fix For: 2.4
>
>
> I'm kind of new to this, so sorry if I'm wrong.
> The number of fetcher reducers is currently set to the value
> of "mapreduce.job.maps" when the -numTasks argument is not passed. It should be
> "mapreduce.job.reduces".
>
> [https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/fetcher/FetcherJob.java#L216]
> Line: 216, branch-2.X
> {code:java}
> if (numTasks == null || numTasks < 1) {
> currentJob.setNumReduceTasks(currentJob.getConfiguration().getInt(
> "mapreduce.job.maps", currentJob.getNumReduceTasks()));
> } else {
> currentJob.setNumReduceTasks(numTasks);
> }
> {code}
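
For illustration, the corrected fallback can be sketched without a Hadoop dependency. This is only a minimal sketch, not the code in FetcherJob: the class ReducerCountSketch and its helper methods are hypothetical names, and a plain Map stands in for Hadoop's Configuration. Only the two property keys and the "numTasks == null || numTasks < 1" condition are taken from the issue.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the reducer-count fallback discussed in NUTCH-2637.
 * Assumption: a plain Map stands in for Hadoop's Configuration,
 * and all names in this class are hypothetical.
 */
public class ReducerCountSketch {

  /** Mimics the lookup-with-default behaviour of Configuration.getInt. */
  public static int getInt(Map<String, String> conf, String key, int defaultValue) {
    String value = conf.get(key);
    if (value == null) {
      return defaultValue;
    }
    return Integer.parseInt(value.trim());
  }

  /**
   * Picks the number of reduce tasks: an explicit numTasks wins when it is
   * positive, otherwise "mapreduce.job.reduces" is read, and the job's
   * current reducer count is the last fallback.
   */
  public static int resolveNumReduceTasks(Map<String, String> conf,
      Integer numTasks, int jobDefault) {
    if (numTasks == null || numTasks < 1) {
      // The bug read "mapreduce.job.maps" here instead.
      return getInt(conf, "mapreduce.job.reduces", jobDefault);
    }
    return numTasks;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("mapreduce.job.maps", "8");
    conf.put("mapreduce.job.reduces", "2");

    // No -numTasks given: the reducer setting is used, not the mapper one.
    System.out.println(resolveNumReduceTasks(conf, null, 1)); // prints 2
    // Explicit positive value takes precedence over the configuration.
    System.out.println(resolveNumReduceTasks(conf, 4, 1)); // prints 4
    // Nothing configured: fall back to the job's current reducer count.
    System.out.println(resolveNumReduceTasks(new HashMap<>(), 0, 1)); // prints 1
  }
}
```

With the old key, the first call would have returned 8 (the number of map tasks), which is the misconfiguration reported above.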
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)