[
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340743#comment-16340743
]
ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------
sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take
NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164054172
##########
File path: src/bin/crawl
##########
@@ -171,6 +175,8 @@ fi
CRAWL_PATH="$1"
LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
Review comment:
Why should the heap size depend on the number of reducers? In a large-scale
crawl the reducers run independently on different nodes, possibly even
sequentially if not enough computing resources are available. Since
mapred.child.java.opts is also used for the map tasks, and it is often not
possible to force a fixed number of map tasks, it's better to define the heap
size per task (usually via mapreduce.map.java.opts and
mapreduce.reduce.java.opts).
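The suggestion above could be sketched in the crawl script roughly as follows.
This is only an illustration, not the committed fix: the variable names
NUTCH_HEAP_MB and COMMON_OPTIONS are assumptions, and the default of 1000 MB
mirrors the usual NUTCH_HEAPSIZE default rather than anything in this patch.

```shell
#!/usr/bin/env bash
# Hedged sketch: give each task the configured heap directly, instead of
# dividing the total heap by the number of reduce tasks. Names below
# (NUTCH_HEAP_MB, COMMON_OPTIONS) are illustrative, not from the patch.

# Fall back to 1000 MB if NUTCH_HEAPSIZE is unset (assumed default).
NUTCH_HEAP_MB="${NUTCH_HEAPSIZE:-1000}"

# Pass the same per-task heap to both map and reduce JVMs; each task
# gets the full amount no matter how many tasks the job spawns.
COMMON_OPTIONS=(
  -D "mapreduce.map.java.opts=-Xmx${NUTCH_HEAP_MB}m"
  -D "mapreduce.reduce.java.opts=-Xmx${NUTCH_HEAP_MB}m"
)

# The options would then be spliced into each Hadoop job invocation, e.g.:
#   "$bin/nutch" generate "${COMMON_OPTIONS[@]}" ...
echo "${COMMON_OPTIONS[@]}"
```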
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
> Issue Type: Improvement
> Reporter: Moreno Feltscher
> Assignee: Lewis John McGibbney
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)