[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343307#comment-16343307 ]
ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164417155

##########
File path: src/bin/crawl
##########
@@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
In [local mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation) all reducer tasks run in a single JVM instance, so there is no per-task heap to divide.

Only in [pseudo-distributed mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation) could this make some sense, provided that the remaining resources (e.g. the number of CPUs) make it possible to run all reduce tasks in parallel.

In distributed mode you want to define the maximum heap size based on the configuration of your cluster nodes, because that determines how many tasks can run in parallel on every node (in combination with other resource limits). The heap size configured for a single task usually states what the task requires to run without hitting an out-of-memory error. The YARN resource manager verifies that the heap size configured for the job's tasks does not exceed the resource limits configured on the cluster nodes; otherwise the job will fail.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2501
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2501
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
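
To illustrate the distinction drawn in the review comment, here is a minimal shell sketch of a per-task heap calculation that depends on the run mode. The function name `child_heap_mb`, its `mode` argument, and the sample values are assumptions made for illustration only; they are not part of `bin/crawl` or of the pull request.

```shell
#!/bin/sh
# Hypothetical helper (not part of bin/crawl): print a per-task heap size in MB.
# Arguments: mode (local|pseudo), total heap in MB, number of tasks.
child_heap_mb() {
  mode="$1"
  total_mb="$2"
  num_tasks="$3"
  case "$mode" in
    local)
      # Local mode: all tasks share a single JVM, so the heap is not divided.
      echo "$total_mb"
      ;;
    pseudo)
      # Pseudo-distributed mode: tasks may run in parallel JVMs on one machine,
      # so splitting the heap across tasks could make sense.
      echo $(( total_mb / num_tasks ))
      ;;
    *)
      # Distributed mode: the limit should come from the cluster node
      # configuration, so this sketch refuses to guess.
      echo "unsupported mode: $mode" >&2
      return 1
      ;;
  esac
}

child_heap_mb local 4000 4    # prints 4000
child_heap_mb pseudo 4000 4   # prints 1000
```

The sketch deliberately returns an error for any other mode, reflecting the point that in distributed mode the per-task heap has to be derived from the cluster's resource limits rather than from a local variable.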