[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343307#comment-16343307
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r164417155
 
 

 ##########
 File path: src/bin/crawl
 ##########
 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   In [local mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation)
 all reduce tasks run in a single JVM instance. Dividing the heap by the number
 of tasks could make some sense only in
 [pseudo-distributed mode](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation),
 and only if the remaining resources (e.g. the number of CPUs) make it possible
 to run all reduce tasks in parallel. In distributed mode you want to define the
 maximum heap size based on the configuration of your cluster nodes, because
 that determines how many tasks can run in parallel on every node (in
 combination with other resource limits). The heap size configured for a single
 task usually expresses what the task needs in order to run without hitting an
 out-of-memory error. The YARN resource manager verifies that the heap size
 configured for the job's tasks does not exceed the resource limits configured
 on the cluster nodes; otherwise the job fails.
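
 A rough sketch of the distinction drawn above, not the script's actual logic.
 The helper name `child_heap_mb` and the mode labels `local`, `pseudo` and
 `distributed` are hypothetical; only `NUTCH_HEAP_MB` and `NUM_TASKS` come from
 the diff. It assumes the number of tasks is a positive integer.

```shell
# Hypothetical helper, not part of src/bin/crawl.
# Usage: child_heap_mb MODE HEAP_MB NUM_TASKS
# Prints the heap size (in MB) to give a single task.
child_heap_mb() {
  local mode="$1" heap_mb="$2" num_tasks="$3"
  case "$mode" in
    pseudo)
      # Tasks run as separate JVMs on one machine: split the heap budget.
      # Shell arithmetic is integer division, so the result is rounded down.
      echo $(( heap_mb / num_tasks ))
      ;;
    *)
      # local: all tasks share one JVM, so the heap is not divided.
      # distributed: the value is assumed to have been chosen from the
      # cluster node configuration already, so it is passed through as-is.
      echo "$heap_mb"
      ;;
  esac
}

child_heap_mb local 4000 2     # prints 4000
child_heap_mb pseudo 4000 2    # prints 2000
```

 In the crawl script this would stand in for the unconditional `expr` division,
 e.g. `JAVA_CHILD_HEAP_MB=$(child_heap_mb "$mode" "$NUTCH_HEAP_MB" "$NUM_TASKS")`,
 where `$mode` is again an assumed variable.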


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2501
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2501
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>




