[
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470312#comment-16470312
]
ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------
sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take
NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r187313029
##########
File path: src/bin/crawl
##########
@@ -171,6 +175,8 @@ fi
CRAWL_PATH="$1"
LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
Review comment:
To keep it simple, I would suggest:
1. remove the explicit `mapred.child.java.opts`
2. add a new environment variable `NUTCH_HADOOP_OPTS` which is appended to the commonOptions
In local mode it is still sufficient to set the Java heap size via the environment
variable NUTCH_OPTS. In distributed mode, this would allow fine-grained memory
settings, e.g.
```
export NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"
```
Both settings are required because the Java heap (`-Xmx`) must be smaller than
the total memory allocated to a map/reduce task container.
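The suggested hook could be wired into the crawl script roughly as follows. This is a sketch, not the actual patch: the `commonOptions` base value and the example `NUTCH_HADOOP_OPTS` assignment are illustrative, though the variable names follow the src/bin/crawl conventions.

```shell
#!/bin/sh
# Example value, as a user might export it before running bin/crawl
# (illustrative; in practice this would come from the environment):
NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"

# Illustrative base options, standing in for what bin/crawl already builds:
commonOptions="-D mapreduce.job.reduces=2"

# Append the user-supplied Hadoop options, if any, so every job started
# by the script picks them up:
if [ -n "$NUTCH_HADOOP_OPTS" ]; then
  commonOptions="$commonOptions $NUTCH_HADOOP_OPTS"
fi

echo "$commonOptions"
```

Leaving `NUTCH_HADOOP_OPTS` unset keeps the current behavior, so the change stays backward compatible.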
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
> Issue Type: Improvement
> Reporter: Moreno Feltscher
> Assignee: Lewis John McGibbney
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)