[Nutch Wiki] Update of "NutchHadoopSingleNodeTutorial" by OmkarReddy

Apache Wiki Thu, 09 Nov 2017 04:56:55 -0800

The "NutchHadoopSingleNodeTutorial" page has been changed by OmkarReddy:
https://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial?action=diff&rev1=7&rev2=8

  
  '''1. Step: Download and install Hadoop in pseudo-distributed mode, as 
explained here:'''
  
-  [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html| Hadoop Single 
Node Setup]].
+  
[[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html|
 Hadoop Single Node Setup]].
  
  Here, it’s important to set ''HADOOP_HOME'' to point to the root of the 
hadoop installation. 
  Like ''JAVA_HOME'', it has to be set globally, so that the hadoop start-up 
script can be called from anywhere. 
  
  (Check this by running: ' ''echo $HADOOP_HOME'' ' in the console, which 
should return the path to the root of your hadoop installation.)
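
The check above can be made slightly stricter. The following is a minimal sketch, not part of the wiki page: `check_hadoop_home` is a hypothetical helper name, and the sketch assumes that a working installation has an executable `bin/hadoop` under `$HADOOP_HOME`.

```shell
# Sketch: verify that HADOOP_HOME is set and looks like a Hadoop root.
# check_hadoop_home is a hypothetical helper, not something Hadoop ships.
check_hadoop_home() {
  # Fail early if the variable is unset or empty.
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  # Assumption: the launcher script lives at bin/hadoop under the root.
  if [ ! -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "no executable bin/hadoop under $HADOOP_HOME"
    return 1
  fi
  echo "HADOOP_HOME=$HADOOP_HOME"
}
```

Calling `check_hadoop_home` in a fresh terminal shows whether the variable really is set globally (for example via an `export HADOOP_HOME=...` line in the shell profile) and not just in the current session.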
  
- '''''N.B.''''' Make sure your hadoop installation is working correctly before 
trying to integrate Nutch!
+ '''''N.B.''''' Make sure your hadoop installation is working correctly by 
running the examples as mentioned in the link above before trying to integrate 
Nutch!
  
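  E.g. try to connect to the jobtracker at: http://localhost:50030/ (Hadoop 1.x; with Hadoop 2.x and YARN, the ResourceManager web interface is at http://localhost:8088/ by default). 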
  
@@ -22, +22 @@

  
  '''2. Step: Download and install Nutch 1.x:'''
  
- Download a stable source version e.g. apache-nutch-1.8-src.zip from 
http://nutch.apache.org/downloads.html.
+ Download a stable source version e.g. apache-nutch-1.13-src.zip from 
http://nutch.apache.org/downloads.html.
  
- For installation of apache-nutch-1.8-src.zip:
+ For installation of apache-nutch-1.13-src.zip:
  
-  * Unzip and, in a terminal, cd into the freshly extracted folder 
''apache-nutch-1.8''
+  * Unzip and, in a terminal, cd into the freshly extracted folder 
''apache-nutch-1.13''
  
   * Run ‘ant runtime’ in this folder
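
The two installation steps above can be sketched as a small shell function. This is an illustration under stated assumptions: `build_nutch_runtime` is a hypothetical helper name, the default folder name matches the 1.13 source archive, and `ant` is assumed to be on the `PATH`. The function also checks that the build produced both `runtime/local` and `runtime/deploy`.

```shell
# Sketch: run 'ant runtime' inside the extracted Nutch source folder.
# build_nutch_runtime is a hypothetical helper, not part of Nutch.
build_nutch_runtime() {
  src_dir="${1:-apache-nutch-1.13}"
  (
    # Work in a subshell so the caller's working directory is unchanged.
    cd "$src_dir" || exit 1
    ant runtime || exit 1
    # Both runtime flavours should exist after a successful build.
    for d in runtime/local runtime/deploy; do
      [ -d "$d" ] || { echo "missing $d"; exit 1; }
    done
    echo "built runtime/local and runtime/deploy"
  )
}
```

A typical call, after `unzip apache-nutch-1.13-src.zip`, would be `build_nutch_runtime apache-nutch-1.13`.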
  
  This command builds the runtime environment. ''runtime/local'' stores 
all
  configuration files, libraries etc., but it does not use the hadoop version 
that has been set up here (pseudo-distributed mode). Instead, it uses the local 
(standalone) non-distributed version, which is often used for debugging and is 
described in more detail here: 
- [[http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html#Local| Hadoop 
Standalone Setup]].
+ 
[[https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation|
 Hadoop Standalone Setup]].
  
  
  However, the nutch-job jar used for hadoop in pseudo-distributed mode lives 
in 
  ''runtime/deploy/''. 
  As a consequence, any modification to the configuration files in 
''$NUTCH/conf'' (the configuration directory at the root) requires
- a re-build with ‘ant’ to make sure the changes become part of the nutch-job 
jar as well.   
+ a re-build with ‘ant’ to make sure the changes become part of the nutch-job 
jar as well.
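
One way to notice a forgotten re-build is to compare file timestamps. The sketch below is only an illustration: `needs_rebuild` is a hypothetical helper name, and it assumes the job archive in `runtime/deploy/` carries a `.job` extension, which may differ between Nutch versions.

```shell
# Sketch: succeed (exit 0) if conf/ was modified after the job archive was built.
# needs_rebuild is a hypothetical helper; the *.job pattern is an assumption.
needs_rebuild() {
  nutch_root="$1"
  job_file=""
  # Pick the first job archive found in runtime/deploy, if any.
  for f in "$nutch_root"/runtime/deploy/*.job; do
    if [ -e "$f" ]; then
      job_file="$f"
      break
    fi
  done
  # No archive yet: a build is needed in any case.
  if [ -z "$job_file" ]; then
    return 0
  fi
  # List configuration files that are newer than the archive.
  changed=$(find "$nutch_root/conf" -type f -newer "$job_file")
  [ -n "$changed" ]
}
```

For example, `needs_rebuild "$NUTCH" && (cd "$NUTCH" && ant runtime)` re-builds only when a configuration file has changed since the last build.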
+ 
+ '''''N.B.''''' Make sure that the property mapreduce.framework.name in 
etc/hadoop/mapred-site.xml is set as mentioned in the hadoop documentation 
above.    
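
To see what the property is currently set to, something like the following sketch can help. `framework_name` is a hypothetical helper name, and the sketch assumes the `<value>` element directly follows the `<name>` element in the file; it is a plain text match, not a real XML parser. The Hadoop single-node guide linked above uses the value `yarn` for this property when running on YARN.

```shell
# Sketch: print the value of mapreduce.framework.name from a mapred-site.xml.
# framework_name is a hypothetical helper; this is a text match, not XML parsing.
framework_name() {
  # Join all lines first so <name> and <value> may sit on separate lines.
  tr -d '\n' < "$1" |
    sed -n 's:.*<name>mapreduce\.framework\.name</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p'
}
```

A call such as `framework_name "$HADOOP_HOME/etc/hadoop/mapred-site.xml"` prints the configured value, or nothing if the property is missing.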
  
  See: NutchTutorial on how to set up a specific configuration and run a crawl.
