Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch2Tutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Nutch2Tutorial?action=diff&rev1=11&rev2=12

- = Nutch 2.0 Tutorial =
+ = Nutch 2.X Tutorial =
  {{attachment:nutch_logo_medium.gif}} 
{{http://gora.apache.org/resources/img/gora-logo.png}} 
{{http://hbase.apache.org/images/hbase_logo.png}}
  
- This document describes how to get Nutch 2.0 to use HBase as a storage 
backend for Gora.
  
-  * Grab the latest distribution of Nutch 2.X from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]
-  * Install and configure HBase. You can get it 
[[http://archive.apache.org/dist/hbase/hbase-0.94.14/|here]] ('''N.B.''' Gora 
0.4 uses HBase 0.94.14 we therefore suggest you use this version if possible.
-  * Specify the GORA backend in nutch-site.xml
+ <<TableOfContents(4)>>
+ 
+ == Introduction ==
+ 
+ This document describes how to get Nutch 2.X to use HBase as a storage 
backend for Gora. It assumes that you have ''a working'' knowledge of 
configuring Nutch 1.X, because configuration in 2.X is currently more complex. 
Take this into consideration before progressing any further. We therefore 
'''strongly advise''' that you first work through the 
[[http://wiki.apache.org/nutch/NutchTutorial|Nutch 1.X tutorial]].
+ 
+ == Obtaining Software and Configuration ==
+ 
+  * Grab the latest distribution of Nutch 2.X from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. '''Do NOT build the 
source yet'''. From now on we will refer to the directory where the Nutch code 
resides as $NUTCH_HOME.
+  * Download and configure HBase 0.94.14. You can get it 
[[http://archive.apache.org/dist/hbase/hbase-0.94.14/|here]]. '''N.B.''' Gora 
0.4 uses HBase 0.94.14, so we suggest you use this version if possible. 
If you decide to use another version of HBase, please do not be surprised if 
the stack does not work. You should also obtain the 
[[http://hbase.apache.org/book/quickstart.html|current documentation for 
HBase]]; however, the version of HBase we recommend may not match the current 
documentation. Please keep this in mind and use your initiative.
+  * Specify the Gora backend in $NUTCH_HOME/conf/nutch-site.xml, along with 
all of the other configuration options suggested within the 
[[http://wiki.apache.org/nutch/NutchTutorial|Nutch 1.X tutorial]].
  
  {{{
  <property>
@@ -17, +24 @@

  </property>
  }}}
  
-  * Ensure the HBase gora-hbase dependency is available in ivy/ivy.xml
+  * Ensure the gora-hbase dependency is available in 
$NUTCH_HOME/ivy/ivy.xml
  
  {{{
      <!-- Uncomment this to use HBase as Gora backend. -->
@@ -25, +32 @@

      <dependency org="org.apache.gora" name="gora-hbase" rev="0.4" 
conf="*->default" />
  }}}
  
-  * Ensure that HBaseStore is set as the default datastore in gora.properties
+  * Ensure that HBaseStore is set as the default datastore in 
$NUTCH_HOME/conf/gora.properties. Other documentation for HBaseStore can be 
found [[http://gora.apache.org/current/gora-hbase.html|here]].
  
  {{{
      gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  }}}
  
-  * '''N.B.''' It's probably worth setting all your usual configuration 
settings within nutch-site.xml etc. before progressing.
+  * '''N.B.''' It's probably worth checking and setting all your usual 
configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before 
progressing.
-  * Compile Nutch -> ant runtime
+  * Compile Nutch via:
+ {{{
+ ant runtime
+ }}}
-  * Make sure HBase is started and working properly as per the quick start 
tutorial [[http://hbase.apache.org/book/quickstart.html|here]]
+  * Make sure HBase is started and working properly as per the 
[[http://hbase.apache.org/book/quickstart.html|quick start tutorial]].
+  * Create a list of URLs as you would do within the Nutch 1.X tutorial.
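
Before moving on, the gora.properties setting from the list above can be sanity-checked from a shell. The following is only a minimal sketch, assuming a POSIX shell and that $NUTCH_HOME is exported; the function name `check_gora_default` is hypothetical and not part of Nutch or Gora.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nutch): succeeds only if the given
# gora.properties file sets HBaseStore as the default datastore on an
# uncommented line.
check_gora_default() {
    grep -Eq '^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' "$1" 2>/dev/null
}

# Run the check against the real file only when NUTCH_HOME is set.
if [ -n "${NUTCH_HOME:-}" ]; then
    if check_gora_default "$NUTCH_HOME/conf/gora.properties"; then
        echo "gora.properties: HBaseStore is the default datastore"
    else
        echo "gora.properties: HBaseStore is NOT the default datastore" >&2
    fi
fi
```

A commented-out line (one starting with `#`) is deliberately not counted as a match, since that is the most likely way for the setting to be present but inactive.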
  
+ == Invoke Nutch ==
+ 
- You should then be able to use it. Try going to'' 
$NUTCH_HOME/runtime/local/bin'' and do :
+ You should then be able to inject URLs into HBase. Go to 
''$NUTCH_HOME/runtime/local/bin'' and run:
  
  {{{
    ./nutch inject /someseedDir
    ./nutch readdb
  }}}
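
As a rough end-to-end illustration of the two commands above, the sketch below writes a one-line seed list and then injects it. It assumes a POSIX shell; the seed directory location, the `SEED_DIR` variable and the example URL are placeholders chosen for illustration, and the `-stats` option to readdb is an assumption about the 2.X command (running `nutch readdb` with no arguments should print the options your version actually supports).

```shell
#!/bin/sh
# Placeholder seed directory; override SEED_DIR to use your own location.
SEED_DIR="${SEED_DIR:-${TMPDIR:-/tmp}/nutch-seeds}"
mkdir -p "$SEED_DIR"

# One URL per line, as for a Nutch 1.X seed list. The URL is only an example.
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Only call Nutch if a built runtime is actually present under NUTCH_HOME.
NUTCH_BIN="${NUTCH_HOME:-}/runtime/local/bin/nutch"
if [ -n "${NUTCH_HOME:-}" ] && [ -x "$NUTCH_BIN" ]; then
    "$NUTCH_BIN" inject "$SEED_DIR"
    # Assumption: -stats is accepted by readdb in your Nutch 2.X version.
    "$NUTCH_BIN" readdb -stats
else
    echo "nutch runtime not found; seed list written to $SEED_DIR/seed.txt"
fi
```

HBase must already be running for the inject step to succeed.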
+ 
+ 
  
  '''N.B.''' The crawl command in the bin/nutch script is deprecated. You 
should use the individual commands or, alternatively, the bin/crawl script, 
which effectively chains together the individual commands.
  
