Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=78&rev2=79

   * Java Runtime/Development Environment (1.7)
   * (Source build only) Apache Ant: http://ant.apache.org/
  
- == 1. Install Nutch ==
+ == Install Nutch ==
  === Option 1: Setup Nutch from a binary distribution ===
   * Download a binary package (`apache-nutch-1.X-bin.zip`) from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
   * Unzip your binary Nutch package. There should be a folder 
`apache-nutch-1.X`.
@@ -43, +43 @@

   * config files should be modified in `apache-nutch-1.X/runtime/local/conf/`
   * `ant clean` will remove this directory (keep copies of modified config 
files)
  
- == 2. Verify your Nutch installation ==
+ == Verify your Nutch installation ==
   * Run `bin/nutch` - you can confirm a correct installation if you see output similar to the following:
  
  {{{
@@ -93, +93 @@

  
  Note that the `LMC-032857` above should be replaced with your machine name.
  
- == 3. Crawl your first website ==
+ == Crawl your first website ==
  Nutch requires two configuration changes before a website can be crawled:
  
   1. Customize your crawl properties, where, at a minimum, you provide a name for your crawler so that external servers can identify it
   1. Set a seed list of URLs to crawl
  
- === 3.1 Customize your crawl properties ===
+ === Customize your crawl properties ===
   * Default crawl properties can be viewed and edited within `conf/nutch-default.xml`; most of these can be used without modification
   * The file `conf/nutch-site.xml` serves as a place to add your own custom crawl properties that override `conf/nutch-default.xml`. The only required modification for this file is to override the `value` field of the `http.agent.name` property
    . i.e. Add your agent name in the `value` field of the `http.agent.name` property in `conf/nutch-site.xml`, for example:
@@ -110, +110 @@

   <value>My Nutch Spider</value>
  </property>
  }}}
- === 3.2 Create a URL seed list ===
+ === Create a URL seed list ===
   * A URL seed list contains a list of web sites, one per line, which Nutch will crawl
   * The file `conf/regex-urlfilter.txt` contains regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download
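The seed-list setup above can be sketched as a short shell session; the directory name `urls/` and file name `seed.txt` are conventional illustrative choices, not names mandated by Nutch:

```shell
# Create a seed directory and seed list (assumption: the names `urls/` and
# `seed.txt` are conventional choices, not required by Nutch itself)
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
# One URL per line; append further sites the same way:
echo "http://apache.org/" >> urls/seed.txt
cat urls/seed.txt
```
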
  
@@ -272, +272 @@

       Usage: bin/nutch solrclean <crawldb> <solrurl>
      Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr
  }}}
- === 3.5. Using the crawl script ===
+ === Using the crawl script ===
- If you have followed the 3.2 section above on how the crawling can be done 
step by step, you might be wondering how a bash script can be written to 
automate all the process described above.
 + If you have followed the section above describing how the crawling can be done step by step, you might be wondering whether a bash script could automate the whole process.
  
  Nutch developers have written one for you :), and it is available at 
[[bin/crawl]].
  
@@ -283, +283 @@

  }}}
  The crawl script has a lot of parameters set, and you can modify them to suit your needs. It is best to understand these parameters before setting up big crawls.
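An invocation of the crawl script might look like the following sketch. The argument order (seed directory, crawl directory, Solr URL, number of rounds) and all of the values shown are assumptions for illustration; run `bin/crawl` with no arguments to see the exact usage for your version:

```shell
# Build up a hypothetical bin/crawl invocation (all values are assumptions;
# check `bin/crawl` without arguments for the exact usage of your version)
SEED_DIR=urls                              # directory holding the seed list
CRAWL_DIR=crawl                            # where crawl data will be stored
SOLR_URL=http://localhost:8983/solr/       # Solr endpoint used for indexing
ROUNDS=2                                   # number of generate/fetch/parse/update rounds
# From apache-nutch-1.X/runtime/local/ you would then run:
#   bin/crawl "$SEED_DIR" "$CRAWL_DIR" "$SOLR_URL" "$ROUNDS"
echo "bin/crawl $SEED_DIR $CRAWL_DIR $SOLR_URL $ROUNDS"
```
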
  
- == 4. Setup Solr for search ==
+ == Setup Solr for search ==
   * download binary file from 
[[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
-  * unzip to `$HOME/apache-solr-3.X`, we will now refer to this as 
`${APACHE_SOLR_HOME}`
 +  * unzip to `$HOME/apache-solr`; we will refer to this directory as `${APACHE_SOLR_HOME}`
   * `cd ${APACHE_SOLR_HOME}/example`
   * `java -jar start.jar`
  
- == 5. Verify Solr installation ==
+ == Verify Solr installation ==
  After you have started Solr, you should be able to access the admin console at the following link:
  
  {{{
  http://localhost:8983/solr/#/
  }}}
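A quick way to confirm reachability from the command line is sketched below; the use of `curl` and the expectation of an HTTP 200 response are assumptions about your environment, and the actual request is left commented out so the snippet is safe to run anywhere:

```shell
# Reachability-check sketch (assumption: the Solr example server is running
# locally on its default port 8983, and curl is available)
SOLR_ADMIN="http://localhost:8983/solr/"
# curl -s -o /dev/null -w '%{http_code}' "$SOLR_ADMIN"   # expect 200 once Solr is up
echo "Solr admin console expected at $SOLR_ADMIN"
```
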
- == 6. Integrate Solr with Nutch ==
+ == Integrate Solr with Nutch ==
  We now have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). The steps below delegate searching to Solr so that the crawled links become searchable:
  
   * Back up the original Solr example schema.xml:<<BR>>
