[Nutch Wiki] Trivial Update of "NutchTutorial" by LewisJohnMcgibbney

Apache Wiki Sat, 13 Jun 2015 10:37:58 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NutchTutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=79&rev2=80

  ## page was renamed from RunningNutchAndSolr
  ## Lang: En
  == Introduction ==
- Apache Nutch is an open source Web crawler written in Java. By using it, we 
can find Web page hyperlinks in an automated manner, reduce lots of maintenance 
work, for example checking broken links, and create a copy of all the visited 
pages for searching over. That’s where Apache Solr comes in. Solr is an open 
source full text search framework, with Solr we can search the visited pages 
from Nutch. Luckily, integration between Nutch and Solr is pretty 
straightforward as explained below.
- 
+ Nutch is a mature, production-ready Web crawler. Nutch 1.x enables 
fine-grained configuration, relying on Apache Hadoop data structures, which are 
great for batch processing.
+ Being pluggable and modular has its benefits: Nutch provides extensible 
interfaces such as Parse, Index and ScoringFilter for custom implementations, 
e.g. Apache Tika for parsing. Additionally, pluggable indexing exists for 
Apache Solr, Elasticsearch, SolrCloud, etc.
+ With Nutch we can discover Web page hyperlinks automatically, reduce 
maintenance work such as checking for broken links, and create a copy of all 
the visited pages for searching over.
+ This tutorial explains how to use Nutch with Apache Solr. Solr is an open 
source full-text search framework; with Solr we can search the pages that 
Nutch has visited. Luckily, integration between Nutch and Solr is pretty 
straightforward.
  Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr 
integration. It also removes the legacy dependence upon both Apache Tomcat for 
running the old Nutch Web Application and upon Apache Lucene for indexing. Just 
download a binary release from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]].
+ 
+ == Learning Outcomes ==
+ By the end of this tutorial you will have:
+  * Configured a local Nutch crawler set up to crawl on one machine
+  * Learned how to understand and configure the Nutch runtime configuration, 
including seed URL lists, URLFilters, etc.
+  * Executed a Nutch crawl cycle and viewed the results of the Crawl 
Database
+  * Indexed Nutch crawl records into Apache Solr for full-text search
+ 
+ Any issues with this tutorial should be reported to the 
[[http://nutch.apache.org/mailing_lists.html|Nutch user@]] list.
  
  == Table of Contents ==
  <<TableOfContents(3)>>
