[Nutch Wiki] Trivial Update of "Features" by LewisJohnMcgibbney

Apache Wiki Tue, 05 Jul 2011 21:42:17 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "Features" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Features?action=diff&rev1=19&rev2=20

+ This page acts as an up-to-date resource for features included in the most 
current stable release of Nutch (1.3 at the time of writing). 
- Missing from the current Nutch documentation (Tutorial, FAQ) is a list of 
features. This wiki page could help, if someone who knows the answers can edit 
it.
- 
- (Please reformat this text and divide into feature lists, questions and 
questions & answers). 
  
  == Features ==
  
+  * Fetching and parsing are done separately by default; this reduces the risk 
of an error corrupting the fetch/parse stage of a Nutch crawl.
+  * Plugins have been overhauled as a direct result of the removal of the 
legacy Lucene dependency for indexing and search.
+  * The set of plugins shipped with Nutch for processing various document 
types has been refined. Plain text, XML, OpenDocument (OpenOffice.org), 
Microsoft Office (Word, Excel, PowerPoint), PDF, RTF and MP3 (ID3 tags) are all 
now parsed by the '''Tika''' plugin. The only parser plugins now shipped with 
Nutch are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika and ZIP.
-  * Fetching, parsing and indexing, in parallel and/or distributed
-  * Plugins
-  * Many formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), 
Microsoft Office (Word, Excel, Powerpoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 
tags)
-  * Ontology
-  * Clustering
   * MapReduce
   * Distributed filesystem (via Hadoop)
   * Link-graph database
