[Nutch Wiki] Trivial Update of "Features" by LewisJohnMcgibbney

Apache Wiki Tue, 05 Jul 2011 21:54:01 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "Features" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Features?action=diff&rev1=20&rev2=21

  
  == Questions and Answers ==
  
-  *What kind of searches does Nutch support? (quoted, nested, truncation, 
wildcarding [and where], Boolean),
-     * "...." (phrase search?), + (what is this for?), - (negation) and 
fieldname:term.  No "AND" or "OR".  The and-logic is implied.
- 
-  *Is stemming an option?
-     * According to the [[http://www.lucenebook.com/|Lucene in Action]] book: 
"Nutch does not use stemming or term aliasing of any kind.  Search engines have 
not historically done much stemming, but it is a question that comes up 
regularly." -- page 329
- 
-  *What kind of stemming does Nutch use? (and can you add exceptions/changes?)
-     * See previous answer :)
- 
-  *Does Nutch support Boolean operators? (can you use Google-like plus or 
minus or are you stuck with 1990s terms?)
-     * No
- 
   *How does the search engine handle punctuation and special characters? (and 
what's configurable?)
      * They are treated like a space.
  
   *Which document formats are supported?
-   * Guessing from the names of the available parser plugins, this is probably 
it.  However, only the plain text and HTML are enabled by default.  Edit 
conf/nutch-site.xml and change the value of plugin.includes property to include 
the plugins for the document types that you want Nutch to handle:
+   * Supported formats depend on the available parser plugins mentioned 
above. Only some are enabled by default, because most parsing is now 
delegated to Tika in an effort to clean up the Nutch codebase.  Edit 
conf/nutch-site.xml and change the value of the plugin.includes property to 
include the plugins for the document types that you want Nutch to handle. 
Also have a look at conf/parse-plugins.xml for more details of plugin 
implementations. To recap:
-    * Plain Text (plugin: parse-text)
+    * Plain Text (plugin: parse-tika)
-    * HTML (parse-html)
+    * HTML/XHTML+XML (parse-html/parse-tika)
-    * XML (parse-xml) uses XPath and namespaces to do the mapping between XML 
elements and Lucene fields. 
+    * XML (parse-tika/parse-feed) uses XPath and namespaces to do the mapping 
between XML elements and index fields. 
     * Java``Script (for extracting links only?) (parse-js)
-    * OpenOffice.org ODF (parse-oo) parses Open Office and Star Office 
documents.
+    * OpenOffice.org ODF (parse-tika) parses Open Office and Star Office 
documents.
-    * Microsoft Power Point, the .ppt file (parse-mspowerpoint)
+    * Microsoft Power Point, the .ppt file (parse-tika)
-    * Microsoft Word, the .doc file (parse-msword)
+    * Microsoft Word, the .doc file (parse-tika)
-    * Adobe PDF (parse-pdf)
+    * Adobe PDF (parse-tika)
-    * RSS (parse-rss)
+    * RSS (parse-feed/parse-tika)
-    * RTF (parse-rtf)
+    * RTF (parse-tika)
-    * MP3 (?) Is there any text in MP3? (parse-mp3) (JR: Sure, the mp3 itself 
contains the ID3v1 or ID3v2 tags which contain song information like
+    * MP3 (parse-tika) The MP3 file itself contains ID3v1 or ID3v2 tags, which 
hold song metadata such as
-      title, artist, album, comments, etc. The useful information needed to 
search mp3s)
+      title, artist, album and comments. This is the information needed to 
search MP3 files.
-    * ZIP (?) This seems to expand the zip of plain text files and return the 
concatenated text. (parse-zip)
+    * ZIP (parse-zip) This seems to expand the zip of plain text files and 
return the concatenated text. 
- 
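  
  As a rough sketch of the plugin.includes edit described above, an override in 
conf/nutch-site.xml might look like the following. The plugin list in the 
value is an illustrative assumption, not necessarily the shipped default; 
compare it with the plugin.includes entry in conf/nutch-default.xml of your 
release before using it. In this example parse-js and parse-zip are enabled 
next to parse-html and parse-tika:
  
```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed example value: only the parse-(...) group reflects the
       parser plugins named in the list above; adjust the rest to match
       your installation. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```
  
  The value is a regular expression matched against plugin ids, so adding or 
dropping a name inside the parse-(...) group is usually enough to switch a 
parser on or off.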
  
  == Questions without Answers ==
