Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by JeffRitchie:
http://wiki.apache.org/nutch/nutch-0%2e8-dev/bin/nutch_parse

------------------------------------------------------------------------------
   None.
  
  === Caveats and Notes ===
-  None.
+  The parser relies on plugins to parse the different document types fetched during a crawl.  The supported content types and the plugins that handle them are as follows:[[BR]][[BR]]
+ 
+  ||'''Content-type'''||'''Plugin'''||'''Notes'''||
+  ||'''text/html'''||parse-html||Parses HTML documents using NekoHTML or !TagSoup||
+  ||'''application/x-javascript'''||parse-js||Parses !JavaScript documents (.js)||
+  ||'''audio/mpeg'''||parse-mp3||Parses MP3 audio documents (.mp3)||
+  ||'''application/vnd.ms-excel'''||parse-msexcel||Parses MSExcel documents (.xls)||
+  ||'''application/vnd.ms-powerpoint'''||parse-mspowerpoint||Parses MSPower!Point documents||
+  ||'''application/msword'''||parse-msword||Parses MSWord documents||
+  ||'''application/rss+xml'''||parse-rss||Parses RSS documents (.rss)||
+  ||'''application/rtf'''||parse-rtf||Parses RTF documents (.rtf)||
+  ||'''application/pdf'''||parse-pdf||Parses PDF documents||
+  ||'''application/x-shockwave-flash'''||parse-swf||Parses Flash documents (.swf)||
+  ||'''text/plain'''||parse-text||Parses text documents (.txt)||
+  ||'''application/zip'''||parse-zip||Parses Zip documents (.zip)||
+  ||'''other types'''||parse-ext||Parses documents with external commands, selected by content-type or pathSuffix||
+ 
+ By default, only the parse-text, parse-html and parse-js plugins are enabled.  The other plugins must be enabled in nutch-site.xml.
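
As a rough illustration of enabling extra parsers, the fragment below sketches a plugin.includes override that could be placed inside the root element of nutch-site.xml. It assumes the plugin list is controlled by the plugin.includes property, as in nutch-default.xml. The non-parser entries shown are an assumption and should be copied from your own nutch-default.xml; parse-pdf and parse-msword are added here only as an example.

```xml
<!-- Sketch only: goes inside the root element of nutch-site.xml. -->
<!-- The non-parser plugins listed are assumed; copy the real list -->
<!-- from the plugin.includes value in your nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-(text|html|js) is the default set; pdf and msword are added -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin names, so any other parser from the table can be added to the parse-(...) alternation in the same way.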
  
  DevelopmentCommandLineOptions
  
