Hi Jason,

Step 1: open <nutch_home>/nutch-default.xml and edit the following property:

  <property>
    <name>plugin.includes</name>
    <!-- enable your plugins here -->
    <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins.
    </description>
  </property>

Step 2: make sure that the plugin is built. From the <nutch_home> directory, run the following:

  # ensure that the core classes are built
  % ant compile-core

  # ensure that the plugins are built
  % ant compile-plugins

Note that the compile-plugins task assumes that your plugin's build info is in <nutch_home>/src/plugin/build.xml, so if you're building a new plugin, you'll have to add its ant compile info there. Just follow the examples of the other plugins.

Step 3: you're done.

Good luck.

Thanks,
Chris

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: Jason Manfield [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 02, 2005 10:24 AM
> To: [email protected]
> Subject: How do I enable PDF/Word etc. parsing in nutch?
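
A quick way to sanity-check the plugin.includes value from Step 1 before rebuilding is to test it against the plugin directory names you expect to be enabled. The sketch below is only an illustration, not part of Nutch: it assumes the value is applied as a full-match regular expression against each plugin directory name (as the property description suggests), and the helper name enabled_plugins and the sample directory list are hypothetical. It also shows why whitespace picked up from line wrapping inside the value matters: a stray space before a "|" becomes part of the preceding alternative, so parse-pdf would no longer match.

```python
import re

# The plugin.includes value from Step 1, written as one regex with no whitespace.
PLUGIN_INCLUDES = (
    r"protocol-(http|file)|urlfilter-regex|"
    r"parse-(text|html|rss|msword|pdf)|"
    r"index-basic|query-(basic|site|url)"
)


def enabled_plugins(pattern, plugin_dirs):
    """Return the directory names that the pattern matches in full.

    Hypothetical helper for illustration; assumes full-match semantics.
    """
    regex = re.compile(pattern)
    return [name for name in plugin_dirs if regex.fullmatch(name)]


if __name__ == "__main__":
    # Hypothetical directory listing, for illustration only.
    candidates = ["protocol-http", "parse-pdf", "parse-msword",
                  "parse-ext", "index-more"]
    print(enabled_plugins(PLUGIN_INCLUDES, candidates))

    # Same value with a stray space before "|index-basic", as can happen
    # when the long <value> line gets wrapped in an editor or mail client.
    broken = PLUGIN_INCLUDES.replace("pdf)|", "pdf) |")
    print(enabled_plugins(broken, ["parse-pdf", "index-basic"]))
```

If a plugin you expect is missing from the first list, check the spelling of its directory name under <nutch_home>/src/plugin and make sure the <value> element sits on a single line.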
