One thing: create a <nutch_home>/nutch-site.xml and put the plugin.includes property there instead of modifying nutch-default.xml. Values in nutch-site.xml override the defaults.
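
For illustration, a minimal nutch-site.xml along those lines might look like the sketch below. It only carries the plugin.includes value from Chris's message. The <nutch-conf> root element is an assumption based on Nutch releases from around this time; copy whatever root element your own nutch-default.xml uses.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides; nutch-default.xml stays untouched. -->
<!-- Assumption: the root element is <nutch-conf>. Check your nutch-default.xml
     and use the same root element it does. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Same value as in the quoted message: adds parse-msword and parse-pdf. -->
    <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
```

The value has to stay on a single line with no embedded whitespace, since it is read as one regular expression.
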
- Naomi

> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 02, 2005 1:38 PM
> To: [email protected]
> Subject: RE: How do I enable PDF/Word etc. parsing in nutch?
>
> Hi Jason,
>
> Step 1: edit <nutch_home>/nutch-default.xml and edit the following lines:
>
> <property>
>   <name>plugin.includes</name>
>
>   <!-- enable your plugins here -->
>
>   <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> Step 2: make sure that the plugin is built:
>
> From the <nutch_home> directory, perform the following:
>
> # ensure that the core classes are built
> % ant compile-core
>
> # ensure that the plugins are built
> % ant compile-plugins
>
> Note that the compile-plugins task assumes that your plugin build info is
> in <nutch_home>/src/plugin/build.xml, so if you're building a new plugin,
> you'll have to add the ant compile info there; just follow the examples of
> the other plugins.
>
> Step 3: you're done.
>
> Good luck.
>
> Thanks,
> Chris
>
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop: 171-246
> _______________________________________________________
>
> Disclaimer: The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
> > -----Original Message-----
> > From: Jason Manfield [mailto:[EMAIL PROTECTED]
> > Sent: Monday, May 02, 2005 10:24 AM
> > To: [email protected]
> > Subject: How do I enable PDF/Word etc. parsing in nutch?
