RE: How do I enable PDF/Word etc. parsing in nutch?

Chris Mattmann Mon, 02 May 2005 10:39:11 -0700

Hi Jason,

 Step 1: open <nutch_home>/nutch-default.xml and edit the plugin.includes
 property so that its value lists the parsers you want (msword and pdf here):


<property>
  <name>plugin.includes</name>

  <!-- enable your plugins here -->
  <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
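
If you want to sanity-check the expression before you crawl, here is a rough
sketch in Python (not part of Nutch). It reads plugin.includes from an inline
copy of the property and reports which plugin directory names it would pick
up. Two assumptions on my part: the property sits under a <configuration>
root, and the expression has to match the whole directory name. The names in
the loop are just examples to test against.

```python
import re
import xml.etree.ElementTree as ET

# Inline copy of the property from Step 1, wrapped in a <configuration> root
# (assumption: the real file is well-formed XML with <property> children).
CONFIG = """<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
</configuration>"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from the config text, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return prop.findtext("value").strip()
    return None


def is_enabled(pattern, plugin_dir):
    """Assumption: the expression must match the whole directory name."""
    return re.fullmatch(pattern, plugin_dir) is not None


pattern = plugin_includes(CONFIG)
# Example directory names only; check against your own src/plugin listing.
for name in ["parse-pdf", "parse-msword", "parse-html", "parse-ext", "protocol-ftp"]:
    print(name, "enabled" if is_enabled(pattern, name) else "excluded")
```

A stray line break or backslash inside <value> changes the expression, so it
is worth keeping the value on a single line.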

 Step 2: make sure that the plugins are built:

  From the <nutch_home> directory, perform the following:
 
  # ensure that the core classes are built
  % ant compile-core

  # ensure that the plugins are built
  % ant compile-plugins

Note that the compile-plugins task assumes that your plugin's build info is
in <nutch_home>/src/plugin/build.xml. If you're building a new plugin,
you'll have to add its ant compile info there; just follow the examples of
the other plugins.

 Step 3: you're done.


Good luck.


Thanks,
  Chris



______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246

_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Jason Manfield [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 02, 2005 10:24 AM
> To: [email protected]
> Subject: How do I enable PDF/Word etc. parsing in nutch?
> 
> 
