Two plugins, parse-pdf and parse-msword, already do this job. They are
normally already compiled and available in your ${NUTCH_HOME}/plugins
directory. What you must not forget is to include them when crawling
documents. This is done with the plugin.includes property in your
${NUTCH_HOME}/conf/nutch-site.xml file. Here's an example of this
property. As you can see, both parse-pdf and parse-msword are included.
<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
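
If you want to double-check which plugin ids a given value would enable, here is a rough sketch in Python. It assumes that Nutch treats plugin.includes as a regular expression that must match the whole plugin id, and that Python's regex syntax behaves like Java's for an expression this simple. The helper name is_included is mine, not part of Nutch, and parse-rtf is only there as an example of an id that is not listed.

```python
import re

# Value of plugin.includes from the example, joined onto one line.
PLUGIN_INCLUDES = (
    "protocol-(httpclient|file)|urlfilter-(regex)|"
    "parse-(text|html|js|pdf|msword)|index-(basic)|"
    "query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression.

    Assumption: Nutch requires a full match of the id, not a substring match.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Print whether each id would be enabled by the expression above.
    for plugin_id in ("parse-pdf", "parse-msword", "parse-html", "parse-rtf"):
        print(plugin_id, is_included(plugin_id))
```

If parse-pdf or parse-msword comes back as not included for your own value, the usual culprit is a line break or stray space inside the <value> element.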
David
-----Original Message-----
From: plat hpc [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 10 June 2008 07:17
To: [email protected]
Subject: How to crawl pdf?
Hi,
I have followed the tutorial to set up Nutch, and it is up and running.
Currently it is able to crawl PHP files, but not PDF files.
Can anyone please advise how I can set up or configure it to crawl PDF
and Word docs?
Thanks.