Two plugins, parse-pdf and parse-msword, already do this job. They are
normally compiled and available in your ${NUTCH_HOME}/plugins directory.
What you must not forget is to include them when crawling documents.
This is done with the plugin.includes property in your
${NUTCH_HOME}/conf/nutch-site.xml file. Here's an example of this
property. As you can see, both parse-pdf and parse-msword are included.

<property>
        <name>plugin.includes</name>
        <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
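
Since the plugin.includes value is a regular expression matched against
plugin ids, a quick sanity check can be done outside Nutch. The Python
sketch below is only an illustration: plugin_includes and is_included
are hypothetical helpers of mine, not part of Nutch, and the sketch
assumes that a full match with Python's re module behaves the same as
Nutch's Java regex matching for a simple pattern like this one. The
helper also strips whitespace from the value as a convenience, which
says nothing about how Nutch itself treats a wrapped value.

```python
import re
import xml.etree.ElementTree as ET


def plugin_includes(xml_text):
    """Return the plugin.includes value from nutch-site.xml style text, or None.

    Hypothetical helper for illustration; not part of Nutch.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            # Drop all whitespace so a hard-wrapped value still forms one regex.
            return "".join(prop.findtext("value", "").split())
    return None


def is_included(pattern, plugin_id):
    """True if the whole plugin id matches the include pattern."""
    return re.fullmatch(pattern, plugin_id) is not None


# Sample configuration using the same value as the property shown above.
SAMPLE = """<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
</configuration>"""

pattern = plugin_includes(SAMPLE)
for plugin_id in ("parse-pdf", "parse-msword", "parse-rss"):
    print(plugin_id, is_included(pattern, plugin_id))
```

To check your own setup, read ${NUTCH_HOME}/conf/nutch-site.xml into a
string and pass it to plugin_includes in place of SAMPLE.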


David



-----Original Message-----
From: plat hpc [mailto:[EMAIL PROTECTED] 
Sent: mardi, 10. juin 2008 07:17
To: [email protected]
Subject: How to crawl pdf?

Hi,

I have followed the tutorial to setup my Nutch, up and running.
Currently it is able to crawl php files, but not the pdf files.

Can anyone please advise how can I setup or configure to make it crawl
onto pdf and word docs?

Thanks.
