Re: How to parse PDF files? Deferred parsing possible?

Doğacan Güney Wed, 30 May 2007 23:10:14 -0700

On 5/31/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:

I am crawling pages using the following commands in a loop that runs 10 times:


   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
   seg1=`ls -d crawl/segments/* | tail -1`
   bin/nutch fetch $seg1 -threads 50
   bin/nutch updatedb crawl/crawldb $seg1

I am getting the following errors whenever it tries to parse non-HTML content.

Error parsing: http://policydep/cmm.pdf: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/pdf


Add plugin parse-pdf to your config (plugin.includes property).
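
For illustration, the override might look roughly like the fragment below in conf/nutch-site.xml. This is only a sketch: the rest of the plugin list is an assumption modelled on a typical default, so copy the value from your own nutch-default.xml and just add "pdf" to the parse-(...) group.

```xml
<!-- conf/nutch-site.xml: overrides the plugin.includes value from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Illustrative value (assumed): keep your existing list and add "pdf" to the parse-(...) group -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```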


How can I make it parse this type of content while crawling?

And if I run the fetch in non-parsing mode, how can I make it parse
the content later and update the results in the "crawl" folder?
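
One possible shape for such a deferred-parsing round is sketched below. It assumes the fetcher in your Nutch version accepts a -noParsing flag and that a separate "bin/nutch parse <segment>" command is available; check the usage output of bin/nutch before relying on either. NUTCH, CRAWL_DIR and crawl_round are illustrative names, not part of Nutch.

```shell
#!/bin/sh
# Sketch: one crawl round with parsing deferred to its own step.
# NUTCH and CRAWL_DIR are illustrative variables, not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

crawl_round() {
    # Select the next batch of URLs into a new segment.
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN 1000 || return 1
    # Pick the newest segment (timestamp-style names sort last).
    seg=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)
    # Fetch only; -noParsing (assumed available) stores content without parsing it.
    "$NUTCH" fetch "$seg" -threads 50 -noParsing || return 1
    # Parse the stored content afterwards, with the needed parse plugins enabled.
    "$NUTCH" parse "$seg" || return 1
    # Fold the fetch and parse results back into the crawldb.
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$seg"
}
```

Calling crawl_round ten times in a loop would mirror the original setup. The parse step comes before updatedb so that links discovered during parsing can be added to the crawldb.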

Please help.



--
Doğacan Güney
