I am crawling pages using the following commands in a loop that iterates 10 times:

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg1=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $seg1 -threads 50
    bin/nutch updatedb crawl/crawldb $seg1

Whenever the crawl hits non-HTML content, I get errors like the following:

    Error parsing: http://policydep/cmm.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf

How can I make Nutch parse this type of content while crawling? And if I run the fetch in non-parsing mode, how can I parse those pages later and update the results in the "crawl" folder? Please help.
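For context, the whole loop is roughly equivalent to the bash sketch below (the for/seq wrapper is only there to illustrate the 10 iterations; the four nutch commands are exactly the ones shown above):

    #!/bin/bash
    # Repeat the generate -> fetch -> updatedb cycle 10 times
    for i in $(seq 1 10); do
        # Generate a new segment with the top 1000 URLs from the crawldb
        bin/nutch generate crawl/crawldb crawl/segments -topN 1000
        # Pick the segment that was just created (the newest directory)
        seg1=`ls -d crawl/segments/* | tail -1`
        # Fetch the segment with 50 threads
        bin/nutch fetch $seg1 -threads 50
        # Update the crawl database with the fetched segment
        bin/nutch updatedb crawl/crawldb $seg1
    done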
