Hi,

I'm evaluating Nutch as a search platform for a large Icelandic website. The site has a fairly large collection of Adobe Acrobat (PDF) documents stored on a Lotus Domino server.
I run Nutch with:

    ./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/ -depth 9999 -topN 9999

Out of 3,072 PDF documents fetched by Nutch, 1,687 returned the following error:

    Error parsing: http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf

With best regards,
--
Sævaldur Arnar Gunnarsson
System Administrator | RHCE
Hugsmiðja ehf.
Snorrabraut 56 | 105 Reykjavík
S: 550 0900 | G: 659 0007