Hi all,

Great job on the new PDF and DOC file support (which arrived just about a week before I wanted to start hacking on it). I have some comments based on the intranet crawl/search I use Nutch for. Neither issue is directly Nutch's fault, but I wanted to mention them for the record:
- The PDF engine Nutch uses, PDFBox, fails on a fairly large subset of real-world PDF files. The most common errors I see in the crawler output are:

    040628 141423 fetch of http://foo.com/foo.pdf failed with: net.nutch.parse.ParseException: Can't be handled as pdf document. java.io.IOException: Error: No 'ToUnicode' and no 'Encoding' for Font

  and

    040628 141440 fetch of http://foo.com/bar.pdf failed with: net.nutch.parse.ParseException: Can't be handled as pdf document. java.io.IOException: expected='endobj' actual='' [EMAIL PROTECTED]

  These two types of error affect about 15% of all my PDF files.

- The Apache POI engine used to parse MS Word files has similar problems with a subset of my Word files. I see fetcher errors such as:

    040628 141601 fetch of http://foo.com/foo.doc failed with: net.nutch.parse.ParseException: Can't be handled as msword document. java.util.NoSuchElementException

  These files all open fine in MS Office and OpenOffice. I get errors like these on about 10% of my Word files.

So I'd guess it will take a bit more work from our PDFBox and Apache POI friends before everything works close to 100% of the time.

BTW, I'm happy with the figures above, which are good enough for my uses. Keep up the good work!

Jacques

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
