[Nutch-dev] New pdf and doc support

Jacques Grove Tue, 29 Jun 2004 11:30:35 -0700

Hi all,

Great job on the new pdf and doc file support (which arrived just about
a week before I wanted to start hacking on it).  Anyway, I have some
comments, based on the intranet crawl/search I use nutch for.  Neither
is directly nutch's fault, but I wanted to mention them for the record:


- The pdf engine nutch uses, PDFBox, doesn't do very well on a (largish)
subset of real-world pdf files.  The most common errors I see are (from
the crawler):

040628 141423 fetch of http://foo.com/foo.pdf failed with:
net.nutch.parse.ParseException: Can't be handled as pdf document.
java.io.IOException: Error: No 'ToUnicode' and no 'Encoding' for Font

and

040628 141440 fetch of http://foo.com/bar.pdf failed with:
net.nutch.parse.ParseException: Can't be handled as pdf document.
java.io.IOException: expected='endobj' actual=''
[EMAIL PROTECTED]

These two types of errors affect about 15% of all my pdf files.


- The Apache "poi" engine to parse msword files has similar problems
with a subset of my msword files.  I see fetcher errors such as:

040628 141601 fetch of http://foo.com/foo.doc failed with:
net.nutch.parse.ParseException: Can't be handled as msword document.
java.util.NoSuchElementException

These files all open fine in MS Office and OpenOffice.

I have errors like these on about 10% of my msword files.

So, I'd guess that it will still take a bit of work from our PDFBox and
Apache-poi friends until everything works near-100%.  BTW, I'm happy
with the figures above, which are good enough for my uses.
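
For anyone who wants to get comparable figures from their own crawl, here
is a rough sketch of how the failures could be tallied from the crawler
output.  It is not part of nutch; the function name and the regular
expressions are my own assumptions, based only on the three-line log
pattern quoted above (a "fetch of ... failed with:" line, then the
ParseException line, then the underlying cause).

```python
import re
from collections import Counter

# Assumed log layout, taken from the excerpts quoted in this message:
#   <timestamp> fetch of <url> failed with:
#   net.nutch.parse.ParseException: Can't be handled as <type> document.
#   <underlying exception line>
FETCH_FAILED = re.compile(r"fetch of \S+ failed with:")
PARSE_ERROR = re.compile(r"ParseException: Can't be handled as (\w+) document")


def tally_parse_failures(lines):
    """Count parse failures keyed by (document type, underlying cause line)."""
    counts = Counter()
    doc_type = None
    saw_fetch_failure = False
    for line in lines:
        line = line.strip()
        if FETCH_FAILED.search(line):
            # Start of a new failure report.
            saw_fetch_failure = True
            doc_type = None
        elif saw_fetch_failure and doc_type is None:
            # The line right after the fetch failure should name the type.
            match = PARSE_ERROR.search(line)
            doc_type = match.group(1) if match else None
            saw_fetch_failure = match is not None
        elif doc_type is not None:
            # The line after the ParseException is treated as the cause.
            counts[(doc_type, line)] += 1
            doc_type = None
            saw_fetch_failure = False
    return counts


SAMPLE_LOG = """\
040628 141423 fetch of http://foo.com/foo.pdf failed with:
net.nutch.parse.ParseException: Can't be handled as pdf document.
java.io.IOException: Error: No 'ToUnicode' and no 'Encoding' for Font
040628 141601 fetch of http://foo.com/foo.doc failed with:
net.nutch.parse.ParseException: Can't be handled as msword document.
java.util.NoSuchElementException
"""

if __name__ == "__main__":
    for (kind, cause), n in tally_parse_failures(SAMPLE_LOG.splitlines()).most_common():
        print(n, kind, cause)
```

Dividing each count by the number of pdf or msword urls fetched gives a
failure rate per cause, which is probably the most useful thing to pass
on to the PDFBox and poi folks.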


Keep up the good work!

Jacques


