Re: [Nutch-dev] New pdf and doc support

john Tue, 29 Jun 2004 12:37:04 -0700

On Mon, Jun 28, 2004 at 09:25:04PM -0700, Jacques Grove wrote:
> Hi all,
> 
> Great job on the new pdf and doc file support (which arrived just about
> a week before I wanted to start hacking on it).  Anyway, I have some
> comments, based on the intranet crawl/search I use nutch for.  Neither
> are directly nutch's fault, but I wanted to mention them for the record:
> 
> - The pdf engine nutch uses, PDFBox, doesn't do very well on a (largish)
> subset of real-world pdf files.  The most common errors I see are (from
> the crawler):


Yes, neither PDFBox nor POI can handle 100% of the *.pdf or *.doc files
out there. If there are better libs that people like, we can always switch.

However, you definitely want to make sure that file contents are not
truncated when crawled (by default, nutch truncates at 65536 bytes;
check ./conf/nutch-default.xml), since neither lib can currently deal
with incomplete files.
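
To see whether truncation is behind a given parse failure, a quick check
along the following lines may help. This is only a sketch, not nutch code:
the class and method names are made up for illustration, the limit is the
65536-byte default mentioned above, and the PDF check relies only on the
fact that a complete PDF normally ends with a "%%EOF" marker.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of nutch: guesses whether fetched
// content was cut off before it reached the parser.
public class TruncationCheck {
    // Default fetch limit quoted in the message (bytes); adjust to your config.
    static final int DEFAULT_LIMIT = 65536;

    // Content that is at least as long as the limit was probably truncated.
    static boolean hitLimit(byte[] content, int limit) {
        return content.length >= limit;
    }

    // A complete PDF normally carries "%%EOF" near its end; look for it
    // in the last 1024 bytes (fewer if the content is shorter).
    static boolean hasPdfEofMarker(byte[] content) {
        int tail = Math.min(content.length, 1024);
        String end = new String(content, content.length - tail, tail,
                StandardCharsets.ISO_8859_1);
        return end.contains("%%EOF");
    }

    public static void main(String[] args) {
        byte[] complete = "%PDF-1.4\n(body)\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] cut = new byte[DEFAULT_LIMIT]; // stands in for a truncated fetch

        System.out.println("complete: hitLimit=" + hitLimit(complete, DEFAULT_LIMIT)
                + " eof=" + hasPdfEofMarker(complete));
        System.out.println("cut:      hitLimit=" + hitLimit(cut, DEFAULT_LIMIT)
                + " eof=" + hasPdfEofMarker(cut));
    }
}
```

If a failing file hits the limit and lacks the marker, raising the limit in
your configuration is the first thing to try before blaming the parser.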

I am in the process of testing some finished code that allows
external programs to be used as parsers. This might add a little
flexibility.
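
For readers wondering what "external programs as parsers" could look like,
here is one possible shape. It is a generic sketch and not the code being
tested: the class name, the method, and the convention that the program
reads the document on stdin and writes plain text to stdout are all
assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: run an external command as a text extractor.
public class ExternalParserSketch {

    // Feeds `content` to the command's stdin and returns its stdout as text.
    // Assumes the command emits UTF-8 and exits with status 0 on success.
    static String extractText(List<String> command, byte[] content)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();

        // Write on a separate thread so a large document cannot deadlock
        // against a child that is blocked writing its own output.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = p.getOutputStream()) {
                stdin.write(content);
            } catch (IOException e) {
                // The child may exit before reading everything; the exit
                // status check below reports the failure.
            }
        });
        writer.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream stdout = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = stdout.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        writer.join();

        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("parser exited with status " + exit);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // "cat" stands in for a real converter and simply echoes its input.
        String text = extractText(Arrays.asList("cat"),
                "hello from stdin".getBytes(StandardCharsets.UTF_8));
        System.out.println(text);
    }
}
```

In practice the command would be whatever converter handles the file type
well on your system, which is where the extra flexibility comes from.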

John


