[Nutch-dev] Re: PDF Parsing Revisited

Andy Liu Fri, 01 Apr 2005 07:21:35 -0800

FYI, Ben Litchfield from PDFBox has fixed the bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=1172740&group_id=78314&atid=552832


You can check out the latest version of PDFBox from CVS, and it should
no longer hang fetcher / parser threads.  I've tested it on about
150,000 PDF files and didn't have any problems.

It is a bit slower than using xpdf and parse-ext, but it's nice having
an all-Java solution.

Should we check in the development jar of PDFBox, or wait until Ben
comes out with the next official release?

Andy

On Mar 29, 2005 3:52 PM, Andy Liu <[EMAIL PROTECTED]> wrote:
> We've been using pdftotext / parse-ext also.  It works well.
> 
> We also ended up using pdftotext's -htmlmeta option so we could parse out
> the PDF's title from the resulting HTML.  In some cases, where the
> title cannot be parsed out of the PDF file, we use anchor text as the
> page's title instead.
> 
> On Tue, 29 Mar 2005 12:33:40 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > I'm currently using xpdf's pdftotext program to parse pdf, via the
> > parse-ext plugin.  It seems much faster than PDFBox.
> >
> > To try it, copy the attached plugin.xml file to
> >
> >    build/plugins/parse-ext/plugin.xml
> >
> > then copy the attached parse-pdf.sh script to
> >
> >    bin/parse-pdf.sh
> >
> > and make it executable
> >
> >    chmod +x bin/parse-pdf.sh
> >
> > finally, include the parse-ext plugin in your nutch-site.xml.
> >
> > What do you think?
> >
> > Doug
> >
> >
> >
>
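
For anyone who wants to try the title fallback described in the quoted
message, here is a rough sketch of the idea, not the code we actually
run.  It assumes pdftotext is on the PATH, that "-" as the output file
sends the result to stdout, and that the -htmlmeta output carries the
title in a <title> element.  The class and method names (PdfTitle,
toHtml, chooseTitle) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Nutch or parse-ext.
final class PdfTitle {
    // Assumption: the -htmlmeta output contains a single <title> element.
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Runs "pdftotext -htmlmeta -enc UTF-8 <pdf> -" and returns its stdout.
    // Requires Java 9+ for InputStream.readAllBytes().
    static String toHtml(String pdfPath) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("pdftotext", "-htmlmeta", "-enc", "UTF-8", pdfPath, "-")
            .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the HTML
            .start();
        String html = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            throw new IOException("pdftotext failed for " + pdfPath);
        }
        return html;
    }

    // Returns the non-empty <title> text if present, otherwise the anchor text.
    // HTML entities in the title are left as-is.
    static String chooseTitle(String html, String anchorText) {
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            String title = m.group(1).trim();
            if (!title.isEmpty()) {
                return title;
            }
        }
        return anchorText;
    }
}
```

Usage would be something like
PdfTitle.chooseTitle(PdfTitle.toHtml("doc.pdf"), anchorText), where
anchorText is whatever inlink text is available for the page.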

