FYI, Ben Litchfield from PDFBox has fixed the bug: http://sourceforge.net/tracker/index.php?func=detail&aid=1172740&group_id=78314&atid=552832
You can check out the latest version of PDFBox from CVS, and you shouldn't have any problems with it hanging fetcher or parser threads. I've tested it on about 150,000 PDF files without any problems. It is a bit slower than using xpdf and parse-ext, but it's nice having an all-Java solution.

Should we check in the development jar of PDFBox, or wait until Ben comes out with the next official release?

Andy

On Mar 29, 2005 3:52 PM, Andy Liu <[EMAIL PROTECTED]> wrote:
> We've been using pdftotext / parse-ext also. It works well.
>
> We also ended up using pdftotext's -htmlmeta option so we could parse out
> the PDF's title from the resulting HTML. In some cases, where the
> title cannot be parsed out of the PDF file, we use anchor text as the
> page's title instead.
>
> On Tue, 29 Mar 2005 12:33:40 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > I'm currently using xpdf's pdftotext program to parse PDF, via the
> > parse-ext plugin. It seems much faster than PDFBox.
> >
> > To try it, copy the attached plugin.xml file to
> >
> >   build/plugins/parse-ext/plugin.xml
> >
> > then copy the attached parse-pdf.sh script to
> >
> >   bin/parse-pdf.sh
> >
> > and make it executable
> >
> >   chmod +x bin/parse-pdf.sh
> >
> > finally, include the parse-ext plugin in your nutch-site.xml.
> >
> > What do you think?
> >
> > Doug

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
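The title handling described in the quoted message (run pdftotext with -htmlmeta, take the title from the resulting HTML, fall back to anchor text) could look roughly like the sketch below. This is not the code from our setup or from the attached parse-pdf.sh. The function names, the 30-second timeout, the UTF-8 decoding, and the assumption that the title arrives in a `<title>` element are all illustrative choices.

```python
import re
import subprocess
from html import unescape
from typing import Optional

# Assumption: the HTML produced with -htmlmeta carries the title in <title>.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def pdf_to_html(pdf_path: str, timeout_s: float = 30.0) -> str:
    """Run pdftotext -htmlmeta and return its output ("-" means stdout).

    The timeout is a hypothetical guard so that one bad PDF cannot
    hang the calling thread indefinitely.
    """
    result = subprocess.run(
        ["pdftotext", "-htmlmeta", pdf_path, "-"],
        capture_output=True,
        check=True,
        timeout=timeout_s,
    )
    # Assumption: treat the output as UTF-8 and replace undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")


def extract_title(html: str) -> Optional[str]:
    """Return the whitespace-normalized <title> text, or None if absent or empty."""
    match = _TITLE_RE.search(html)
    if not match:
        return None
    title = " ".join(unescape(match.group(1)).split())
    return title or None


def choose_title(html: str, anchor_text: str) -> str:
    """Prefer the title parsed from the HTML; otherwise use the anchor text."""
    return extract_title(html) or anchor_text.strip()
```

In use, something like `choose_title(pdf_to_html("doc.pdf"), anchor)` would give the page title, where `anchor` is whatever anchor text the crawler recorded for the link to that PDF.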
