[Nutch-dev] Re: PDF Parsing Revisited

Andy Liu Tue, 29 Mar 2005 12:53:29 -0800

We've been using pdftotext / parse-ext also.  It works well.  

We also ended up using pdftotext's -htmlmeta option so we could parse out
the PDF's title from the resulting HTML.  In some cases, where the
title cannot be parsed out of the PDF file, we use anchor text as the
page's title instead.
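
A minimal sketch of that title fallback, for illustration only. It assumes
pdftotext is on the PATH and that -htmlmeta puts the PDF's title in a
<title> element; the function names (pdf_to_html, extract_title,
choose_title) are hypothetical and not part of Nutch or parse-ext.

```python
import subprocess
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def pdf_to_html(pdf_path):
    """Run pdftotext with -htmlmeta and return the HTML it writes to stdout.

    Assumption: pdftotext (xpdf) is installed; "-" as the output file
    name sends the result to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-htmlmeta", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_title(html):
    """Return the whitespace-normalized <title> text, or None if empty/missing."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None


def choose_title(html, anchor_text):
    """Prefer the title parsed from the PDF; otherwise fall back to anchor text."""
    return extract_title(html) or anchor_text
```

Usage would be along the lines of choose_title(pdf_to_html("doc.pdf"),
anchor_text), where anchor_text comes from whatever link pointed at the PDF.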


On Tue, 29 Mar 2005 12:33:40 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> I'm currently using xpdf's pdftotext program to parse pdf, via the
> parse-ext plugin.  It seems much faster than PDFBox.
> 
> To try it, copy the attached plugin.xml file to
> 
>    build/plugins/parse-ext/plugin.xml
> 
> then copy the attached parse-pdf.sh script to
> 
>    bin/parse-pdf.sh
> 
> and make it executable
> 
>    chmod +x bin/parse-pdf.sh
> 
> finally, include the parse-ext plugin in your nutch-site.xml.
> 
> What do you think?
> 
> Doug
> 
> 
>


