[Nutch-dev] Alternative Content Types

Mike Richmond Tue, 07 Dec 2004 08:40:28 -0800

To Whom It May Concern:

I am a Java developer looking to get involved with a project. I came across your site and noticed that PDF parsing gets a lot of attention there, so I'm curious why it has not yet been added to Nutch. There seem to be a number of open source PDF parsers available:

PDFBox (http://pdfbox.org)

XPDF (http://www.foolabs.com/xpdf/)

Pdftohtml (http://pdftohtml.sourceforge.net)

Etc…

Is there a reason that these are not used, or are you just waiting for someone to implement it?

Regards,

Mike Richmond
