I'm currently using xpdf's pdftotext program to parse PDF files, via the parse-ext plugin. It seems much faster than PDFBox.

To try it, copy the attached plugin.xml file to

  build/plugins/parse-ext/plugin.xml

then copy the attached parse-pdf.sh script to

  bin/parse-pdf.sh

and make it executable:

  chmod +x bin/parse-pdf.sh

Finally, include the parse-ext plugin in your nutch-site.xml.
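
Including the plugin normally means adding it to the plugin.includes property. Here is a minimal sketch of that property, assuming your nutch-site.xml overrides plugin.includes; the other plugin names in the value are only illustrative and should be replaced with whatever you already run. Only the "ext" alternative in the parse group is what matters here.

```xml
<!-- Sketch only: goes inside the root element of nutch-site.xml. -->
<!-- The plugin list is an assumed example; keep your own entries and add "ext". -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|ext)|index-basic|query-(basic|site|url)</value>
</property>
```

The value is a regular expression matched against plugin ids, so "parse-(text|html|ext)" enables parse-text, parse-html and parse-ext together.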

What do you think?

Doug

Attachment: plugin.xml
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-ext"
   name="External Parser Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <extension-point
      id="org.apache.nutch.parse.Parser"
      name="Nutch Content Parser"/>

   <runtime>
      <library name="parse-ext.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="org.apache.nutch.parse.ext"
              name="ExtParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="ExtParser"
                      class="org.apache.nutch.parse.ext.ExtParser"
                      contentType="application/pdf"
                      pathSuffix=""
                      command="bin/parse-pdf.sh"
                      timeout="30"/>

   </extension>

</plugin>

Attachment: parse-pdf.sh
Description: application/shellscript
