Could anyone teach me how to index the title or content of a PDF?

Frank Huang Fri, 01 Sep 2006 10:18:22 -0700

I have already read many articles about using Nutch to parse the content
of PDF documents, but I'm still confused. I modified
\nutch-0.7.2\conf\nutch-default.xml as follows:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

I also added the following to /nutch-0.7.2/build.xml:

<project name="parse-pdf" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-log4j"/>
    <ant target="jar" inheritall="false" dir="../lib-fontbox"/> 
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-log4j/*.jar" />
      <include name="**/lib-fontbox/*.jar" />  
    </fileset>
  </path>
  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-log4j"/>
    <ant target="deploy" inheritall="false" dir="../lib-fontbox"/>  
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
  </target>
  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="sample/pdftest.pdf" todir="${build.test}/data"/>
</project>

I also put FontBox-0.1.0-dev.jar and PDFBox-0.7.3-dev-20060901.jar into
\plugins\parse-pdf and modified plugin.xml as follows:

<plugin
   id="lib-fontbox"
   name="FontBox"
   version="0.1.0-dev"
   provider-name="org.fontbox">
   <runtime>
     <library name="FontBox-0.1.0-dev.jar">
        <export name="*"/>
     </library>
   </runtime>
</plugin>

<plugin
   id="parse-pdf"
   name="Pdf Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">


   <runtime>
      <library name="parse-pdf.jar">
         <export name="*"/>
      </library>
      <library name="PDFBox-0.7.3-dev-20060901.jar"/>
      <library name="log4j-1.2.9.jar"/>
      <library name="FontBox-0.1.0-dev.jar"/>
   </runtime>

   <extension id="org.apache.nutch.parse.pdf"
              name="PdfParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="org.apache.nutch.parse.pdf.PdfParser"
                      class="org.apache.nutch.parse.pdf.PdfParser"
                      contentType="application/pdf"
                      pathSuffix=""/>

   </extension>

</plugin>
 
But when I run ./nutch crawl, I get messages like "fetch okay, but can't
parse http://(omit...).pdf, reason: failed <omit..> content truncated at
70709 bytes. Parser can't handle incomplete pdf file."
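
A possibly related setting, since the message complains about truncated
content: Nutch has an http.content.limit property (65536 bytes by default
in nutch-default.xml) that caps how much of each document is downloaded.
The fragment below is only a sketch of an override, assuming local
overrides belong in conf/nutch-site.xml and that this limit is really what
is cutting the PDF short, which has not been confirmed here:

<!-- Assumption: placed inside the existing root element of conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value disables truncation; a larger positive byte count would also raise the cap -->
  <value>-1</value>
</property>

If the PDFs are fetched from the local file system instead of over HTTP,
file.content.limit is the corresponding property.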

Could someone explain in detail how to get this working, as soon as
possible? Thanks.
-- 
View this message in context: 
http://www.nabble.com/Could-anyone-teache-me-how-to-index--the-title-or-content-of-PDF--tf2203822.html#a6102866
Sent from the Nutch - User forum at Nabble.com.
