I have already read many articles about using Nutch to parse the content of
PDF documents, but I'm still confused. I tried modifying
\nutch-0.7.2\conf\nutch-default.xml as follows:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel|mspowerpoint|zip)|index-(basic|more)|query-(basic|site|url)|language-identifier|clustering-carrot2</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
I also added the following to /nutch-0.7.2/build.xml:
<project name="parse-pdf" default="jar-core">
<import file="../build-plugin.xml"/>
<!-- Build compilation dependencies -->
<target name="deps-jar">
<ant target="jar" inheritall="false" dir="../lib-log4j"/>
<ant target="jar" inheritall="false" dir="../lib-fontbox"/>
</target>
<!-- Add compilation dependencies to classpath -->
<path id="plugin.deps">
<fileset dir="${nutch.root}/build">
<include name="**/lib-log4j/*.jar" />
<include name="**/lib-fontbox/*.jar" />
</fileset>
</path>
<!-- Deploy Unit test dependencies -->
<target name="deps-test">
<ant target="deploy" inheritall="false" dir="../lib-log4j"/>
<ant target="deploy" inheritall="false" dir="../lib-fontbox"/>
<ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
<ant target="deploy" inheritall="false" dir="../protocol-file"/>
</target>
<!-- for junit test -->
<mkdir dir="${build.test}/data"/>
<copy file="sample/pdftest.pdf" todir="${build.test}/data"/>
</project>
I also put FontBox-0.1.0-dev.jar and PDFBox-0.7.3-dev-20060901.jar into
\plugins\parse-pdf and modified plugin.xml as follows:
<plugin
id="lib-fontbox"
name="FontBox"
version="0.1.0-dev"
provider-name="org.fontbox">
<runtime>
<library name="FontBox-0.1.0-dev.jar">
<export name="*"/>
</library>
</runtime>
</plugin>
<plugin
id="parse-pdf"
name="Pdf Parse Plug-in"
version="1.0.0"
provider-name="nutch.org">
<runtime>
<library name="parse-pdf.jar">
<export name="*"/>
</library>
<library name="PDFBox-0.7.3-dev-20060901.jar"/>
<library name="log4j-1.2.9.jar"/>
<library name="FontBox-0.1.0-dev.jar"/>
</runtime>
<extension id="org.apache.nutch.parse.pdf"
name="PdfParse"
point="org.apache.nutch.parse.Parser">
<implementation id="org.apache.nutch.parse.pdf.PdfParser"
class="org.apache.nutch.parse.pdf.PdfParser"
contentType="application/pdf"
pathSuffix=""/>
</extension>
</plugin>
But when I run ./nutch crawl, I get messages like:
"fetch okay, but can't parse http://(omit...).pdf, reason: failed <omit..>
content truncated at 70709 bytes. Parser can't handle incomplete pdf file."
Could someone explain in detail how to get PDF parsing working? Thanks.
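
For reference, the "content truncated" part of the message suggests the fetcher cut the download short before the PDF parser received it. Nutch has an http.content.limit property for the maximum number of bytes fetched per document, so one thing to try is overriding it in conf/nutch-site.xml. This is only a sketch, not a confirmed fix: the value -1 is an assumption based on the property's description (a negative value should disable truncation), and a sufficiently large byte count should work as well. The fragment goes inside the file's existing root element:

```xml
<!-- Sketch of an override for conf/nutch-site.xml (assumed location). -->
<!-- Place it inside the root element that the file already has.      -->
<property>
  <name>http.content.limit</name>
  <!-- Assumption: a negative value means "do not truncate". -->
  <value>-1</value>
  <description>Maximum number of bytes to download per document over
  HTTP. PDF files larger than this limit reach the parser incomplete.
  </description>
</property>
```

If the PDFs are fetched through another protocol plugin (for example protocol-file), the matching limit for that protocol would presumably need the same change.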
--
View this message in context:
http://www.nabble.com/Could-anyone-teache-me-how-to-index--the-title-or-content-of-PDF--tf2203822.html#a6102866
Sent from the Nutch - User forum at Nabble.com.