Hope someone can help. I'd like to index and search only a single
directory of my website, but it doesn't work so far (neither building
the index nor subsequent searches succeed). Here's my config :-
URL of the files to index : http://localhost:8080/mytest/filestore
a) Under the nutch root directory (i.e. ~/nutch), I created a file
urls/mytest that contains just this entry :-
http://localhost:8080/mytest/filestore
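For reference, step (a) amounts to the following (a minimal sketch, run from the Nutch root, i.e. ~/nutest in the post's layout would be ~/nutch; directory and file names are the ones given above):

```shell
# create the seed-list directory and the one-line seed file from step (a)
mkdir -p urls
echo "http://localhost:8080/mytest/filestore" > urls/mytest
cat urls/mytest   # prints the single seed URL
```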
b) Edited conf/nutch-site.xml to add these entries (including pdf
among the parse plugins) :-
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least to include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use
HTTPS please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
c) Made sure conf/crawl-urlfilter.txt doesn't skip pdf files, and
added this line for my domain :-
+^http://([a-z0-9]*\.)*localhost:8080/
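For context, the relevant part of conf/crawl-urlfilter.txt might look like the sketch below. These lines are illustrative only (the exact default suffix list varies between Nutch releases, so the skip list here is an assumption, not copied from any version):

```
# skip suffixes we don't want fetched -- make sure pdf/PDF is NOT in
# this list, otherwise PDFs are filtered out before parsing
-\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe)$

# accept anything on the local host; the ([a-z0-9]*\.)* prefix group
# is optional, so a plain "localhost:8080" URL matches too
+^http://([a-z0-9]*\.)*localhost:8080/

# reject everything else
-.
```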
The filestore directory contains lots of PDFs, but executing
~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
(taken from the 0.8 tutorial) does not index the files.
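One way to see how far the crawl got is to check what the crawl command left behind (a sketch, assuming the directory layout the 0.8 tutorial's one-step crawl produces; none of these checks come from the post itself):

```shell
# after "bin/nutch crawl urls -dir crawl ..." finishes, the crawl dir
# should contain these subdirectories (layout per the 0.8 tutorial)
for d in crawl/crawldb crawl/linkdb crawl/segments crawl/index; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d"; fi
done
# "bin/nutch readdb crawl/crawldb -stats" then reports how many URLs
# were actually fetched, which tells you whether the problem is in
# fetching/filtering or in parsing/indexing
```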
Any help much appreciated!