Hope someone can help. I'd like to index and search only a single directory of my website, but so far neither building the index nor the subsequent searches work. Here's my config :-

Url of files to index : http://localhost:8080/mytest/filestore

a) Under the nutch root directory (i.e. ~/nutch), I created a file urls/mytest that contains just this entry :-

http://localhost:8080/mytest/filestore
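
For reference, that seed file was created like this (run from the nutch root; "mytest" is just the filename I picked, nothing special):

```shell
# create the seed-url directory and a file containing the one URL to crawl
mkdir -p urls
echo "http://localhost:8080/mytest/filestore" > urls/mytest

# sanity check
cat urls/mytest
```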

b) Edited conf/nutch-site.xml to add these entries (including the PDF parser) :-

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>plugin.includes</name>

<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and added this line for my domain :-

+^http://([a-z0-9]*\.)*localhost:8080/
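
Incidentally, since I only want that one directory, I believe the accept rule could be tightened. The regex-urlfilter patterns are tried top-down with the first match winning, so I think the relevant part of crawl-urlfilter.txt would end up looking roughly like this (a sketch, comments are mine) :-

```
# skip URLs containing characters that are probably queries
-[?*!@=]

# accept anything under the filestore directory
+^http://localhost:8080/mytest/filestore

# reject everything else
-.
```
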

The filestore directory contains lots of pdfs, but executing :-

~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50

(taken from the 0.8 tutorial) does not index the files.
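For what it's worth, I believe the crawl can be sanity-checked with the readdb command (this assumes the crawl dir created by the command above; syntax as I understand it from the 0.8 distribution) :-

```
# print crawldb statistics - total URLs and fetch-status counts,
# which should show whether anything was fetched at all
~/nutch/bin/nutch readdb crawl/crawldb -stats
```

If the stats show zero fetched pages, the problem is presumably in the filters or the fetch, not the indexing step.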

Any help much appreciated !
