Hope someone can help. I'd like to index and search only a single
directory of my website, but it doesn't work so far (neither building
the index nor subsequent searches succeed). Here's my config :-
URL of the files to index : http://localhost:8080/mytest/filestore
a) Under the nutch root directory (i.e. ~/nutch), I created a file
urls/mytest that contains just this entry :-
http://localhost:8080/mytest/filestore
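For reference, step (a) amounts to the following (a minimal sketch, run from the Nutch root, i.e. ~/nutest in the post's layout would be ~/nutch; directory and file names are the ones given above):

```shell
# create the seed-list directory and the one-line seed file from step (a)
mkdir -p urls
echo "http://localhost:8080/mytest/filestore" > urls/mytest
cat urls/mytest   # prints the single seed URL
```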
b) Edited conf/nutch-site.xml to add these entries (including pdf
among the parse plugins) :-
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least to include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use
HTTPS please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
c) Made sure conf/crawl-urlfilter.txt doesn't skip pdf files, and
added this line for my domain :-
+^http://([a-z0-9]*\.)*localhost:8080/
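For context, the relevant part of conf/crawl-urlfilter.txt might look like the sketch below. These lines are illustrative only (the exact default suffix list varies between Nutch releases, so the skip list here is an assumption, not copied from any version):

```
# skip suffixes we don't want fetched -- make sure pdf/PDF is NOT in
# this list, otherwise PDFs are filtered out before parsing
-\.(gif|GIF|jpg|JPG|png|PNG|css|zip|exe)$

# accept anything on the local host; the ([a-z0-9]*\.)* prefix group
# is optional, so a plain "localhost:8080" URL matches too
+^http://([a-z0-9]*\.)*localhost:8080/

# reject everything else
-.
```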
The filestore directory contains lots of PDFs, but executing
~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
(taken from the 0.8 tutorial) does not index the files.
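One way to see how far the crawl got is to check what the crawl command left behind (a sketch, assuming the directory layout the 0.8 tutorial's one-step crawl produces; none of these checks come from the post itself):

```shell
# after "bin/nutch crawl urls -dir crawl ..." finishes, the crawl dir
# should contain these subdirectories (layout per the 0.8 tutorial)
for d in crawl/crawldb crawl/linkdb crawl/segments crawl/index; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d"; fi
done
# "bin/nutch readdb crawl/crawldb -stats" then reports how many URLs
# were actually fetched, which tells you whether the problem is in
# fetching/filtering or in parsing/indexing
```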
Any help much appreciated!