I've googled and googled and googled. I'm trying to crawl my local file
system with Nutch and can't seem to get it right.
I use this command:

bin/nutch crawl urls -dir crawl
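(For reference, the fuller form with explicit depth and topN flags, as I
understand it from the tutorial, would be something like

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

but the plain command above, with the defaults, is what I've actually been
running.)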
My urls dir contains one file (named "files") that looks like this:
file:///c:/joms
The directory c:/joms exists.
I've modified the config file crawl-urlfilter.txt:
#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else
#-.
+.*
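(My understanding, which may well be wrong, is that the filter applies the
first matching rule top-down, so for my seed this should play out as:

file:///c:/joms  vs  -^(http|ftp|mailto|sw|swf):  ->  no match
file:///c:/joms  vs  +.*                          ->  match, accept

i.e. the file: URL should survive this filter.)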
And I've modified nutch-site.xml, adding the following inside the
<configuration> element:
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

</configuration>
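(My understanding of file.content.limit is that -1 turns truncation off
entirely; otherwise fetched content gets cut at the default limit, which is
65536 bytes if I'm reading nutch-default.xml right.)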
And lastly, I've modified regex-urlfilter.txt:
# file systems
+^file:///c:/top/directory/
-.

# skip file:, ftp:, and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.
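To sanity-check these patterns I threw together a quick standalone test with
plain java.util.regex. It is only my guess at the filter's behavior (rules
tried top-down, first match wins, find() semantics), not Nutch's actual
RegexURLFilter, so take it with a grain of salt:

import java.util.regex.Pattern;

// Rough stand-in for the URL filter: try each rule top-down and report
// the first one whose pattern matches the URL. The top-down, first-match
// semantics are my assumption, not taken from the Nutch source.
public class FilterCheck {
    public static void main(String[] args) {
        String url = "file:///c:/joms"; // my seed URL
        String[][] rules = {
            { "+", "^file:///c:/top/directory/" },
            { "-", "." },
            { "-", "^(http|ftp|mailto):" },
            { "-", "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$" },
            { "-", "[?*!@=]" },
            { "-", ".*(/.+?)/.*?\\1/.*?\\1/" },
            { "+", "." }
        };
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                System.out.println("first match: " + rule[0] + rule[1]);
                return;
            }
        }
        System.out.println("no rule matched (URL rejected by default)");
    }
}

When I run it against my seed it reports the "-." rule as the first match,
though I don't know whether my approximation reflects what Nutch really does.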
I don't get any errors, but nothing gets crawled either. If anyone can point
out my mistake(s), I would greatly appreciate it.
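For what it's worth, to see whether anything is landing in the crawl db at
all, I've been checking it with the stats command from bin/nutch's usage
listing (assuming crawl/crawldb is the right path, given -dir crawl above):

bin/nutch readdb crawl/crawldb -stats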
thanks in advance
jim s
PS: It would also be nice to know whether this email is getting through to
the nutch-users mailing list.