I've googled and googled and googled. I'm trying to crawl my local file
system with Nutch and can't seem to get it right.
I use this command:

bin/nutch crawl urls -dir crawl
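(For reference, the fuller form with explicit depth and topN flags, as I
understand it from the tutorial, would be something like

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

but the plain command above, with the defaults, is what I've actually been
running.)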
My urls dir contains one file (named "files") that looks like this:
file:///c:/joms
The directory c:/joms exists.
I've modified the config file crawl-urlfilter.txt:
#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else
#-.
+.*
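(My understanding, which may well be wrong, is that the filter applies the
first matching rule top-down, so for my seed this should play out as:

file:///c:/joms  vs  -^(http|ftp|mailto|sw|swf):  ->  no match
file:///c:/joms  vs  +.*                          ->  match, accept

i.e. the file: URL should survive this filter.)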
And I've modified nutch-site.xml, adding the following inside the
<configuration> element:
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

</configuration>
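(My understanding of file.content.limit is that -1 turns truncation off
entirely; otherwise fetched content gets cut at the default limit, which is
65536 bytes if I'm reading nutch-default.xml right.)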
And lastly, I've modified regex-urlfilter.txt:
# file systems
+^file:///c:/top/directory/
-.

# skip file:, ftp:, and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.
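To sanity-check these patterns I threw together a quick standalone test with
plain java.util.regex. It is only my guess at the filter's behavior (rules
tried top-down, first match wins, find() semantics), not Nutch's actual
RegexURLFilter, so take it with a grain of salt:

import java.util.regex.Pattern;

// Rough stand-in for the URL filter: try each rule top-down and report
// the first one whose pattern matches the URL. The top-down, first-match
// semantics are my assumption, not taken from the Nutch source.
public class FilterCheck {
    public static void main(String[] args) {
        String url = "file:///c:/joms"; // my seed URL
        String[][] rules = {
            { "+", "^file:///c:/top/directory/" },
            { "-", "." },
            { "-", "^(http|ftp|mailto):" },
            { "-", "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$" },
            { "-", "[?*!@=]" },
            { "-", ".*(/.+?)/.*?\\1/.*?\\1/" },
            { "+", "." }
        };
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                System.out.println("first match: " + rule[0] + rule[1]);
                return;
            }
        }
        System.out.println("no rule matched (URL rejected by default)");
    }
}

When I run it against my seed it reports the "-." rule as the first match,
though I don't know whether my approximation reflects what Nutch really does.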
I don't get any errors, but nothing gets crawled either. If anyone can point
out my mistake(s), I would greatly appreciate it.
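For what it's worth, to see whether anything is landing in the crawl db at
all, I've been checking it with the stats command from bin/nutch's usage
listing (assuming crawl/crawldb is the right path, given -dir crawl above):

bin/nutch readdb crawl/crawldb -stats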
thanks in advance
jim s
PS: It would also be nice to know whether this email is getting through to
the nutch-users mailing list.