I googled and googled and googled. I am trying to crawl my local file system
and can't seem to get it right.

I use this command

bin/nutch crawl urls -dir crawl

My urls dir contains one file (named "files") that looks like this:

file:///c:/joms

c:/joms exists

I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):

# skip everything else ..... web spaces
#-.
+.*
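As a sanity check on my understanding: these regex filters apply their rules top-down, and the first matching rule wins ('+' accepts, '-' rejects). Here's a little Python sketch of that semantics (illustrative only, not Nutch's actual code), using the two active rules above:

```python
import re

# Sketch of Nutch's first-match-wins regex filter semantics:
# '+' accepts, '-' rejects; rules mirror my edited crawl-urlfilter.txt.
RULES = [
    ('-', re.compile(r'^(http|ftp|mailto|sw|swf):')),  # reject these schemes
    ('+', re.compile(r'.*')),                          # accept everything else
]

def accepts(url):
    for sign, pattern in RULES:
        if pattern.search(url):  # a rule can match anywhere in the URL
            return sign == '+'
    return False                 # no rule matched: URL is filtered out

print(accepts('file:///c:/joms'))     # → True  (file: falls through to '+.*')
print(accepts('http://example.com'))  # → False (rejected by the first rule)
```

So the crawl-urlfilter.txt side looks right to me: file: URLs fall through to the final `+.*`.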


And I've added this to the config file nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
</configuration>
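For completeness, both properties sit inside the single <configuration> root of nutch-site.xml; the whole file after my edits looks roughly like this:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```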


And lastly I've modified regex-urlfilter.txt:
#file systems
+^file:///c:/top/directory/
-.

# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.
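Running the same first-match-wins logic over the rules above is the part I'm least sure about: `-.` comes right after `+^file:///c:/top/directory/`, so if I read it correctly, every URL outside that prefix is rejected before any of the later rules (including the final `+.`) is ever reached. A small Python sketch of that reading (again, not Nutch's actual code):

```python
import re

# The first two active rules from my regex-urlfilter.txt, in file order;
# first match wins, so '-.', which matches any URL, makes every rule
# after it unreachable.
RULES = [
    ('+', re.compile(r'^file:///c:/top/directory/')),  # accept this subtree
    ('-', re.compile(r'.')),                           # reject everything else
]

def accepts(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == '+'
    return False

print(accepts('file:///c:/top/directory/readme.txt'))  # → True
print(accepts('file:///c:/joms'))                       # → False
```

If that reading is right, my seed file:///c:/joms wouldn't survive this filter, which may be exactly my mistake.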


I don't get any errors but nothing gets crawled either. If anyone can point 
out my mistake(s) I would greatly appreciate it.

thanks in advance

jim s


PS: it would also be nice to know whether this email is getting through to the
nutch-users mailing list.





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
