I googled and googled and googled. I am trying to crawl my local file system and can't seem to get it right.
I use this command:

    bin/nutch crawl urls -dir crawl

My urls directory contains one file (named "files") that looks like this:

    file:///c:/joms

c:/joms exists.

I've modified the config file crawl-urlfilter.txt:

    #-^(file|ftp|mailto|sw|swf):
    -^(http|ftp|mailto|sw|swf):

    # skip everything else ..... web spaces
    #-.
    +.*

And the config file nutch-site.xml, adding:

    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    </property>
    <property>
      <name>file.content.limit</name>
      <value>-1</value>
    </property>
    </configuration>

And lastly, I've modified regex-urlfilter.txt:

    # file systems
    +^file:///c:/top/directory/
    -.

    # skip file: ftp: and mailto: urls
    #-^(file|ftp|mailto):
    -^(http|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/

    # accept anything else
    +.

I don't get any errors, but nothing gets crawled either. If anyone can point out my mistake(s), I would greatly appreciate it. I've also put a couple of my own guesses after the signature.

Thanks in advance,
jim s

P.S. It would also be nice to know that this email is getting into the nutch-users mailing list.
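P.P.S. My guesses, in case they help someone spot the problem. If I understand the regex URL filter correctly, the rules are applied top-down and the first matching rule wins. In my regex-urlfilter.txt, the seed file:///c:/joms does not match +^file:///c:/top/directory/, so the very next rule, -., rejects it along with everything else. Assuming the seed stays file:///c:/joms, I would expect a minimal filter along these lines to accept it:

    # accept everything under the local tree being crawled
    +^file:///c:/joms
    # reject everything else
    -.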
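I'm also not sure which of my two filter files is actually in effect: my understanding is that the one-step "bin/nutch crawl" command reads conf/crawl-urlfilter.txt (via crawl-tool.xml), while the step-by-step tools (inject, generate, fetch, ...) read conf/regex-urlfilter.txt, so only one of my two edits may matter here.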
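Finally, to check whether anything made it into the crawl at all, I believe the crawl database can be inspected with:

    bin/nutch readdb crawl/crawldb -stats

If the seed was injected but never fetched, that should show up in the status counts.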
