Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "RunningNutchAndSolr" page has been changed by SeanOConnor.
The comment on this change is: minor formatting changes to address run-on commands.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=31&rev2=32

--------------------------------------------------

  <str name="qf">
- content^0.5 anchor^1.0 title^1.2 </str>
+ content^0.5 anchor^1.0 title^1.2 </str>
- <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
  <str name="fl"> url </str>
@@ -116, +116 @@

  -^(https|telnet|file|ftp|mailto):
- # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes
- # skip URLs containing certain characters as probable queries, etc. -[...@=]
+ -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
- # allow urls in foofactory.fi domain +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+ # skip URLs containing certain characters as probable queries, etc.
+ -[...@=]
+
+ # allow urls in foofactory.fi domain (or lucidimagination.com...)
+
+ +^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/
+
- # deny anything else -.
+ # deny anything else
+
+ -.
  '''8.''' Create a seed list (the initial urls to fetch)
+ mkdir urls
- mkdir urls echo "http://www.lucidimagination.com/" > urls/seed.txt
+ echo "http://www.lucidimagination.com/" > urls/seed.txt
  '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
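The URL filter rules in this diff are order-sensitive: a leading "-" rejects, a leading "+" accepts, and the final "-." rejects whatever is left. The following Python sketch is only an illustration of that logic, not Nutch code. It assumes the first rule whose pattern is found anywhere in the URL decides the outcome. The names RULES and accepts are made up for this example, the suffix list is shortened to a few lowercase entries, and the character-class rule is left out because its contents are elided in this notification.

```python
import re

# Rules in file order as (sign, pattern): "+" accepts, "-" rejects.
# Abbreviated from the filter in the diff; the "-[...@=]" rule is omitted
# because its character class is elided in the notification text.
RULES = [
    ("-", r"^(https|telnet|file|ftp|mailto):"),
    ("-", r"\.(swf|doc|mp3|pdf|js|gif|jpg|png|ico|css|zip|exe)$"),  # shortened list
    ("+", r"^http://([a-z0-9\-A-Z]*\.)*lucidimagination.com/"),
    ("-", r"."),  # deny anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.lucidimagination.com/",
        "https://www.lucidimagination.com/",
        "http://www.lucidimagination.com/logo.png",
        "http://example.org/",
    ):
        print(candidate, accepts(candidate))
```

Under these assumptions only the first candidate is accepted, which is why the seed list in step 8 uses the plain http:// URL for the allowed domain.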

