$ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3

URLS:
http://node5:8080/docs/
http://node5:8080/throughmyeyes/
http://node5:8080/docs2/
http://node5:8080/throughmyeyes2/
http://node5:8080/docs3/
http://node5:8080/throughmyeyes3/
http://node5:8080/empty-1.html
http://node5:8080/empty-2.html
http://node5:8080/empty-3.html
http://node5:8080/empty-4.html
http://node5:8080/empty-5.html
http://node5:8080/empty-6.html
http://node5:8080/empty-7.html
http://node5:8080/empty-8.html
http://node5:8080/empty-9.html
http://node5:8080/empty-10.html
http://node5:8080/empty-11.html
http://node5:8080/empty-12.html
http://node5:8080/empty-13.html
http://node5:8080/empty-14.html
http://node5:8080/empty-15.html
http://node5:8080/empty-16.html
http://node5:8080/empty-17.html
http://node5:8080/empty-18.html
http://node5:8080/empty-19.html
http://node5:8080/empty-20.html
http://node5:8080/empty-21.html
http://node5:8080/empty-22.html
http://node5:8080/empty-23.html
http://node5:8080/empty-24.html
http://node5:8080/empty-25.html
http://node5:8080/empty-26.html
http://node5:8080/empty-27.html
http://node5:8080/empty-28.html
http://node5:8080/empty-29.html
http://node5:8080/empty-30.html
http://node5:8080/empty-31.html
http://node5:8080/empty-32.html
http://node5:8080/empty-33.html
http://node5:8080/empty-34.html
http://node5:8080/empty-35.html
http://node5:8080/empty-36.html
http://node5:8080/empty-37.html
http://node5:8080/empty-38.html
http://node5:8080/empty-39.html
http://node5:8080/empty-40.html
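For reference, Nutch 0.9's injector reads every flat text file inside the `urls` directory given to the crawl command, one URL per line. A minimal sketch of how such a seed directory could be laid out (the file name `seeds.txt` is an arbitrary choice, not something from the original message):

```shell
# Recreate a minimal seed directory. Nutch reads every plain-text
# file in the directory; each non-blank line is one seed URL.
mkdir -p urls
cat > urls/seeds.txt <<'EOF'
http://node5:8080/docs/
http://node5:8080/throughmyeyes/
http://node5:8080/empty-1.html
EOF
```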
conf/crawl-urlfilter.txt (comments removed):

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
[email protected]
-.*(/.+?)/.*?\1/.*?\1/
+^http://node[0-9].vanilla7-([a-z0-9]*\.)*
+^http://node[0-9].vanilla7pc600-([a-z0-9]*\.)*
+^http://node[0-9]:8080/
+^http://([a-z0-9]*\.)*apache.org

(also tried this with '+*' and '+.'; neither worked either)

nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>NutchCrawler</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>0.9</value>
</property>

nutch-default.xml:

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>ExperimentalCrawler</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
(also tried this with urlfilter-crawl included; it didn't work either)
<property>
  <name>plugin.excludes</name>
  <value></value>
</property>

regex-urlfilter.txt:

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
[email protected]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

On Wed, Feb 20, 2008 at 5:20 PM, John Mendenhall <[EMAIL PROTECTED]> wrote:
> > Any help at all would be much appreciated.
>
> Submit your submitted command, plus a sample of the
> urls in the url file, plus your filter. We can start
> from there.
>
> JohnM
>
> --
> john mendenhall
> [EMAIL PROTECTED]
> surf utopia
> internet services
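To see how the first-match-wins rule plays out against these seeds, here is a small Python sketch. This is not Nutch's actual RegexURLFilter (which is Java), but it follows the same semantics described in the comments above: patterns are tried in order as substring searches, the first match decides via its '+'/'-' prefix, and an unmatched URL is ignored. The rule list is copied from the regex-urlfilter.txt fragment in this message:

```python
import re

# Rules copied from the regex-urlfilter.txt shown above: (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt"
          r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]

def accepts(url):
    """First matching pattern decides; no match means the URL is ignored."""
    for sign, pattern in RULES:
        if re.search(pattern, url):   # substring search, like Java's find()
            return sign == "+"
    return False

print(accepts("http://node5:8080/docs/"))     # True: falls through to '+.'
print(accepts("http://node5:8080/logo.gif"))  # False: '.gif' suffix rule
print(accepts("ftp://node5/readme.txt"))      # False: 'ftp:' scheme rule
```

With this filter, every seed URL in the list above should be accepted, since none hits a '-' rule before the catch-all '+.'.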
