$ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3

URLS:
http://node5:8080/docs/
http://node5:8080/throughmyeyes/
http://node5:8080/docs2/
http://node5:8080/throughmyeyes2/
http://node5:8080/docs3/
http://node5:8080/throughmyeyes3/
http://node5:8080/empty-1.html
http://node5:8080/empty-2.html
http://node5:8080/empty-3.html
http://node5:8080/empty-4.html
http://node5:8080/empty-5.html
http://node5:8080/empty-6.html
http://node5:8080/empty-7.html
http://node5:8080/empty-8.html
http://node5:8080/empty-9.html
http://node5:8080/empty-10.html
http://node5:8080/empty-11.html
http://node5:8080/empty-12.html
http://node5:8080/empty-13.html
http://node5:8080/empty-14.html
http://node5:8080/empty-15.html
http://node5:8080/empty-16.html
http://node5:8080/empty-17.html
http://node5:8080/empty-18.html
http://node5:8080/empty-19.html
http://node5:8080/empty-20.html
http://node5:8080/empty-21.html
http://node5:8080/empty-22.html
http://node5:8080/empty-23.html
http://node5:8080/empty-24.html
http://node5:8080/empty-25.html
http://node5:8080/empty-26.html
http://node5:8080/empty-27.html
http://node5:8080/empty-28.html
http://node5:8080/empty-29.html
http://node5:8080/empty-30.html
http://node5:8080/empty-31.html
http://node5:8080/empty-32.html
http://node5:8080/empty-33.html
http://node5:8080/empty-34.html
http://node5:8080/empty-35.html
http://node5:8080/empty-36.html
http://node5:8080/empty-37.html
http://node5:8080/empty-38.html
http://node5:8080/empty-39.html
http://node5:8080/empty-40.html

conf/crawl-urlfilter.txt (comments removed):
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://node[0-9].vanilla7-([a-z0-9]*\.)*
+^http://node[0-9].vanilla7pc600-([a-z0-9]*\.)*
+^http://node[0-9]:8080/
+^http://([a-z0-9]*\.)*apache.org

(I also tried this with '+*' and with '+.'; neither worked.)
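For what it's worth, the '+' (accept) patterns above can be sanity-checked against the seed URLs outside of Nutch. This is a quick Python sketch using only the accept lines; Nutch itself applies first-match-wins over the whole filter file, which this ignores:

```python
import re

# '+' (accept) patterns copied from the crawl-urlfilter.txt above
accept_patterns = [
    r"^http://node[0-9].vanilla7-([a-z0-9]*\.)*",
    r"^http://node[0-9].vanilla7pc600-([a-z0-9]*\.)*",
    r"^http://node[0-9]:8080/",
    r"^http://([a-z0-9]*\.)*apache.org",
]

seeds = [
    "http://node5:8080/docs/",
    "http://node5:8080/empty-1.html",
]

for url in seeds:
    matched = any(re.search(p, url) for p in accept_patterns)
    print(url, "-> accepted" if matched else "-> rejected")
# Both seeds match ^http://node[0-9]:8080/
```

So the accept side looks right for these seeds; the problem would have to be elsewhere (an earlier '-' rule, or the filter file Nutch actually loads).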

nutch-site.xml:
<property>
        <name>http.agent.name</name>
        <value>NutchCrawler</value>
</property>

<property>
        <name>http.agent.version</name>
        <value>0.9</value>
</property>

nutch-default.xml:
<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
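(Aside: the 403 behavior described above boils down to the following; a sketch of the documented policy, not Nutch's actual code:)

```python
def forbidden_by_robots(robots_status, allow_403=True):
    # Sketch of the http.robots.403.allow policy described above:
    # a 403 on /robots.txt is treated like a missing robots.txt when
    # the flag is true, and as "whole site forbidden" when it is false.
    if robots_status == 403:
        return not allow_403
    # (for a 200 response, real Nutch parses the robots.txt rules)
    return False

print(forbidden_by_robots(403, allow_403=True))   # False -> crawl proceeds
print(forbidden_by_robots(403, allow_403=False))  # True  -> site skipped
```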

<property>
  <name>http.agent.description</name>
  <value>ExperimentalCrawler</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
</property>
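(For reference, judging from the property descriptions above, the advertised User-Agent is assembled roughly like this; a sketch of the documented "name/version (description; url; email)" format, not Nutch's exact code:)

```python
def build_user_agent(name, version, description=None, url=None, email=None):
    # "name/version (description; url; email)" per the descriptions
    # above; unset fields are simply omitted from the parenthesis.
    extras = "; ".join(p for p in (description, url, email) if p)
    return f"{name}/{version}" + (f" ({extras})" if extras else "")

print(build_user_agent("NutchCrawler", "0.9",
                       description="ExperimentalCrawler",
                       url="http://lucene.apache.org/nutch/"))
```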

<property>
  <name>plugin.includes</name>
  
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <!-- also tried this with urlfilter-crawl included; didn't work either -->
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
</property>
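(Note: plugin.includes is a regular expression matched against plugin ids. Assuming whole-id matching, the value above can be checked like this; a quick Python sketch, not how Nutch loads plugins:)

```python
import re

# plugin.includes value copied from above
includes = (r"protocol-http|urlfilter-regex|parse-(text|html|js)|"
            r"index-basic|query-(basic|site|url)|summary-basic|"
            r"scoring-opic|urlnormalizer-(pass|regex|basic)")

# Assuming the id must match the whole expression (fullmatch):
for plugin_id in ["protocol-http", "urlfilter-regex", "parse-html",
                  "protocol-httpclient"]:
    ok = re.fullmatch(includes, plugin_id) is not None
    print(plugin_id, "->", "included" if ok else "excluded")
```

Under that assumption, protocol-http, urlfilter-regex, and parse-html are all included, so the regex filter plugin should be active.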

regex-urlfilter.txt:
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.
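The first-match-wins semantics described in the comments above can be sketched in a few lines (Python, with the suffix rule abbreviated; a hypothetical helper, not part of Nutch):

```python
import re

# Abbreviated version of the rules above, in file order.
rules = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG)$"),  # suffix list shortened here
    ("+", r"."),                             # accept anything else
]

def accepted(url):
    # The first matching pattern decides; if none matches,
    # the URL is ignored.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

print(accepted("http://node5:8080/docs/"))  # True
print(accepted("mailto:someone"))           # False
```

With a final '+.' every URL that survives the '-' rules is accepted, so a URL can only be dropped by one of the earlier '-' lines.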


On Wed, Feb 20, 2008 at 5:20 PM, John Mendenhall <[EMAIL PROTECTED]> wrote:
> > Any help at all would be much appreciated.
>
>  Send the command you submitted, plus a sample of the
>  urls in the url file, plus your filter.  We can start
>  from there.
>
>  JohnM
>
>  --
>  john mendenhall
>  [EMAIL PROTECTED]
>  surf utopia
>  internet services
>
