Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these "db.max" limits? This would help users find out when they need to adjust their configuration.

I can prepare a patch if it seems sensible.
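Roughly what I have in mind, as a standalone sketch (not the actual Nutch code path; the class and helper below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

public class OutlinkCap {
    private static final Logger LOG = Logger.getLogger(OutlinkCap.class.getName());

    // Hypothetical helper: truncate a page's outlinks at the configured
    // maximum, and log when the cap is actually hit so users notice.
    static List<String> cap(List<String> outlinks, int max, String url) {
        if (outlinks.size() > max) {
            LOG.info("db.max.outlinks.per.page (" + max + ") reached for " + url
                    + "; dropping " + (outlinks.size() - max) + " of "
                    + outlinks.size() + " outlinks");
            return outlinks.subList(0, max);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("a", "b", "c", "d");
        // With a cap of 2, only the first two outlinks survive.
        System.out.println(cap(links, 2, "http://example.com/").size()); // prints 2
    }
}
```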

--Matt

On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:

Thank you very much,

I changed db.max.outlinks.per.page and db.max.anchor.length to 200, and the whole web site got indexed.
This particular web site has more than 100 outbound links per page.
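For anyone else hitting this, a minimal sketch of how the overrides might look in conf/nutch-site.xml (property names as in nutch-default.xml; the values are the ones I used, and the defaults noted are only what was mentioned in this thread - check your own nutch-default.xml):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of outlinks kept per page (default reportedly 100). -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
  <!-- Maximum length of anchor text kept per link. -->
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
  </property>
</configuration>
```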

Michael

----- Original Message ----- From: "Steven Yelton" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages


Is it not catching all the outbound links?

db.max.outlinks.per.page

I think the default is 100. I had to bump it up significantly to index a reference site...

Steven

Michael Plax wrote:

Hello,

Question summary:
Q: How can I set up the crawler so that it indexes the whole web site?

I'm trying to run crawl with command from tutorial

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawledtottaly/db -stats
  output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 155526 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 63
 Number of links: 3906
6. I get fewer pages than I expected.

What I did:
0. I read http://www.mail-archive.com/nutch-[EMAIL PROTECTED]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get that page indexed.
   output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 162103 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 64
 Number of links: 3906
This page appears at depth 3 from index.html.
Q: How can I set up the crawler so that it indexes the whole web site?

Thank you
Michael

P.S.
I have attached configuration files

============================
urls
============================
http://www.totallyfurniture.com/index.html


============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/


# skip everything else
-.
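To illustrate the first-match rule described in the comments above, here is a tiny self-contained sketch (not Nutch's actual RegexURLFilter, just the same first-match-wins idea, using a couple of the rules from this file):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // Simplified first-match-wins filter: each rule is {sign, regex};
    // the first regex that matches anywhere in the URL decides its fate,
    // and an unmatched URL is ignored - mirroring crawl-urlfilter.txt.
    static boolean accept(String url, String[][] rules) {
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[][] rules = {
            {"-", "^(file|ftp|mailto):"},
            {"-", "[?*!@=]"},
            {"+", "^http://([a-z0-9]*\\.)*totallyfurniture\\.com/"},
            {"-", "."}
        };
        System.out.println(accept("http://www.totallyfurniture.com/index.html", rules)); // prints true
        System.out.println(accept("http://www.totallyfurniture.com/cart?id=1", rules));  // prints false
    }
}
```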



--
Matt Kangas / [EMAIL PROTECTED]

