Thank you very much,
I changed db.max.outlinks.per.page and db.max.anchor.length to 200 and got the
whole web site indexed.
This particular web site has more than 100 outbound links per page.
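For the archives, the overrides went into conf/nutch-site.xml; the snippet
below is a sketch of what I added (the property names come from
nutch-default.xml, the comments are my own wording):

<!-- raise the per-page outlink cap; the shipped default is 100 -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
</property>

<!-- raise the anchor text length cap as well -->
<property>
  <name>db.max.anchor.length</name>
  <value>200</value>
</property>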
Michael
----- Original Message -----
From: "Steven Yelton" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages
Is it not catching all the outbound links?
db.max.outlinks.per.page
I think the default is 100. I had to bump it up significantly to index a
reference site...
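(You can see the shipped default with something like:

$ grep -A 2 db.max.outlinks.per.page conf/nutch-default.xml   # value should show 100

Put the override in conf/nutch-site.xml rather than editing nutch-default.xml,
so it survives upgrades.)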
Steven
Michael Plax wrote:
Hello,
Question summary:
Q: How can I set up the crawler to index the whole web site?
I'm trying to run a crawl with the command from the tutorial:
1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >&
crawl.log
4. Crawling is finished.
5. I run: bin/nutch readdb crawledtottaly/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I get fewer pages than I expected.
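(To see which URLs actually made it into the db, and not just the count,
readdb can also dump them; I believe the WebDBReader usage includes a
-dumppageurl option:

$ bin/nutch readdb crawledtottaly/db -dumppageurl > pages.txt   # dump the pages in the db, to see what was skipped

)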
What I did:
0. I read
http://www.mail-archive.com/[email protected]/msg02458.html
1. I changed the depth to 10, 100, and 1000; same results.
2. I changed the start page to a page that did not appear in the stats; I do
get that page indexed.
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906
This page appears at depth 3 from index.html.
Q: How can I set up the crawler to index the whole web site?
Thank you
Michael
P.S.
I have attached the configuration files.
============================
urls
============================
http://www.totallyfurniture.com/index.html
============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture\.com/
+^http://([a-z0-9]*\.)*yahoo\.net/
# skip everything else
-.
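(A side note on the filter file: I believe RegexURLFilter can be run
standalone to test it; if I remember right, its main() takes the rules file
as an optional argument, reads URLs from stdin, and echoes each one prefixed
with '+' (accepted) or '-' (rejected):

$ echo http://www.totallyfurniture.com/index.html | \
  bin/nutch org.apache.nutch.net.RegexURLFilter conf/crawl-urlfilter.txt   # expect a '+' line back

Handy for checking which pages the patterns above actually let through.)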