Thank you very much,
I changed db.max.outlinks.per.page and db.max.anchor.length to 200 and got
the whole web site indexed.
This particular web site has more than 100 outbound links per page.
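
For the archives, the overrides went into conf/nutch-site.xml, roughly as
below (a sketch: the property names are the ones from nutch-default.xml,
and 200 is simply the value that worked for this site).

============================
conf/nutch-site.xml (sketch)
============================
<?xml version="1.0"?>
<configuration>

<!-- Raise the per-page outlink cap; the default of 100 was dropping
     links on pages with more than 100 outbound links. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
</property>

<!-- Raise the maximum anchor-text length to match. -->
<property>
  <name>db.max.anchor.length</name>
  <value>200</value>
</property>

</configuration>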
Michael
----- Original Message -----
From: "Steven Yelton" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages
Is it not catching all the outbound links?
db.max.outlinks.per.page
I think the default is 100. I had to bump it up significantly to index a
reference site...
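
For reference, the shipped entry in conf/nutch-default.xml looks roughly
like this (quoting from a 0.7-era copy; check your own tree to confirm):

<!-- As shipped; override the value in nutch-site.xml rather than
     editing this file. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that will be processed for
  a page.</description>
</property>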
Steven
Michael Plax wrote:
Hello,
Question summary:
Q: How can I set up the crawler to index an entire web site?
I'm trying to run a crawl with the command from the tutorial:
1. The urls file contains the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt I changed the domain.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling finishes.
5. I run: $ bin/nutch readdb crawledtottaly/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I get fewer pages than I expected.
What I did:
0. I read
http://www.mail-archive.com/[email protected]/msg02458.html
1. I changed the depth to 10, 100, and 1000 - same results.
2. I changed the start page to a page that did not appear - I do get that
page indexed.
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906
This page appears at depth 3 from index.html.
Q: How can I set up the crawler to index an entire web site?
Thank you
Michael
P.S.
I have attached the configuration files.
============================
urls
============================
http://www.totallyfurniture.com/index.html
============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/
# skip everything else
-.