Doug, would it make sense to print a LOG.info() message every time
the fetcher bumps into one of these "db.max" limits? This would help
users find out when they need to adjust their configuration.
I can prepare a patch if it seems sensible.
--Matt
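A minimal sketch of what such a log message could look like. This is not the actual Nutch fetcher code; the class, method, and message text are hypothetical, using plain java.util.logging to illustrate the idea of logging once when a page exceeds the outlink limit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch: log when the db.max.outlinks.per.page limit is hit. */
public class OutlinkLimitDemo {
  private static final Logger LOG = Logger.getLogger(OutlinkLimitDemo.class.getName());

  /** Keep at most maxOutlinks links, logging once when the limit truncates the list. */
  public static List<String> limitOutlinks(List<String> outlinks, int maxOutlinks, String url) {
    if (outlinks.size() <= maxOutlinks) {
      return outlinks;
    }
    LOG.info("Page " + url + " has " + outlinks.size()
        + " outlinks; keeping only db.max.outlinks.per.page=" + maxOutlinks);
    return new ArrayList<>(outlinks.subList(0, maxOutlinks));
  }

  public static void main(String[] args) {
    List<String> links = new ArrayList<>();
    for (int i = 0; i < 150; i++) {
      links.add("http://example.com/page" + i);
    }
    List<String> kept = limitOutlinks(links, 100, "http://example.com/");
    System.out.println(kept.size()); // prints 100
  }
}
```

With a message like this in the logs, a user in Michael's situation would see immediately which pages were being truncated and which property to raise.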
On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:
Thank you very much,
I changed db.max.outlinks.per.page and db.max.anchor.length to 200
and I got the whole web site indexed.
This particular web site has more than 100 outbound links per page.
Michael
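For reference, overrides like Michael's belong in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal fragment with the two values from this thread (200 is the value Michael chose, not a recommended default):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum outlinks kept per page; the thread notes the default is 100. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
  <!-- Maximum anchor text length kept per link. -->
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
  </property>
</configuration>
```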
----- Original Message ----- From: "Steven Yelton"
<[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages
Is it not catching all the outbound links?
db.max.outlinks.per.page
I think the default is 100. I had to bump it up significantly to
index a reference site...
Steven
Michael Plax wrote:
Hello,
Question summary:
Q: How can I set up the crawler to index the whole web site?
I'm trying to run a crawl with the command from the tutorial:
1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt, the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >&
crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawled/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I got fewer pages than I expected.
What I did:
0. I read http://www.mail-archive.com/nutch-[EMAIL PROTECTED]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get
that page indexed.
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906
That page appears at depth 3 from index.html.
Q: How can I set up the crawler to index the whole web site?
Thank you
Michael
P.S.
I have attached configuration files
============================
urls
============================
http://www.totallyfurniture.com/index.html
============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/
# skip everything else
-.
--
Matt Kangas / [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general