Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these "db.max" limits? This would help users find out when they need to adjust their configuration.

I can prepare a patch if it seems sensible.
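Roughly what I have in mind, as a standalone sketch (not the actual Nutch code path; the class and helper below are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

public class OutlinkCap {
    private static final Logger LOG = Logger.getLogger(OutlinkCap.class.getName());

    // Hypothetical helper: truncate a page's outlinks at the configured
    // maximum, and log when the cap is actually hit so users notice.
    static List<String> cap(List<String> outlinks, int max, String url) {
        if (outlinks.size() > max) {
            LOG.info("db.max.outlinks.per.page (" + max + ") reached for " + url
                    + "; dropping " + (outlinks.size() - max) + " of "
                    + outlinks.size() + " outlinks");
            return outlinks.subList(0, max);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("a", "b", "c", "d");
        // With a cap of 2, only the first two outlinks survive.
        System.out.println(cap(links, 2, "http://example.com/").size()); // prints 2
    }
}
```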

--Matt

On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:

Thank you very much,

I changed db.max.outlinks.per.page and db.max.anchor.length to 200, and the whole web site got indexed.
This particular web site has more than 100 outbound links per page.
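For anyone else hitting this, a minimal sketch of how the overrides might look in conf/nutch-site.xml (property names as in nutch-default.xml; the values are the ones I used, and the defaults noted are only what was mentioned in this thread - check your own nutch-default.xml):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum number of outlinks kept per page (default reportedly 100). -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
  <!-- Maximum length of anchor text kept per link. -->
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
  </property>
</configuration>
```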

Michael

----- Original Message ----- From: "Steven Yelton" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages


Is it not catching all the outbound links?

db.max.outlinks.per.page

I think the default is 100. I had to bump it up significantly to index a reference site...

Steven

Michael Plax wrote:

Hello,

Question summary:
Q: How can I set up the crawler so that it indexes the whole web site?

I'm trying to run crawl with command from tutorial

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawledtottaly/db -stats
  output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 155526 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 63
 Number of links: 3906
6. I get fewer pages than I expected.

What I did:
0. I read http://www.mail-archive.com/nutch-[EMAIL PROTECTED]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get that page indexed.
   output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 162103 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 64
 Number of links: 3906
This page appears at depth 3 from index.html.
Q: How can I set up the crawler so that it indexes the whole web site?

Thank you
Michael

P.S.
I have attached configuration files

============================
urls
============================
http://www.totallyfurniture.com/index.html


============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/


# skip everything else
-.
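To illustrate the first-match rule described in the comments above, here is a tiny self-contained sketch (not Nutch's actual RegexURLFilter, just the same first-match-wins idea, using a couple of the rules from this file):

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
    // Simplified first-match-wins filter: each rule is {sign, regex};
    // the first regex that matches anywhere in the URL decides its fate,
    // and an unmatched URL is ignored - mirroring crawl-urlfilter.txt.
    static boolean accept(String url, String[][] rules) {
        for (String[] rule : rules) {
            if (Pattern.compile(rule[1]).matcher(url).find()) {
                return rule[0].equals("+");
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[][] rules = {
            {"-", "^(file|ftp|mailto):"},
            {"-", "[?*!@=]"},
            {"+", "^http://([a-z0-9]*\\.)*totallyfurniture\\.com/"},
            {"-", "."}
        };
        System.out.println(accept("http://www.totallyfurniture.com/index.html", rules)); // prints true
        System.out.println(accept("http://www.totallyfurniture.com/cart?id=1", rules));  // prints false
    }
}
```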



--
Matt Kangas / [EMAIL PROTECTED]

