Doug, would it make sense to print a LOG.info() message every time
the fetcher bumps into one of these "db.max" limits? This would help
users find out when they need to adjust their configuration.
I can prepare a patch if it seems sensible.
--Matt
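A minimal sketch of what such a log message could look like. This is not the actual Nutch fetcher code; the class, method, and message text are hypothetical, using plain java.util.logging to illustrate the idea of logging once when a page exceeds the outlink limit:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch: log when the db.max.outlinks.per.page limit is hit. */
public class OutlinkLimitDemo {
  private static final Logger LOG = Logger.getLogger(OutlinkLimitDemo.class.getName());

  /** Keep at most maxOutlinks links, logging once when the limit truncates the list. */
  public static List<String> limitOutlinks(List<String> outlinks, int maxOutlinks, String url) {
    if (outlinks.size() <= maxOutlinks) {
      return outlinks;
    }
    LOG.info("Page " + url + " has " + outlinks.size()
        + " outlinks; keeping only db.max.outlinks.per.page=" + maxOutlinks);
    return new ArrayList<>(outlinks.subList(0, maxOutlinks));
  }

  public static void main(String[] args) {
    List<String> links = new ArrayList<>();
    for (int i = 0; i < 150; i++) {
      links.add("http://example.com/page" + i);
    }
    List<String> kept = limitOutlinks(links, 100, "http://example.com/");
    System.out.println(kept.size()); // prints 100
  }
}
```

With a message like this in the logs, a user in Michael's situation would see immediately which pages were being truncated and which property to raise.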
On Jan 19, 2006, at 5:34 PM, Michael Plax wrote:
Thank you very much,
I changed db.max.outlinks.per.page and db.max.anchor.length to 200
and I got the whole web site indexed.
This particular web site has more than 100 outbound links per page.
Michael
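For reference, overrides like Michael's belong in conf/nutch-site.xml, which takes precedence over nutch-default.xml. A minimal fragment with the two values from this thread (200 is the value Michael chose, not a recommended default):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Maximum outlinks kept per page; the thread notes the default is 100. -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>
  <!-- Maximum anchor text length kept per link. -->
  <property>
    <name>db.max.anchor.length</name>
    <value>200</value>
  </property>
</configuration>
```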
----- Original Message ----- From: "Steven Yelton"
<[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, January 19, 2006 5:29 AM
Subject: Re: Can't index some pages
Is it not catching all the outbound links?
db.max.outlinks.per.page
I think the default is 100. I had to bump it up significantly to
index a reference site...
Steven
Michael Plax wrote:
Hello,
Question summary:
Q: How can I set up the crawler to index the whole web site?
I'm trying to run a crawl with the command from the tutorial:
1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt, the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >&
crawl.log
4. Crawling is finished
5. I run: bin/nutch readdb crawled/db -stats
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906
6. I got fewer pages than I expected.
What I did:
0. I read http://www.mail-archive.com/nutch-[EMAIL PROTECTED]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear - I do get
that page indexed.
output:
$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906
That page appears at depth 3 from index.html.
Q: How can I set up the crawler to index the whole web site?
Thank you
Michael
P.S.
I have attached configuration files
============================
urls
============================
http://www.totallyfurniture.com/index.html
============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/
# skip everything else
-.
--
Matt Kangas / [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general