Is it not catching all the outbound links?

db.max.outlinks.per.page

I think the default is 100. I had to bump it up significantly to index a reference site...
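
A minimal sketch of the override, assuming the 0.7-era config format with a <nutch-conf> root element (later versions use <configuration>), and with 1000 as an arbitrary example value:

 <?xml version="1.0"?>
 <nutch-conf>
   <!-- Raise the per-page outlink cap (default 100) so that
        link-heavy pages contribute all of their outlinks. -->
   <property>
     <name>db.max.outlinks.per.page</name>
     <value>1000</value>
   </property>
 </nutch-conf>

Put it in conf/nutch-site.xml rather than editing nutch-default.xml, so the change survives upgrades.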

Steven

Michael Plax wrote:

Hello,

Question summary:
Q: How can I set up the crawler so that it indexes the whole web site?

I'm trying to run a crawl with the command from the tutorial:

1. In the urls file I have the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. The crawl finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats
  output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 155526 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 63
 Number of links: 3906
6. I get fewer pages than I expected (see the note right after this list).
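
One way to see exactly which URLs ended up in the db, assuming this Nutch version's readdb supports the -dumppageurl option (the 0.7 WebDBReader does), is to dump the page list and compare it against the site:

 $ bin/nutch readdb crawledtottaly/db -dumppageurl > pages.txt

Whatever part of the site is missing from pages.txt was either rejected by crawl-urlfilter.txt or never discovered as an outlink.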

What I did:
0. I read http://www.mail-archive.com/[email protected]/msg02458.html
1. I changed the depth to 10, 100, and 1000 - same results.
2. I changed the start page to a page that did not appear - I do get that page indexed.
   output:
 $ bin/nutch readdb crawledtottaly/db -stats
 run java in C:\Sun\AppServer\jdk
 060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
 060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
 060118 162103 No FS indicated, using default:local
 Stats for [EMAIL PROTECTED]
 -------------------------------
 Number of pages: 64
 Number of links: 3906
This page appears at depth 3 from index.html.

Q: How can I set up the crawler so that it indexes the whole web site?

Thank you
Michael

P.S.
I have attached the configuration files.

============================
urls
============================
http://www.totallyfurniture.com/index.html


============================
crawl-urlfilter.txt
============================
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.
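# For example (hypothetical URL): http://www.totallyfurniture.com/catalog?page=2
# would hit the -[?*!@=] line below before any '+' line is reached,
# so query-string pages are ignored even on the accepted domain.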

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]


# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/


# skip everything else
-.

