Hello,

Question summary: How can I set up the crawler so that it indexes the entire web site?
I'm trying to run a crawl with the command from the tutorial:

1. The urls file contains the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats

Output:

$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906

6. I get fewer pages than I expected.

What I did:

0. I read http://www.mail-archive.com/[email protected]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear in the index - that page does get indexed.

Output:

$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906

That page is at depth 3 from index.html.

Q: How can I set up the crawler so that it indexes the entire web site?

Thank you,
Michael

P.S. I have attached the configuration files.

============================
urls
============================

http://www.totallyfurniture.com/index.html

============================
crawl-url-filter.txt
============================

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/

# skip everything else
-.
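[Editor's note] The filter file's first-match rule is easy to misread, so here is a minimal Python sketch of the semantics it describes (this is illustrative only, not Nutch's actual implementation; the rule list is abridged from the file above, and the query-character rule is omitted because it was redacted in this message):

```python
import re

# Abridged rules from crawl-urlfilter.txt: each is a '+' (accept) or
# '-' (reject) sign plus a regex; the FIRST rule whose regex matches
# decides the URL's fate, and an unmatched URL is ignored.
RULES = [
    ("-", r"^(file|ftp|mailto):"),                           # skip non-http schemes
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|exe|png|PNG)$"),  # skip unparsable suffixes (abridged)
    ("+", r"^http://([a-z0-9]*\.)*totallyfurniture\.com/"),  # accept hosts in the site's domain
    ("-", r"."),                                             # skip everything else
]

def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: URL is ignored

print(accepts("http://www.totallyfurniture.com/index.html"))  # True
print(accepts("http://www.totallyfurniture.com/logo.gif"))    # False: suffix rule matches first
print(accepts("http://other-site.com/page.html"))             # False: catch-all '-.'
```

The key point is that rule order matters: a reject rule earlier in the file wins even if an accept rule further down would also match.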

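[Editor's note] On the observation that -depth 10, 100, and 1000 all produce the same page count: depth bounds how many link-hops from the seed the crawler will follow, so once every fetchable page lies within that radius, raising it further adds nothing. A toy breadth-first sketch (not Nutch code; the tiny link graph is invented for illustration) shows the effect:

```python
from collections import deque

# Hypothetical tiny site: page -> outlinks.
LINKS = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": [],
    "c.html": [],
}

def crawl(seed: str, depth: int) -> int:
    """Return how many pages a breadth-first crawl reaches within
    `depth` link-hops of the seed page."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        page, d = frontier.popleft()
        if d == depth:
            continue  # depth budget exhausted on this branch
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, d + 1))
    return len(seen)

print(crawl("index.html", 2))                                # 4: every page within 2 hops
print(crawl("index.html", 10) == crawl("index.html", 1000))  # True: extra depth adds nothing
```

So a page count that stays flat as depth grows suggests the missing pages are being excluded for another reason (for example, the URL filter), not for lack of depth.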