Hello,

Question summary: How can I set up the crawler so that it indexes the entire web site?
I'm trying to run a crawl with the command from the tutorial:

1. The urls file contains the start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt the domain was changed.
3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 >& crawl.log
4. Crawling finishes.
5. I run: bin/nutch readdb crawledtottaly/db -stats

Output:

$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml
060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml
060118 155526 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 63
Number of links: 3906

6. I get fewer pages than I expected.

What I did:

0. I read http://www.mail-archive.com/[email protected]/msg02458.html
1. I changed the depth to 10, 100, 1000 - same results.
2. I changed the start page to a page that did not appear in the index - that page does get indexed.

Output:

$ bin/nutch readdb crawledtottaly/db -stats
run java in C:\Sun\AppServer\jdk
060118 162103 parsing file:/C:/nutch/conf/nutch-default.xml
060118 162103 parsing file:/C:/nutch/conf/nutch-site.xml
060118 162103 No FS indicated, using default:local
Stats for [EMAIL PROTECTED]
-------------------------------
Number of pages: 64
Number of links: 3906

That page is at depth 3 from index.html.

Q: How can I set up the crawler so that it indexes the entire web site?

Thank you,
Michael

P.S. I have attached the configuration files.

============================
urls
============================

http://www.totallyfurniture.com/index.html

============================
crawl-url-filter.txt
============================

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*totallyfurniture.com/
+^http://([a-z0-9]*\.)*yahoo.net/

# skip everything else
-.
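[Editor's note] The filter file's first-match rule is easy to misread, so here is a minimal Python sketch of the semantics it describes (this is illustrative only, not Nutch's actual implementation; the rule list is abridged from the file above, and the query-character rule is omitted because it was redacted in this message):

```python
import re

# Abridged rules from crawl-urlfilter.txt: each is a '+' (accept) or
# '-' (reject) sign plus a regex; the FIRST rule whose regex matches
# decides the URL's fate, and an unmatched URL is ignored.
RULES = [
    ("-", r"^(file|ftp|mailto):"),                           # skip non-http schemes
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|zip|exe|png|PNG)$"),  # skip unparsable suffixes (abridged)
    ("+", r"^http://([a-z0-9]*\.)*totallyfurniture\.com/"),  # accept hosts in the site's domain
    ("-", r"."),                                             # skip everything else
]

def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no pattern matched: URL is ignored

print(accepts("http://www.totallyfurniture.com/index.html"))  # True
print(accepts("http://www.totallyfurniture.com/logo.gif"))    # False: suffix rule matches first
print(accepts("http://other-site.com/page.html"))             # False: catch-all '-.'
```

The key point is that rule order matters: a reject rule earlier in the file wins even if an accept rule further down would also match.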

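[Editor's note] On the observation that -depth 10, 100, and 1000 all produce the same page count: depth bounds how many link-hops from the seed the crawler will follow, so once every fetchable page lies within that radius, raising it further adds nothing. A toy breadth-first sketch (not Nutch code; the tiny link graph is invented for illustration) shows the effect:

```python
from collections import deque

# Hypothetical tiny site: page -> outlinks.
LINKS = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": [],
    "c.html": [],
}

def crawl(seed: str, depth: int) -> int:
    """Return how many pages a breadth-first crawl reaches within
    `depth` link-hops of the seed page."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        page, d = frontier.popleft()
        if d == depth:
            continue  # depth budget exhausted on this branch
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, d + 1))
    return len(seen)

print(crawl("index.html", 2))                                # 4: every page within 2 hops
print(crawl("index.html", 10) == crawl("index.html", 1000))  # True: extra depth adds nothing
```

So a page count that stays flat as depth grows suggests the missing pages are being excluded for another reason (for example, the URL filter), not for lack of depth.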