Hi all,
  I tried to run indexing on 4 websites (daily news and
articles), but only a handful of web pages were indexed (when I
run crawl instead, I can index about 10 times as many pages). I
don't know what I have done wrong, or whether I need to configure
anything besides nutch-site.xml (which I copied from
nutch-default.xml). I am puzzled, even though I have read all the
available tutorials.
  By the way, I also noticed something strange: the crawler tried
to fetch robots.txt from each of the websites. Is there any way to
disable this? It still happens even though I have removed all the
agent-related parameters from nutch-site.xml.

Thanks in advance.

.
.
.
161658 http.proxy.host = null
060112 161658 http.proxy.port = 8080
060112 161658 http.timeout = 1000000
060112 161658 http.content.limit = 65536
060112 161658 http.agent = NutchCVS/0.7.1 (Nutch;
http://lucene.apache.org/nutch/bot.html;
[email protected])
060112 161658 fetcher.server.delay = 5000
060112 161658 http.max.delays = 10
060112 161659 fetching http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.thestar.com.my/robots.txt
060112 161659 fetching http://www.unimas.my/robots.txt
060112 161659 fetching http://www.nst.com.my/robots.txt
060112 161659 fetched 208 bytes from http://www.unimas.my/robots.txt
060112 161659 fetching http://www.unimas.my/
060112 161659 fetched 14887 bytes from http://www.unimas.my/
060112 161659 fetched 204 bytes from http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.bernama.com.my/
060112 161659 uncompressing....
060112 161659 fetched 3438 bytes of compressed content (expanded to
10620 bytes) from http://www.nst.com.my/robots.txt
060112 161659 fetching http://www.nst.com.my/
060112 161659 fetched 1181 bytes from http://www.bernama.com.my/
060112 161700 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060112 161701 uncompressing....
060112 161701 fetched 11183 bytes of compressed content (expanded to
43846 bytes) from http://www.nst.com.my/
060112 161703 fetched 1635 bytes from http://www.thestar.com.my/robots.txt
060112 161703 fetching http://www.thestar.com.my/
060112 161706 fetched 26712 bytes from http://www.thestar.com.my/
060112 161707 status: segment 20060112161614, 4 pages, 0 errors, 86626
bytes, 9198 ms
060112 161707 status: 0.43487716 pages/s, 73.57748 kb/s, 21656.5 bytes/page

Rgds
Bong Chih How
