[Nutch-general] HELP: Fetch only small number of pages from 4 websites

Chih How Bong Thu, 12 Jan 2006 00:45:09 -0800

Hi all,
  I tried to index 4 websites (daily news and articles), but only a
handful of pages were indexed; when I run crawl instead, I can index
about 10 times as many. I don't know what I have done wrong, or what
else I need to configure besides nutch-site.xml (which I copied from
nutch-default.xml). I am puzzled even though I have read all the
available tutorials.
  By the way, I also noticed something strange: the crawler tries to
fetch robots.txt from each of the websites. Is there any way to disable
this? I have already removed all the agent-related parameters from
nutch-site.xml.


Thanks in advance.

.
.
.
161658 http.proxy.host = null
060112 161658 http.proxy.port = 8080
060112 161658 http.timeout = 1000000
060112 161658 http.content.limit = 65536
060112 161658 http.agent = NutchCVS/0.7.1 (Nutch;
http://lucene.apache.org/nutch/bot.html;
[email protected])
060112 161658 fetcher.server.delay = 5000
060112 161658 http.max.delays = 10
060112 161659 fetching http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.thestar.com.my/robots.txt
060112 161659 fetching http://www.unimas.my/robots.txt
060112 161659 fetching http://www.nst.com.my/robots.txt
060112 161659 fetched 208 bytes from http://www.unimas.my/robots.txt
060112 161659 fetching http://www.unimas.my/
060112 161659 fetched 14887 bytes from http://www.unimas.my/
060112 161659 fetched 204 bytes from http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.bernama.com.my/
060112 161659 uncompressing....
060112 161659 fetched 3438 bytes of compressed content (expanded to
10620 bytes) from http://www.nst.com.my/robots.txt
060112 161659 fetching http://www.nst.com.my/
060112 161659 fetched 1181 bytes from http://www.bernama.com.my/
060112 161700 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060112 161701 uncompressing....
060112 161701 fetched 11183 bytes of compressed content (expanded to
43846 bytes) from http://www.nst.com.my/
060112 161703 fetched 1635 bytes from http://www.thestar.com.my/robots.txt
060112 161703 fetching http://www.thestar.com.my/
060112 161706 fetched 26712 bytes from http://www.thestar.com.my/
060112 161707 status: segment 20060112161614, 4 pages, 0 errors, 86626
bytes, 9198 ms
060112 161707 status: 0.43487716 pages/s, 73.57748 kb/s, 21656.5 bytes/page

Rgds
Bong Chih How


