Hi all,

I tried to run indexing on 4 websites (daily news and articles), but only a handful of pages were indexed. By comparison, when I run crawl I can index about ten times as many pages. I don't know what I have done wrong, or whether I need to configure anything besides nutch-site.xml (which I copied from nutch-default.xml). I am puzzled, even though I have read all the available tutorials.

By the way, I also noticed something strange: the crawler tried to fetch robots.txt from each of the websites. Is there any way to disable this? I have already removed all the agent-related parameters from nutch-site.xml.
Thanks in advance.

. . .
161658 http.proxy.host = null
060112 161658 http.proxy.port = 8080
060112 161658 http.timeout = 1000000
060112 161658 http.content.limit = 65536
060112 161658 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
060112 161658 fetcher.server.delay = 5000
060112 161658 http.max.delays = 10
060112 161659 fetching http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.thestar.com.my/robots.txt
060112 161659 fetching http://www.unimas.my/robots.txt
060112 161659 fetching http://www.nst.com.my/robots.txt
060112 161659 fetched 208 bytes from http://www.unimas.my/robots.txt
060112 161659 fetching http://www.unimas.my/
060112 161659 fetched 14887 bytes from http://www.unimas.my/
060112 161659 fetched 204 bytes from http://www.bernama.com.my/robots.txt
060112 161659 fetching http://www.bernama.com.my/
060112 161659 uncompressing....
060112 161659 fetched 3438 bytes of compressed content (expanded to 10620 bytes) from http://www.nst.com.my/robots.txt
060112 161659 fetching http://www.nst.com.my/
060112 161659 fetched 1181 bytes from http://www.bernama.com.my/
060112 161700 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060112 161701 uncompressing....
060112 161701 fetched 11183 bytes of compressed content (expanded to 43846 bytes) from http://www.nst.com.my/
060112 161703 fetched 1635 bytes from http://www.thestar.com.my/robots.txt
060112 161703 fetching http://www.thestar.com.my/
060112 161706 fetched 26712 bytes from http://www.thestar.com.my/
060112 161707 status: segment 20060112161614, 4 pages, 0 errors, 86626 bytes, 9198 ms
060112 161707 status: 0.43487716 pages/s, 73.57748 kb/s, 21656.5 bytes/page

Rgds
Bong Chih How
