Hi,

Please specify what you were doing: did you run the crawl tool, and if
so, with what -depth value? Or did you use inject and then generate and
fetch?

Please elaborate a little.
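
For reference, the two workflows look roughly like this. This is a
sketch from memory against 0.7; urls.txt and the db/segments names are
placeholders, so check the tutorial for the exact options:

  # one-shot "intranet" crawl: inject, fetch and index in one command
  bin/nutch crawl urls.txt -dir crawl.test -depth 3

  # step-by-step "whole-web" crawl
  bin/nutch admin db -create              # create the WebDB
  bin/nutch inject db -urlfile urls.txt   # seed it with your URLs
  bin/nutch generate db segments          # select URLs due for fetching
  s1=`ls -d segments/2* | tail -1`        # newest segment
  bin/nutch fetch $s1
  bin/nutch updatedb db $s1               # feed discovered links back in
  bin/nutch index $s1

Note that generate/fetch/updatedb has to be repeated once per level of
depth; a single pass only fetches the seed pages themselves, which
would explain getting far fewer pages than crawl with -depth 3 or more.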

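On the robots.txt question: as far as I know there is no switch to
disable that, and there shouldn't be. Fetching robots.txt before the
first page of each host is standard polite-crawler behaviour, and Nutch
does it regardless of the agent settings. The agent properties only
control how the crawler identifies itself, e.g. in nutch-site.xml
(MyCrawler below is just a placeholder value):

  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>

Your log still shows the default NutchCVS/0.7.1 agent string, so the
agent-related properties are simply falling back to nutch-default.xml.
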
G.

On Thu, 2006-01-12 at 16:44 +0800, Chih How Bong wrote:
> Hi all,
>   I tried to index 4 websites (daily news and articles), but only a
> handful of pages were indexed (when I run crawl instead, I can index
> roughly ten times as many pages). I don't know what I have done wrong,
> or whether I need to configure anything besides nutch-site.xml (which
> I copied from nutch-default.xml). I am puzzled, though I have read all
> the available tutorials.
>   By the way, I also noticed something strange: the crawler tried to
> fetch robots.txt from each of the websites. Is there any way I can
> disable that? I have already removed all the agent-related parameters
> from nutch-site.xml.
> 
> Thanks in advance.
> 
> .
> .
> .
> 161658 http.proxy.host = null
> 060112 161658 http.proxy.port = 8080
> 060112 161658 http.timeout = 1000000
> 060112 161658 http.content.limit = 65536
> 060112 161658 http.agent = NutchCVS/0.7.1 (Nutch;
> http://lucene.apache.org/nutch/bot.html;
> [email protected])
> 060112 161658 fetcher.server.delay = 5000
> 060112 161658 http.max.delays = 10
> 060112 161659 fetching http://www.bernama.com.my/robots.txt
> 060112 161659 fetching http://www.thestar.com.my/robots.txt
> 060112 161659 fetching http://www.unimas.my/robots.txt
> 060112 161659 fetching http://www.nst.com.my/robots.txt
> 060112 161659 fetched 208 bytes from http://www.unimas.my/robots.txt
> 060112 161659 fetching http://www.unimas.my/
> 060112 161659 fetched 14887 bytes from http://www.unimas.my/
> 060112 161659 fetched 204 bytes from http://www.bernama.com.my/robots.txt
> 060112 161659 fetching http://www.bernama.com.my/
> 060112 161659 uncompressing....
> 060112 161659 fetched 3438 bytes of compressed content (expanded to
> 10620 bytes) from http://www.nst.com.my/robots.txt
> 060112 161659 fetching http://www.nst.com.my/
> 060112 161659 fetched 1181 bytes from http://www.bernama.com.my/
> 060112 161700 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060112 161701 uncompressing....
> 060112 161701 fetched 11183 bytes of compressed content (expanded to
> 43846 bytes) from http://www.nst.com.my/
> 060112 161703 fetched 1635 bytes from http://www.thestar.com.my/robots.txt
> 060112 161703 fetching http://www.thestar.com.my/
> 060112 161706 fetched 26712 bytes from http://www.thestar.com.my/
> 060112 161707 status: segment 20060112161614, 4 pages, 0 errors, 86626
> bytes, 9198 ms
> 060112 161707 status: 0.43487716 pages/s, 73.57748 kb/s, 21656.5 bytes/page
> 
> Rgds
> Bong Chih How
> 



