Hi, I am not using the crawl tool. I am doing the whole-web crawl as it is written in the tutorial. I first created the db, injected 4 URLs from a URL file, then generated the segment and performed the fetch. Before fetching, I created nutch-site.xml as a copy of nutch-default.xml. I set the depth value to 3 in nutch-site.xml, and only then performed the fetch, which returned only <100 pages.
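For reference, the command sequence I ran is roughly the one below. This is a sketch from memory of the 0.7 whole-web tutorial, so the file name urls.txt and the exact inject flags are mine, and the segment directory name will of course differ:

  bin/nutch admin db -create              # create a fresh web database
  bin/nutch inject db -urlfile urls.txt   # inject my 4 seed URLs from the url file
  bin/nutch generate db segments          # generate a fetchlist into a new segment
  s1=`ls -d segments/2* | tail -1`        # pick up the newly created segment dir
  bin/nutch fetch $s1                     # fetch the pages listed in that segment
  bin/nutch updatedb db $s1               # update the db with the newly found links

As far as I can tell from the tutorial, the generate/fetch/updatedb steps are then repeated for further rounds.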
My second try was removing all the robot-related parameters from nutch-site.xml. That seemed to return the same result. (The parameters I mean are roughly the ones I have pasted at the bottom of this mail.) I hope this is clear. Thanks in advance.

Rgds
Bong Chih How

On 1/12/06, Gal Nitzan <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> Please specify what you were doing, i.e. did you run the crawl tool?
> What was the -depth value?
>
> Or did you use inject and then generate and fetch?
>
> Please elaborate a little.
>
> G.
>
> On Thu, 2006-01-12 at 16:44 +0800, Chih How Bong wrote:
> > Hi all,
> > I tried to index 4 websites (daily news and articles), but only a
> > handful of web pages were indexed (compared to running crawl, which
> > indexes roughly ten times as many pages). I don't know what I have
> > done wrong, or what else I need to configure besides nutch-site.xml
> > (which I copied from nutch-default.xml). I am puzzled, though I have
> > read all the available tutorials.
> > By the way, I also noticed something strange: the crawler tried to
> > fetch robots.txt from each of the websites. Is there any way I can
> > disable that, even though I have removed all the agent-related
> > parameters from nutch-site.xml?
> >
> > Thanks in advance.
> >
> > .
> > .
> > .
> > 060112 161658 http.proxy.host = null
> > 060112 161658 http.proxy.port = 8080
> > 060112 161658 http.timeout = 1000000
> > 060112 161658 http.content.limit = 65536
> > 060112 161658 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
> > 060112 161658 fetcher.server.delay = 5000
> > 060112 161658 http.max.delays = 10
> > 060112 161659 fetching http://www.bernama.com.my/robots.txt
> > 060112 161659 fetching http://www.thestar.com.my/robots.txt
> > 060112 161659 fetching http://www.unimas.my/robots.txt
> > 060112 161659 fetching http://www.nst.com.my/robots.txt
> > 060112 161659 fetched 208 bytes from http://www.unimas.my/robots.txt
> > 060112 161659 fetching http://www.unimas.my/
> > 060112 161659 fetched 14887 bytes from http://www.unimas.my/
> > 060112 161659 fetched 204 bytes from http://www.bernama.com.my/robots.txt
> > 060112 161659 fetching http://www.bernama.com.my/
> > 060112 161659 uncompressing....
> > 060112 161659 fetched 3438 bytes of compressed content (expanded to 10620 bytes) from http://www.nst.com.my/robots.txt
> > 060112 161659 fetching http://www.nst.com.my/
> > 060112 161659 fetched 1181 bytes from http://www.bernama.com.my/
> > 060112 161700 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> > 060112 161701 uncompressing....
> > 060112 161701 fetched 11183 bytes of compressed content (expanded to 43846 bytes) from http://www.nst.com.my/
> > 060112 161703 fetched 1635 bytes from http://www.thestar.com.my/robots.txt
> > 060112 161703 fetching http://www.thestar.com.my/
> > 060112 161706 fetched 26712 bytes from http://www.thestar.com.my/
> > 060112 161707 status: segment 20060112161614, 4 pages, 0 errors, 86626 bytes, 9198 ms
> > 060112 161707 status: 0.43487716 pages/s, 73.57748 kb/s, 21656.5 bytes/page
> >
> > Rgds
> > Bong Chih How
>
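P.S. For completeness, the robot/agent properties I removed from my nutch-site.xml were roughly the ones below (property names as I remember them from nutch-default.xml in 0.7.1, so please check against your copy; the values here are only placeholders, not my real ones):

  <property>
    <name>http.agent.name</name>
    <value>NutchCVS</value> <!-- placeholder agent name -->
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch</value> <!-- placeholder description -->
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://lucene.apache.org/nutch/bot.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>someone@example.com</value> <!-- placeholder contact address -->
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>NutchCVS,Nutch,*</value> <!-- agents checked against robots.txt -->
  </property>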
