Re: Newbie Questions: http.max.delays, view fetched page, view link db

Martin Kuen Tue, 29 Jan 2008 07:11:32 -0800

Hi,

On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:


>
> Hi,
>
> I am new to Nutch and I am trying to use it to fetch pages from
> specific websites. Currently I am running 0.9.
>
> As I have limited resources, I don't want Nutch to be too aggressive, so
> I want to set some delay, but I am confused by the value of
> http.max.delays: does it use milliseconds instead of seconds? (Some people
> said it is 3 seconds by default, but I see it is 1000 in crawl-tool.xml
> in nutch-0.9.)
>

"http.max.delays" doesn't specify a timespan; read its description more
carefully. It is a count: the number of times a fetcher thread will wait for
a busy host before giving up on the page for now. I think
"fetcher.server.delay" is what you are looking for. It is the minimum time,
in seconds, that the fetcher waits before it fetches another URL from the
same host. Keep in mind that the fetcher obeys robots.txt files by default,
so if a robots.txt file is present the crawl will respect it and should be
polite enough.
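
To illustrate how the two settings relate, here is a minimal Python sketch.
It is only a simplified model based on the property descriptions, not Nutch
code: the function name and the simulated clock are my own assumptions, and
the real fetcher is multi-threaded Java.

```python
def wait_for_host(host_free_at, now, server_delay, max_delays):
    """Model one fetcher thread waiting for a busy host (simulated clock).

    host_free_at: time (seconds) at which the host may be contacted again
    now:          current simulated time (seconds)
    server_delay: seconds waited per delay (stands in for fetcher.server.delay)
    max_delays:   number of waits allowed (stands in for http.max.delays)

    Returns (fetch_time, delays_used), or (None, delays_used) if the thread
    gives up on the page because it ran out of allowed waits.
    """
    delays = 0
    while now < host_free_at:
        if delays == max_delays:
            return None, delays  # give up on this page for now
        now += server_delay      # wait one server delay, then check again
        delays += 1
    return now, delays
```

In this model, raising server_delay slows down requests to a single host,
while max_delays only limits how long a thread keeps waiting before it moves
on, which is why a value such as 1000 is a count and not a duration.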


> Also, I need to read the fetched pages so that I can modify the HTML
> structure for later parsing. Where are the files located? Are they stored
> as plain HTML or are they broken down into multiple files? If they are not
> HTML files, how can I read the fetched pages?
>

If you are looking for a way to read the fetched content (e.g. HTML pages)
programmatically, have a look at the IndexReader class.
If you want to dump the whole downloaded content to a text file, or to see
some statistical information about it, try the "readseg" command.
Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions

>
> And will the cached page lose the original HTML attributes when it is
> viewed as a cached page?
>
The page will be stored character by character, including html tags.

>
> Also, how can I read the links that Nutch found, and how can I control the
> crawling sequence? (Change it to breadth-first search at the top level,
> then depth-first one by one.)
>
Crawling always occurs breadth-first. If you want fine-grained control over
the crawling sequence, you should follow the procedure in the Nutch tutorial
for "whole internet crawling". Even then, the crawling occurs
breadth-first.
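
As a rough illustration of why crawling in rounds comes out breadth-first,
here is a small Python sketch. It is a conceptual model only, not Nutch code:
the function name and the toy link graph are hypothetical, and it ignores
scoring, URL filters and any per-round limit on the number of URLs.

```python
def crawl_rounds(link_graph, seeds, depth):
    """Return the list of URLs fetched in each round, up to `depth` rounds.

    link_graph: dict mapping a URL to the list of URLs it links to
    seeds:      list of start URLs
    """
    seen = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(depth):
        if not frontier:
            break
        rounds.append(frontier)
        next_frontier = []
        # Links found in this round are only fetched in the next round,
        # so pages are visited level by level.
        for url in frontier:
            for out in link_graph.get(url, []):
                if out not in seen:
                    seen.add(out)
                    next_frontier.append(out)
        frontier = next_frontier
    return rounds
```

Each round fetches only what earlier rounds discovered, so no page at link
distance 3 from the seeds is fetched before all selected pages at distance 2.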

>
> Sorry for many questions.


HTH,

Martin

PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . .
(nice semester abroad . . . hehe ;)


> --
> View this message in context:
> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
