Re: Newbie Questions: http.max.delays, view fetched page, view link db

Vinci Tue, 29 Jan 2008 08:24:27 -0800

Hi,

Thank you :) 
One more question about reading the fetched page: I would prefer to dump the
fetched page into a single html file. Is there no way to do this other than
inverting the inverted file?



Martin Kuen wrote:
> 
> Hi,
> 
> On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:
> 
>>
>> Hi,
>>
>> I am new to nutch and I am trying to run nutch to fetch something from
>> specific websites. Currently I am running 0.9.
>>
>> As I have limited resources, I don't want nutch to be too aggressive, so I
>> want to set some delay, but I am confused by the value of http.max.delays:
>> does it use milliseconds instead of seconds? (Some people said it is 3
>> seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9)
>>
> 
> "http.max.delays" doesn't specify a timespan - read the description more
> carefully. I think "fetcher.server.delay" is what you are looking for. It
> is the minimum amount of time the fetcher will wait before it fetches
> another url from the same host. Keep in mind that the fetcher obeys
> robots.txt files (by default) - so if a robots.txt file is present the
> crawling will be "polite enough".
> 
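To make the "minimum wait per host" idea concrete, here is a minimal Python sketch. It is not Nutch code: the class name PoliteScheduler, the 5.0 second value and the example URLs are all made up for illustration, and it assumes the delay is counted from the start of the previous request to the same host.

```python
from urllib.parse import urlparse


class PoliteScheduler:
    """Hypothetical helper: tracks, per host, the earliest allowed start time."""

    def __init__(self, server_delay_seconds):
        self.server_delay = server_delay_seconds
        self.next_allowed = {}  # host -> earliest time the next request may start

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlparse(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.server_delay
        return start - now


# Illustrative delay value only; check your own configuration for the real one.
scheduler = PoliteScheduler(server_delay_seconds=5.0)
waits = [
    scheduler.wait_time("http://example.com/a", now=0.0),  # first hit on this host
    scheduler.wait_time("http://example.com/b", now=1.0),  # same host, must wait
    scheduler.wait_time("http://example.org/", now=1.0),   # different host, no wait
]
print(waits)
```

Requests to different hosts do not delay each other in this sketch; only repeated requests to the same host are spaced out.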
> 
>> Also, I need to read the fetched page so that I can modify the html
>> structure for future parsing. Where are the files located? Are they
>> stored as pure html or are they broken down into multiple files? If they
>> are not html files, how can I read the fetched page?
>>
> 
> If you are looking for a way to programmatically read the fetched content
> (e.g. html pages) have a look at the IndexReader class.
> If you are looking for a way to dump the whole downloaded content to a
> text file or want to see some statistical information about it, try the
> "readseg" command.
> Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
> 
>>
>> And will the page lose all its original html attributes when it is
>> viewed as a cached page?
>>
> The page will be stored character by character, including html tags.
> 
>>
>> Also, how can I read the links that nutch found, and how can I control the
>> crawling sequence? (change it to breadth-first search at the top level,
>> then depth-first one by one)
>>
> Crawling always occurs breadth-first. If you want fine-grained control
> over the crawling sequence you should follow the procedure in the nutch
> tutorial for "whole internet crawling". Nevertheless the crawling occurs
> breadth-first.
> 
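As a rough picture of what "breadth-first" means here, the Python sketch below walks a toy link graph one round at a time. It is not Nutch code: the function crawl_rounds and the graph are invented for illustration, and the assumption is that each round corresponds loosely to one pass of the step-by-step procedure in the tutorial, where newly discovered links are only fetched in a later round.

```python
def crawl_rounds(seeds, links, max_depth):
    """Return the list of urls visited in each round of a breadth-first crawl."""
    seen = set(seeds)
    frontier = list(seeds)
    rounds = []
    for _ in range(max_depth):
        if not frontier:
            break  # nothing left to fetch
        rounds.append(frontier)
        next_frontier = []
        for url in frontier:
            for target in links.get(url, []):
                if target not in seen:  # each url is scheduled at most once
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return rounds


# Toy link graph: page -> pages it links to (hypothetical names).
toy_links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["seed"],
}
rounds = crawl_rounds(["seed"], toy_links, max_depth=3)
print(rounds)
```

All pages at one link distance from the seed are visited before any page further away, which is why a "depth-first one by one" order does not fall out of this scheme.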
>>
>> Sorry for the many questions.
> 
> 
> HTH,
> 
> Martin
> 
> PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . .
> (nice semester abroad . . . hehe ;)
> 
> 
>> --
>> View this message in context:
>> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html
Sent from the Nutch - User mailing list archive at Nabble.com.
