Hi, thank you. :) It seems I need to write a Java program to write out the file and do the transformation. Another question about the dumped linkdb: I find escaped HTML at the end of some links. Is this the fault of the parser? (The HTML is most likely not valid, but I really don't need that chunk of invalid code.) If I want to change the link parser, what do I need to do? (I would especially prefer to change it via plugins.)
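[Editor's note: a possible stopgap while the parser question is open is to post-process the dumped links and cut each URL at the first escaped-HTML entity. This is a minimal sketch under my own assumptions - the class name and entity list are illustrative, not part of Nutch's API:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkCleaner {
    // Escaped-HTML entities that typically mark the start of trailing
    // junk in a dumped link; the list is illustrative, extend as needed.
    private static final Pattern ESCAPED =
        Pattern.compile("&(quot|lt|gt|amp|#\\d+);");

    // Cut the URL at the first escaped entity, if any.
    static String stripEscapedTail(String url) {
        Matcher m = ESCAPED.matcher(url);
        return m.find() ? url.substring(0, m.start()) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripEscapedTail(
            "http://example.com/page.html&quot;&gt;"));
        // prints: http://example.com/page.html
    }
}
```

This only trims the symptom; changing the parse-html plugin itself would fix the links at the source.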
Martin Kuen wrote:
> Hi there,
>
> On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Thank you :)
>> One more question about reading the fetched pages: I would prefer to
>> dump each fetched page into a single HTML file.
>
> You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to
> create a separate file for each downloaded file.
> You could modify the SegmentReader class
> (org.apache.nutch.segment.SegmentReader) if you want to do that.
>
>> No other way besides inverting the inverted file?
>
> The index is not inverted if you use the "readseg" command. The fetched
> content (e.g. HTML pages) is stored in the "crawl/segments" folder. The
> Lucene index is stored in "crawl/indexes". This (Lucene) index is
> created after all crawling has finished. The readseg command
> (SegmentReader class) only accesses "crawl/segments", so the Lucene
> index is not touched. Lucene index --> the inverted index
>
> Best Regards,
>
> Martin
>
>> Martin Kuen wrote:
>>> Hi,
>>>
>>> On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Nutch and I am trying to run Nutch to fetch something
>>>> from specific websites. Currently I am running 0.9.
>>>>
>>>> As I have limited resources, I don't want Nutch to be too
>>>> aggressive, so I want to set some delay, but I am confused by the
>>>> value of http.max.delays: does it use milliseconds instead of
>>>> seconds? (Some people said it is 3 seconds by default, but I see it
>>>> is 1000 in crawl-tool.xml in nutch-0.9.)
>>>
>>> "http.max.delays" doesn't specify a timespan - read the description
>>> more carefully. I think "fetcher.server.delay" is what you are
>>> looking for. It is the amount of time the fetcher will at least wait
>>> until it fetches another URL from the same host.
>>> Keep in mind that the fetcher obeys robots.txt files (by default) -
>>> so if a robots.txt file is present the crawling will occur "polite
>>> enough".
>>>
>>>> Also, I need to read the fetched pages so that I can do some
>>>> modification of the HTML structure for future parsing. Where are the
>>>> files located? Are they stored as pure HTML, or are they broken down
>>>> into multiple files? If they are not HTML files, how can I read the
>>>> fetched pages?
>>>
>>> If you are looking for a way to programmatically read the fetched
>>> content (e.g. HTML pages), have a look at the IndexReader class.
>>> If you are looking for a way to dump the whole downloaded content to
>>> a text file, or want to see some statistical information about it,
>>> try the "readseg" command.
>>> Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
>>>
>>>> And will the cached page lose all the original HTML attributes when
>>>> it is viewed as a cached page?
>>>
>>> The page will be stored character by character, including HTML tags.
>>>
>>>> Also, how can I read the links that Nutch found, and how can I
>>>> control the crawling sequence? (Change it to breadth-first search at
>>>> the top level, then depth-first one by one.)
>>>
>>> Crawling always occurs breadth-first. If you want fine-grained
>>> control over the crawling sequence you should follow the procedure in
>>> the Nutch tutorial for "whole internet crawling". Nevertheless the
>>> crawling occurs breadth-first.
>>>
>>>> Sorry for so many questions.
>>>
>>> HTH,
>>>
>>> Martin
>>>
>>> PS: polyu.edu.hk . . . greetings to the HK Polytechnic
>>> University . . . (nice semester abroad . . .
>>> hehe ;)

--
View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15175746.html
Sent from the Nutch - User mailing list archive at Nabble.com.
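[Editor's note: for the delay question discussed above, a minimal sketch of how "fetcher.server.delay" could be overridden in conf/nutch-site.xml - the 5.0-second value is an illustrative choice, not a recommendation:]

```xml
<!-- conf/nutch-site.xml: override the default politeness delay.
     fetcher.server.delay is specified in seconds; 5.0 is only an
     example value. -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same host.</description>
</property>
```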
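[Editor's note: the "readseg" command Martin mentions can be invoked roughly like this to dump fetched content without touching the Lucene index - the segment directory name below is a placeholder, substitute a real timestamped directory from crawl/segments:]

```
# Dump one segment's records (including raw fetched HTML) to a text file.
# "20080129123456" is a hypothetical segment name - use an actual
# directory found under crawl/segments.
bin/nutch readseg -dump crawl/segments/20080129123456 dumpdir
```

The dump is written under dumpdir; each record carries the page's raw content, so individual HTML files can be split out of it with a small post-processing script.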