Hi,

Thank you :)

One more question about reading the fetched page: I would prefer to dump the fetched page into a single HTML file. Is there no other way besides inverting the inverted file?

Martin Kuen wrote:
>
> Hi,
>
> On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am new to Nutch and I am trying to run it to fetch something from
>> specific websites. Currently I am running 0.9.
>>
>> As I have limited resources, I don't want Nutch to be too aggressive, so I
>> want to set some delay, but I am confused by the value of http.max.delays:
>> does it use milliseconds instead of seconds? (Some people said it is 3
>> seconds by default, but I see it is 1000 in crawl-tool.xml in nutch-0.9.)
>
> "http.max.delays" doesn't specify a timespan - read the description more
> carefully. I think "fetcher.server.delay" is what you are looking for. It is
> the minimum amount of time the fetcher will wait before it fetches another
> URL from the same host. Keep in mind that the fetcher obeys robots.txt files
> (by default) - so if a robots.txt file is present the crawling will be
> "polite enough".
>
>> Also, I need to read the fetched pages so that I can modify the HTML
>> structure for future parsing. Where are the files located? Are they
>> stored as pure HTML, or are they broken down into multiple files? If they
>> are not HTML files, how can I read the fetched pages?
>
> If you are looking for a way to programmatically read the fetched content
> (e.g. HTML pages), have a look at the IndexReader class.
> If you are looking for a way to dump the whole downloaded content to a text
> file, or want to see some statistical information about it, try the
> "readseg" command.
> Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
>
>> And will the cached page lose all the original HTML attributes when it is
>> viewed as a cached page?
>
> The page will be stored character by character, including HTML tags.
>
>> Also, how can I read the links that Nutch found, and how can I control the
>> crawling sequence? (Change it to breadth-first search at the top level,
>> then depth-first one by one.)
>
> Crawling always occurs breadth-first. If you want fine-grained control over
> the crawling sequence you should follow the procedure in the Nutch tutorial
> for "whole internet crawling". Nevertheless, the crawling occurs
> breadth-first.
>
>> Sorry for the many questions.
>
> HTH,
>
> Martin
>
> PS: polyu.edu.hk . . . greetings to the HK Polytechnic University . . .
> (nice semester abroad . . . hehe ;)
>
>> --
>> View this message in context:
>> http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15156228.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.

--
View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15163086.html
Sent from the Nutch - User mailing list archive at Nabble.com.
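
As a rough illustration of the "fetcher.server.delay" suggestion in the quoted reply, the delay could be set by overriding the property in conf/nutch-site.xml. This is only a sketch: the value 5.0 is an arbitrary example, and it assumes the property is read in seconds, as its description in nutch-default.xml says. It is worth checking that description in your own copy before relying on it.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Example value only; assumed to be in seconds. -->
    <value>5.0</value>
    <description>Minimum wait between successive requests to the same
    host.</description>
  </property>
</configuration>
```

Note that http.max.delays is left untouched here, since, as the reply points out, it does not specify a timespan.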