Hi, thank you. :) It seems I need to write a Java program to write out the file and do the transformation. Another question about the dumped linkdb: I find escaped HTML at the end of some links. Is this the fault of the parser? (The HTML is most likely not valid, but I really don't need that chunk of invalid code.) If I want to change the link parser, what do I need to do? (I would especially prefer to change it via plugins.)
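[Editor's note: a possible stopgap while the parser question is open is to post-process the dumped links and cut each URL at the first escaped-HTML entity. This is a minimal sketch under my own assumptions - the class name and entity list are illustrative, not part of Nutch's API:]

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkCleaner {
    // Escaped-HTML entities that typically mark the start of trailing
    // junk in a dumped link; the list is illustrative, extend as needed.
    private static final Pattern ESCAPED =
        Pattern.compile("&(quot|lt|gt|amp|#\\d+);");

    // Cut the URL at the first escaped entity, if any.
    static String stripEscapedTail(String url) {
        Matcher m = ESCAPED.matcher(url);
        return m.find() ? url.substring(0, m.start()) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripEscapedTail(
            "http://example.com/page.html&quot;&gt;"));
        // prints: http://example.com/page.html
    }
}
```

This only trims the symptom; changing the parse-html plugin itself would fix the links at the source.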
Martin Kuen wrote:
> Hi there,
>
> On Jan 29, 2008 5:23 PM, Vinci <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Thank you :)
>> One more question about reading the fetched pages: I would prefer to
>> dump each fetched page into a single HTML file.
>
> You could modify the Fetcher class (org.apache.nutch.fetch.Fetcher) to
> create a separate file for each downloaded file.
> You could modify the SegmentReader class
> (org.apache.nutch.segment.SegmentReader) if you want to do that.
>
>> No other way besides inverting the inverted file?
>
> The index is not inverted if you use the "readseg" command. The fetched
> content (e.g. HTML pages) is stored in the "crawl/segments" folder. The
> Lucene index is stored in "crawl/indexes". This (Lucene) index is
> created after all crawling has finished. The readseg command
> (SegmentReader class) only accesses "crawl/segments", so the Lucene
> index is not touched. Lucene index --> the inverted index
>
> Best Regards,
>
> Martin
>
>> Martin Kuen wrote:
>>> Hi,
>>>
>>> On Jan 29, 2008 11:11 AM, Vinci <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Nutch and I am trying to run Nutch to fetch something
>>>> from specific websites. Currently I am running 0.9.
>>>>
>>>> As I have limited resources, I don't want Nutch to be too
>>>> aggressive, so I want to set some delay, but I am confused by the
>>>> value of http.max.delays: does it use milliseconds instead of
>>>> seconds? (Some people said it is 3 seconds by default, but I see it
>>>> is 1000 in crawl-tool.xml in nutch-0.9.)
>>>
>>> "http.max.delays" doesn't specify a timespan - read the description
>>> more carefully. I think "fetcher.server.delay" is what you are
>>> looking for. It is the amount of time the fetcher will at least wait
>>> until it fetches another URL from the same host.
>>> Keep in mind that the fetcher obeys robots.txt files (by default) -
>>> so if a robots.txt file is present the crawling will occur "polite
>>> enough".
>>>
>>>> Also, I need to read the fetched pages so that I can do some
>>>> modification of the HTML structure for future parsing. Where are the
>>>> files located? Are they stored as pure HTML, or are they broken down
>>>> into multiple files? If they are not HTML files, how can I read the
>>>> fetched pages?
>>>
>>> If you are looking for a way to programmatically read the fetched
>>> content (e.g. HTML pages), have a look at the IndexReader class.
>>> If you are looking for a way to dump the whole downloaded content to
>>> a text file, or want to see some statistical information about it,
>>> try the "readseg" command.
>>> Check out this link: http://wiki.apache.org/nutch/08CommandLineOptions
>>>
>>>> And will the cached page lose all the original HTML attributes when
>>>> it is viewed as a cached page?
>>>
>>> The page will be stored character by character, including HTML tags.
>>>
>>>> Also, how can I read the links that Nutch found, and how can I
>>>> control the crawling sequence? (Change it to breadth-first search at
>>>> the top level, then depth-first one by one.)
>>>
>>> Crawling always occurs breadth-first. If you want fine-grained
>>> control over the crawling sequence you should follow the procedure in
>>> the Nutch tutorial for "whole internet crawling". Nevertheless the
>>> crawling occurs breadth-first.
>>>
>>>> Sorry for so many questions.
>>>
>>> HTH,
>>>
>>> Martin
>>>
>>> PS: polyu.edu.hk . . . greetings to the HK Polytechnic
>>> University . . . (nice semester abroad . . .
>>> hehe ;)

--
View this message in context: http://www.nabble.com/Newbie-Questions%3A-http.max.delays%2C-view-fetched-page%2C-view-link-db-tp15156228p15175746.html
Sent from the Nutch - User mailing list archive at Nabble.com.
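[Editor's note: for the delay question discussed above, a minimal sketch of how "fetcher.server.delay" could be overridden in conf/nutch-site.xml - the 5.0-second value is an illustrative choice, not a recommendation:]

```xml
<!-- conf/nutch-site.xml: override the default politeness delay.
     fetcher.server.delay is specified in seconds; 5.0 is only an
     example value. -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds the fetcher waits between successive requests
  to the same host.</description>
</property>
```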
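[Editor's note: the "readseg" command Martin mentions can be invoked roughly like this to dump fetched content without touching the Lucene index - the segment directory name below is a placeholder, substitute a real timestamped directory from crawl/segments:]

```
# Dump one segment's records (including raw fetched HTML) to a text file.
# "20080129123456" is a hypothetical segment name - use an actual
# directory found under crawl/segments.
bin/nutch readseg -dump crawl/segments/20080129123456 dumpdir
```

The dump is written under dumpdir; each record carries the page's raw content, so individual HTML files can be split out of it with a small post-processing script.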