Oh sorry, I just saw CrawlDbReader, which has several methods, one in particular for retrieving content based on a URL.
//Marcus

On Tue, Jun 17, 2008 at 7:57 PM, Marcus Herou <[EMAIL PROTECTED]> wrote:

> You can fetch it but it is not pretty.
>
> It is just a SequenceFileInputFormat:
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
>
> Look in the org.apache.nutch.crawl.Crawl class and specifically how it uses
> the Indexer.
>
> Kindly
>
> //Marcus
>
>
> On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
> wrote:
>
>>
>> oh, you are right.
>> thanks
>>
>>
>> POIRIER David wrote:
>> >
>> > When executing a crawl, Nutch creates segments, based on the crawl
>> > depth if I'm not mistaken, in which the fetched content is stored. For
>> > example, if crawling a web site named site-xyz into the directory
>> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
>> > following directory: $nutch_home/crawls/crawl-xyz/segments. For each
>> > segment directory you will find a content directory.
>> >
>> > To be honest, I don't think you can directly access the stored content
>> > found in those directories, the idea being to index it and not
>> > necessarily store it.
>> >
>> > David
>> >
>> >
>> > -----Original Message-----
>> > From: beansproud [mailto:[EMAIL PROTECTED]
>> > Sent: Monday, 16 June 2008 16:42
>> > To: [email protected]
>> > Subject: where nutch store crawled data
>> >
>> >
>> > Hi,
>> > I'm new to Nutch. When I use Nutch to crawl pages, I can get
>> > the crawled data by using the command: nutch readseg.
>> > My question is: can I get the data directly? I just can't find where
>> > Nutch puts it.
>> > Can anybody tell me?
>> > Thanks very much!
>> > --
>> > View this message in context:
>> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
>> > Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [EMAIL PROTECTED]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
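
For anyone who wants to see the layout David describes, here is a rough sketch in plain Java that lists the content directory of each segment under a crawl directory. It only walks the filesystem. The files inside content are Hadoop sequence-file data (per the SequenceFileInputFormat note above), and this sketch does not try to read them. The class name SegmentContentLister, the method contentDirs and the default path crawls/crawl-xyz are made up for illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Nutch class: it only inspects the directory layout.
public class SegmentContentLister {

    /** Returns the "content" directory of each segment under crawlDir/segments, sorted by path. */
    public static List<Path> contentDirs(Path crawlDir) throws IOException {
        Path segments = crawlDir.resolve("segments");
        List<Path> result = new ArrayList<>();
        if (!Files.isDirectory(segments)) {
            return result; // no segments directory, so nothing has been fetched here
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(segments)) {
            for (Path segment : stream) {
                Path content = segment.resolve("content");
                // Skip segments that have no content directory.
                if (Files.isDirectory(content)) {
                    result.add(content);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default, mirroring the crawl-xyz example in the thread.
        Path crawlDir = Paths.get(args.length > 0 ? args[0] : "crawls/crawl-xyz");
        for (Path dir : contentDirs(crawlDir)) {
            System.out.println(dir);
        }
    }
}
```

Run it with the crawl directory as the only argument, for example java SegmentContentLister $nutch_home/crawls/crawl-xyz. Getting at the records inside those directories still needs the Hadoop and Nutch classes mentioned above.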
