My team is working on a Streaming.jar for Nutch that outputs the crawled pages in JSON format. Hopefully we'll be able to share it once we know it is solid. This way you can send the crawled data to programs written in any language.
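
In the meantime, here is a rough sketch (not our jar, just an illustration) of locating the raw content files that the thread below talks about. It assumes each segment keeps its content under content/part-NNNNN/data, which may differ between Nutch versions, and the function names are made up for this example. The only thing it checks is the "SEQ" magic that Hadoop SequenceFiles start with, followed by a version byte. It does not decode the records; for that you still need the Hadoop and Nutch classes on the Java side.

```python
from pathlib import Path

# Hadoop SequenceFiles begin with the bytes "SEQ" followed by one version byte.
SEQ_MAGIC = b"SEQ"


def find_content_files(segments_dir):
    """Yield (segment_name, data_file) for every content data file found.

    Assumption: each segment stores its content as content/part-NNNNN/data.
    """
    for segment in sorted(Path(segments_dir).iterdir()):
        content = segment / "content"
        if not content.is_dir():
            continue  # not a segment, or a segment without fetched content
        for data_file in sorted(content.glob("part-*/data")):
            yield segment.name, data_file


def sequence_file_version(path):
    """Return the SequenceFile version byte, or None if the header does not match."""
    with open(path, "rb") as fh:
        header = fh.read(4)
    if len(header) == 4 and header[:3] == SEQ_MAGIC:
        return header[3]
    return None
```

You would call find_content_files() on something like $nutch_home/crawls/crawl-xyz/segments and pass each returned path to sequence_file_version() to confirm it really is a SequenceFile before handing it to a Hadoop job.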

On Tue, Jun 17, 2008 at 10:57 AM, Marcus Herou <[EMAIL PROTECTED]> wrote:
> You can fetch it but it is not pretty.
>
> It is just a SequenceFileInputFormat:
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
>
> Look in the org.apache.nutch.crawl.Crawl class and specifically how it uses
> the Indexer.
>
> Kindly
>
> //Marcus
>
> On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
> wrote:
>
>> oh, you are right.
>> thanks
>>
>> POIRIER David wrote:
>> >
>> > When executing a crawl, Nutch creates segments, based on the crawl
>> > depth if I'm not mistaken, in which the fetched content is stored. For
>> > example, if crawling a web site named site-xyz into the directory
>> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
>> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
>> > segment directory you will find a content directory.
>> >
>> > To be honest, I don't think you can directly access the stored content
>> > found in those directories, the idea being to index it and not
>> > necessarily store it.
>> >
>> > David
>> >
>> > -----Original Message-----
>> > From: beansproud [mailto:[EMAIL PROTECTED]
>> > Sent: Monday, 16 June 2008 16:42
>> > To: [email protected]
>> > Subject: where nutch store crawled data
>> >
>> > Hi,
>> > I'm new to Nutch. When I use Nutch to crawl pages, I can get
>> > the crawled data by using the command: nutch readseg.
>> > My question is: can I get the data directly? I just can't find where
>> > Nutch puts it.
>> > Can anybody tell me?
>> > Thanks very much!
>> > --
>> > View this message in context:
>> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
>> > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/

--
Chris Anderson
http://jchris.mfdz.com
