And I'm working on a solution to use HBase as a backend :)

On Tue, Jun 17, 2008 at 8:01 PM, Chris Anderson <[EMAIL PROTECTED]> wrote:
> My team is working on a Streaming.jar for Nutch that outputs the
> crawled pages in a JSON format. Hopefully we'll be able to share it
> once we know it is solid. This way you can send the crawled data to
> programs written in any language.
>
> On Tue, Jun 17, 2008 at 10:57 AM, Marcus Herou
> <[EMAIL PROTECTED]> wrote:
> > You can fetch it but it is not pretty.
> >
> > It is just a SequenceFileInputFormat:
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
> >
> > Look in the org.apache.nutch.crawl.Crawl class and specifically how it
> > uses the Indexer.
> >
> > Kindly
> >
> > //Marcus
> >
> > On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
> > wrote:
> >
> >> oh, you are right.
> >> thanks
> >>
> >> POIRIER David wrote:
> >> >
> >> > When executing a crawl, Nutch creates segments, based on the crawl
> >> > depth if I'm not mistaken, in which the fetched content is stored. For
> >> > example, if crawling a web site named site-xyz into the directory
> >> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> >> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> >> > segment directory you will find a content directory.
> >> >
> >> > To be honest, I don't think you can directly access the stored content
> >> > found in those directories, the idea being to index it and not
> >> > necessarily store it.
> >> >
> >> > David
> >> >
> >> > -----Original Message-----
> >> > From: beansproud [mailto:[EMAIL PROTECTED]
> >> > Sent: Monday, 16 June 2008 16:42
> >> > To: [email protected]
> >> > Subject: where nutch store crawled data
> >> >
> >> > Hi,
> >> > I'm new to Nutch. When I use Nutch to crawl pages, I can get the
> >> > crawled data by using the command: nutch readseg.
> >> > My question is: can I get the data directly? I just can't find where
> >> > Nutch puts it.
> >> > Can anybody tell me?
> >> > Thanks very much!
> >> > --
> >> > View this message in context:
> >> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> >> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > [EMAIL PROTECTED]
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
>
> --
> Chris Anderson
> http://jchris.mfdz.com

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
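
To make the directory layout David describes a bit more concrete, here is a minimal sketch that only locates the per-segment content directories under a crawl directory. It assumes the layout from the thread (<crawl_dir>/segments/<segment>/content); the function name find_content_dirs is made up for illustration and is not part of Nutch. It does not decode anything: as noted above, the files inside are in Hadoop's SequenceFile format, so reading them still needs Hadoop/Nutch tooling.

```python
from pathlib import Path


def find_content_dirs(crawl_dir):
    """Return the `content` directory of every segment under crawl_dir/segments.

    Assumption (taken from the thread, not verified against any specific
    Nutch version): the layout is <crawl_dir>/segments/<segment>/content.
    `find_content_dirs` is a hypothetical helper, not a Nutch API.
    """
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        # No segments directory: nothing has been crawled into this path.
        return []

    content_dirs = []
    # Sort by name so the result order is stable across runs.
    for segment in sorted(segments_root.iterdir()):
        content = segment / "content"
        if content.is_dir():
            content_dirs.append(content)
    return content_dirs


if __name__ == "__main__":
    import sys

    # Usage: python find_content.py /path/to/crawls/crawl-xyz
    crawl_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_content_dirs(crawl_dir):
        print(path)
```

Pointed at something like $nutch_home/crawls/crawl-xyz, this prints one content path per segment; the parent of each path is a segment directory of the kind the nutch readseg command mentioned in the original question works on.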
