And I'm working on a solution to use HBase as a backend :)

On Tue, Jun 17, 2008 at 8:01 PM, Chris Anderson <[EMAIL PROTECTED]> wrote:

> My team is working on a Streaming.jar for Nutch that outputs the
> crawled pages in a JSON format. Hopefully we'll be able to share it
> once we know it is solid. This way you can send the crawled data to
> programs written in any language.
>
> On Tue, Jun 17, 2008 at 10:57 AM, Marcus Herou
> <[EMAIL PROTECTED]> wrote:
> > You can fetch it but it is not pretty.
> >
> > It is just a SequenceFileInputFormat:
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
> >
> > Look in the org.apache.nutch.crawl.Crawl class and specifically how it uses
> > the Indexer.
> >
> > Kindly
> >
> > //Marcus
> >
> > On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
> > wrote:
> >
> >>
> >> oh, you are right.
> >> thanks
> >>
> >>
> >> POIRIER David wrote:
> >> >
> >> > When executing a crawl, Nutch creates segments, based on the crawl
> >> > depth if I'm not mistaken, in which the fetched content is stored. For
> >> > example, if crawling a web site named site-xyz into the directory
> >> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> >> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> >> > segment directory you will find a content directory.
> >> >
> >> > To be honest, I don't think you can directly access the stored content
> >> > found in those directories, the idea being to index it and not
> >> > necessarily store it.
> >> >
> >> > David
> >> >
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: beansproud [mailto:[EMAIL PROTECTED]
> >> > Sent: Monday, 16 June 2008 16:42
> >> > To: [email protected]
> >> > Subject: where nutch store crawled data
> >> >
> >> >
> >> > Hi,
> >> >     I'm new to Nutch. When I use Nutch to crawl pages, I can get
> >> > the crawled data by using the command: nutch readseg.
> >> >     My question is: can I get the data directly? I just can't find where
> >> > Nutch puts it.
> >> >     Can anybody tell me?
> >> >     Thanks very much!
> >> > --
> >> > View this message in context:
> >> >
> >> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> >> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >> >
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > [EMAIL PROTECTED]
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
>
>
> --
> Chris Anderson
> http://jchris.mfdz.com
>
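
For anyone who just wants to locate the stored data on disk, the layout David describes above (crawl directory, then segments, then a content directory per segment) could be walked with something like the following. This is only a rough sketch: the function name find_content_dirs and the default path are made up for illustration, and it only finds the directories. As Marcus notes, the files inside are in Hadoop's SequenceFile format, so actually reading them still needs Hadoop's reader classes or nutch readseg.

```python
from pathlib import Path


def find_content_dirs(crawl_dir):
    """Return the 'content' directory of each segment under <crawl_dir>/segments.

    Hypothetical helper: it assumes the layout described in this thread
    (<crawl_dir>/segments/<segment>/content) and does not parse the data.
    """
    segments = Path(crawl_dir) / "segments"
    if not segments.is_dir():
        return []
    # Keep only segments that actually contain a content directory.
    return sorted(
        seg / "content"
        for seg in segments.iterdir()
        if (seg / "content").is_dir()
    )


if __name__ == "__main__":
    import sys

    # The default path mirrors the example name used in the thread.
    crawl = sys.argv[1] if len(sys.argv) > 1 else "crawls/crawl-xyz"
    for path in find_content_dirs(crawl):
        print(path)
```

Run it with the crawl directory as the only argument; it prints one content path per segment and nothing if the segments directory is missing.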



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
