Re: where nutch store crawled data

Otis Gospodnetic Tue, 17 Jun 2008 21:50:20 -0700

Hi,

Both of you should open some JIRA issues and upload your patches there as you 
progress, so others can see the direction you are headed and make suggestions 
when appropriate.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Marcus Herou <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, June 17, 2008 2:03:43 PM
> Subject: Re: where nutch store crawled data
> 
> And I'm working on a solution to use HBase as backend :)
> 
> On Tue, Jun 17, 2008 at 8:01 PM, Chris Anderson wrote:
> 
> > My team is working on a Streaming.jar for Nutch that outputs the
> > crawled pages in JSON format. Hopefully we'll be able to share it
> > once we know it is solid. This way you can send the crawled data to
> > programs written in any language.
> >
> > On Tue, Jun 17, 2008 at 10:57 AM, Marcus Herou
> > wrote:
> > > You can fetch it but it is not pretty.
> > >
> > > It is just a SequenceFileInputFormat:
> > >
> > 
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html
> > >
> > > Look in the org.apache.nutch.crawl.Crawl class and specifically how it uses
> > > the Indexer.
> > >
> > > Kindly
> > >
> > > //Marcus
> > >
> > > On Tue, Jun 17, 2008 at 3:57 PM, beansproud 
> > > wrote:
> > >
> > >>
> > >> oh, you are right.
> > >> thanks
> > >>
> > >>
> > >> POIRIER David wrote:
> > >> >
> > >> > When executing a crawl, Nutch creates segments, based on the crawl
> > >> > depth if I'm not mistaken, in which the fetched content is stored. For
> > >> > example, if crawling a web site named site-xyz into the directory
> > >> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> > >> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> > >> > segment directory you will find a content directory.
> > >> >
> > >> > To be honest, I don't think you can directly access the stored content
> > >> > found in those directories, the idea being to index it and not
> > >> > necessarily store it.
> > >> >
> > >> > David
> > >> >
> > >> >
> > >> >
> > >> > -----Original Message-----
> > >> > From: beansproud [mailto:[EMAIL PROTECTED]
> > >> > Sent: Monday, 16 June 2008 16:42
> > >> > To: [email protected]
> > >> > Subject: where nutch store crawled data
> > >> >
> > >> >
> > >> > Hi,
> > >> >     I'm new to Nutch. When I use Nutch to crawl pages, I can get
> > >> > the crawled data by using the command: nutch readseg.
> > >> >     My question is: can I get the data directly? I just can't find where
> > >> > Nutch puts it.
> > >> >     Can anybody tell me?
> > >> >     Thanks very much!
> > >> > --
> > >> > View this message in context:
> > >> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> > >> > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >> >
> > >> >
> > >> >
> > >>
> > >> --
> > >> View this message in context:
> > >> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>
> > >>
> > >
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > [EMAIL PROTECTED]
> > > http://www.tailsweep.com/
> > > http://blogg.tailsweep.com/
> > >
> >
> >
> >
> > --
> > Chris Anderson
> > http://jchris.mfdz.com
> >
> 
> 
> 
> -- 
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [EMAIL PROTECTED]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
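
For the layout David describes in the thread (a segments directory under the crawl directory, with a content directory in each segment), a minimal sketch of locating those content directories might look like the following. The function name find_content_dirs is hypothetical, and the sketch only finds the directories; it does not decode the Hadoop SequenceFile data stored in them, for which nutch readseg or Hadoop's own readers would still be needed.

```python
from pathlib import Path


def find_content_dirs(crawl_dir):
    """Return the content directory of every segment under crawl_dir/segments.

    Assumes the layout described in the thread:
    <crawl_dir>/segments/<segment>/content
    """
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        # No segments directory: nothing has been crawled into this location.
        return []
    # Keep only segments that actually contain a content subdirectory.
    return sorted(
        segment / "content"
        for segment in segments_root.iterdir()
        if (segment / "content").is_dir()
    )
```

Calling find_content_dirs with the path of a crawl directory such as $nutch_home/crawls/crawl-xyz would return one path per segment that has fetched content, or an empty list if there is no segments directory.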
