You can fetch it directly, but it is not pretty.

The segment data is just Hadoop sequence files, readable with SequenceFileInputFormat:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html

Look at the org.apache.nutch.crawl.Crawl class, specifically at how it uses
the Indexer.
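
To make the directory layout David describes below concrete, here is a
minimal Python sketch (not part of Nutch) that lists the content directory
of each segment under a crawl directory. The function name find_content_dirs
and the NUTCH_HOME environment variable are assumptions for illustration,
and crawl-xyz is just the example name from the thread. The files inside
those directories are binary Hadoop files, so this only locates them and
does not decode them.

```python
import os
from pathlib import Path


def find_content_dirs(crawl_dir):
    """Return the 'content' directory of every segment under crawl_dir/segments.

    Hypothetical helper: it only inspects the directory layout described in
    the thread and does not read the Hadoop files themselves.
    """
    segments_root = Path(crawl_dir) / "segments"
    if not segments_root.is_dir():
        return []
    # Keep only segments that actually contain a 'content' subdirectory.
    return sorted(
        seg / "content"
        for seg in segments_root.iterdir()
        if (seg / "content").is_dir()
    )


if __name__ == "__main__":
    # NUTCH_HOME is an assumed environment variable, not something Nutch requires.
    nutch_home = os.environ.get("NUTCH_HOME", ".")
    crawl_dir = Path(nutch_home) / "crawls" / "crawl-xyz"
    for content_dir in find_content_dirs(crawl_dir):
        print(content_dir)
```

To see what is actually stored there, nutch readseg is still the easier
route.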

Kindly

//Marcus

On Tue, Jun 17, 2008 at 3:57 PM, beansproud <[EMAIL PROTECTED]>
wrote:

>
> oh, you are right.
> thanks
>
>
> POIRIER David wrote:
> >
> > When executing a crawl, Nutch creates segments, based on the crawl
> > depth if I'm not mistaken, in which the fetched content is stored. For
> > example, if crawling a web site named site-xyz into the directory
> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> > segment directory you will find a content directory.
> >
> > To be honest, I don't think you can directly access the stored content
> > found in those directories, the idea being to index it and not
> > necessarily store it.
> >
> > David
> >
> >
> >
> > -----Original Message-----
> > From: beansproud [mailto:[EMAIL PROTECTED]
> > Sent: Monday, 16 June 2008 16:42
> > To: [email protected]
> > Subject: where nutch store crawled data
> >
> >
> > Hi,
> >     I'm new to Nutch. When I use Nutch to crawl pages, I can get the
> > crawled data by using the command: nutch readseg.
> >     My question is: can I get the data directly? I just can't find where
> > Nutch puts it.
> >     Can anybody tell me?
> >     Thanks very much!
> > --
> > View this message in context:
> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
