I have a follow-up question: is it possible to write directly to the Crawl
DB? I have several million HTML pages stored in a single concatenated flat
file, and I'd like to just run a utility over them to feed them to Nutch
parsing/indexing rather than have to dump them as individual files.
Looking at the API documentation, I couldn't find any obvious way to do this.

I've no idea whether the fetch -> crawldb step does the parsing and URL
extraction before it writes anyway. If it's not possible, then it doesn't
matter, but if it is possible, it would save having to write out lots of files.
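
To make the idea concrete, here is a minimal Python sketch of the kind of
utility I mean. It only covers reading the flat file one page at a time; the
step that would hand each page to Nutch is left out, since that is the part I
am asking about. The "###URL### " marker line and the iter_pages name are
made-up placeholders for illustration, not anything from Nutch, and the real
record layout may well differ.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple

# ASSUMPTION: every page in the flat file is preceded by a marker line of the
# form "###URL### <url>". This layout is hypothetical; adapt it to the real file.
RECORD_MARKER = "###URL### "


def iter_pages(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) pairs one at a time without loading the whole file."""
    url: Optional[str] = None
    lines: List[str] = []
    for line in stream:
        if line.startswith(RECORD_MARKER):
            # A new record starts: emit the previous one, if there was one.
            if url is not None:
                yield url, "".join(lines)
            url = line[len(RECORD_MARKER):].strip()
            lines = []
        elif url is not None:
            # Body line of the current record (text before the first
            # marker is ignored).
            lines.append(line)
    # Emit the last record once the stream is exhausted.
    if url is not None:
        yield url, "".join(lines)


if __name__ == "__main__":
    sample = io.StringIO(
        "###URL### http://example.com/a\n"
        "<html><body>A</body></html>\n"
        "###URL### http://example.com/b\n"
        "<html><body>B</body></html>\n"
    )
    for page_url, html in iter_pages(sample):
        # Placeholder for whatever would pass the page on to Nutch.
        print(page_url, len(html))
```

Because it is a generator, only one page is held in memory at a time, which
matters with several million pages.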

Winton



On Tue, Jun 17, 2008 at 6:57 AM, beansproud <[EMAIL PROTECTED]>
wrote:

>
> oh, you are right.
> thanks
>
>
> POIRIER David wrote:
> >
> > When executing a crawl, Nutch creates segments, based on the crawl
> > depth if I'm not mistaken, in which the fetched content is stored. For
> > example, if crawling a web site named site-xyz into the directory
> > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> > segment directory you will find a content directory.
> >
> > To be honest, I don't think you can directly access the stored content
> > found in those directories, the idea being to index it and not
> > necessarily store it.
> >
> > David
> >
> >
> >
> > -----Original Message-----
> > From: beansproud [mailto:[EMAIL PROTECTED]
> > Sent: Monday, 16 June 2008 16:42
> > To: [email protected]
> > Subject: where nutch store crawled data
> >
> >
> > Hi,
> >     I'm new to Nutch. When I use Nutch to crawl pages, I can get
> > the crawled data by using the command: nutch readseg.
> >     My question is: can I get the data directly? I just can't find where
> > Nutch puts it.
> >     Can anybody tell me?
> >     Thanks very much!
> > --
> > View this message in context:
> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
