Look at how the Fetcher class writes a segment and write yours the same way.
Then you can skip the generate, fetch and updatedb steps and just run merge
segments and index.
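
Before writing segments by hand, it may help to look at what an existing crawl
leaves on disk. The following is a minimal Python sketch, assuming the layout
described in the quoted message below (a crawl directory containing
segments/<segment>/content). The function name and the example path are
illustrative only and are not part of Nutch.

```python
from pathlib import Path
from typing import Dict, List


def list_segment_parts(crawl_dir: str) -> Dict[str, List[str]]:
    """Map each segment directory name to the sorted names of its subdirectories."""
    segments_dir = Path(crawl_dir) / "segments"
    layout: Dict[str, List[str]] = {}
    if not segments_dir.is_dir():
        # No segments directory: nothing has been fetched into this crawl yet.
        return layout
    for segment in sorted(p for p in segments_dir.iterdir() if p.is_dir()):
        # Each subdirectory (content, parse_text, ...) is one part of the segment.
        layout[segment.name] = sorted(c.name for c in segment.iterdir() if c.is_dir())
    return layout


if __name__ == "__main__":
    # Hypothetical path; substitute your own crawl directory.
    for name, parts in list_segment_parts("crawls/crawl-xyz").items():
        print(name, parts)
```

A segment produced by a normal fetch and parse typically holds content,
crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text. The files
inside are Hadoop MapFile/SequenceFile data rather than plain HTML, which is
why nutch readseg is the usual way to read them.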

/M

On Tue, Jun 17, 2008 at 7:02 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> I have a follow-up question: is it possible to write directly to the Crawl
> DB? I have several million HTML pages stored in a single concatenated flat
> file, and I'd like to just run a utility over them to feed them to Nutch
> parsing/indexing rather than having to dump them as individual files.
> Looking at the API documentation, I couldn't find any obvious capabilities.
>
> I've no idea whether the fetch -> crawldb step does the parse and URL
> extraction before it writes anyway. If it's not possible, it doesn't matter,
> but if it is, it would save having to write out lots of files.
>
> Winton
>
>
>
> On Tue, Jun 17, 2008 at 6:57 AM, beansproud <[EMAIL PROTECTED]>
> wrote:
>
> >
> > oh, you are right.
> > thanks
> >
> >
> > POIRIER David wrote:
> > >
> > > > When executing a crawl, Nutch creates segments (based on the crawl
> > > > depth, if I'm not mistaken) in which the fetched content is stored. For
> > > > example, if you crawl a web site named site-xyz into the directory
> > > > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> > > > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> > > > segment directory you will find a content directory.
> > > >
> > > > To be honest, I don't think you can directly access the stored content
> > > > found in those directories, the idea being to index it and not
> > > > necessarily to store it.
> > >
> > > David
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: beansproud [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, 16 June 2008 16:42
> > > To: [email protected]
> > > Subject: where nutch store crawled data
> > >
> > >
> > > Hi,
> > > >     I'm new to Nutch. When I use Nutch to crawl pages, I can get the
> > > > crawled data by using the command: nutch readseg.
> > > >     My question is: can I get the data directly? I just can't find
> > > > where Nutch puts it.
> > > >     Can anybody tell me?
> > > >     Thanks very much!
> > > --
> > > View this message in context:
> > >
> http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961
> > > .html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > >
> > >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
