Look at how the Fetcher class writes a segment. Then you can skip the generate, fetch, and updatedb steps and just run merge segments and index.
/M

On Tue, Jun 17, 2008 at 7:02 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> I have a follow-up question: is it possible to write directly to the
> Crawl DB? I have several million HTML pages stored in a single
> concatenated flat file, and I'd like to just run a utility over them to
> feed them to Nutch parsing/indexing rather than having to dump them as
> individual files. Looking at the API documentation, I couldn't find any
> obvious capabilities.
>
> I've no idea whether the fetch -> crawldb step does the parse and URL
> extraction before it writes anyway. If it's not possible, then it doesn't
> matter, but if it is possible, it would save having to write out lots of
> files.
>
> Winton
>
> On Tue, Jun 17, 2008 at 6:57 AM, beansproud <[EMAIL PROTECTED]> wrote:
>
> > oh, you are right.
> > thanks
> >
> > POIRIER David wrote:
> > >
> > > When executing a crawl, Nutch creates segments, based on the crawl
> > > depth if I'm not mistaken, in which the fetched content is stored. For
> > > example, if crawling a web site named site-xyz into the directory
> > > $nutch_home/crawls/crawl-xyz, you will find the segments in the
> > > following directory: $nutch_home/crawls/crawl-xyz/segments. In each
> > > segment directory you will find a content directory.
> > >
> > > To be honest, I don't think you can directly access the stored content
> > > found in those directories, the idea being to index it and not
> > > necessarily store it.
> > >
> > > David
> > >
> > > -----Original Message-----
> > > From: beansproud [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, 16 June 2008 16:42
> > > To: [email protected]
> > > Subject: where nutch store crawled data
> > >
> > > Hi,
> > > I'm new to Nutch. When I use Nutch to crawl pages, I can get the
> > > crawled data by using the command: nutch readseg.
> > > My question is: can I get the data directly? I just can't find where
> > > Nutch puts them.
> > > Can anybody tell me?
> > > Thanks very much!
> > > --
> > > View this message in context:
> > > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17865961.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> > --
> > View this message in context:
> > http://www.nabble.com/where-nutch-store-crawled-data-tp17865961p17905486.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
