Hi Ken! Thanks for the short conversation regarding the project. You've more or less confirmed some of our initial thoughts and research.
Regards,
-Bruce Douglas
[EMAIL PROTECTED]

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, August 28, 2006 12:31 PM
To: [email protected]
Subject: Re: processing parallel sites

>can nutch/lucene handle getting content from 100s-1000s of pages
>simultaneously....

If you mean "Can Nutch handle fetching 1000s of pages at a time", the answer is yes. If you mean "Can Lucene, when used as the IR engine for Nutch, handle searching 1000s of pages at a time", the answer is also yes.

>if it can, how does it write the content to the resulting/output db.
>
>does it actually perform 100s-1000s of simultaneous connections to a backend
>db.
>
>does it utilize writing the output files to a filesystem, which is then
>somehow inserted into a db....

The fetched pages get written to a sequential file. After a fetch cycle, additional data about the page state gets processed, and the results are used to update the crawldb, which is a kind of specialized database for web crawling.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
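[Editor's note: the pattern Ken describes — fetch many pages concurrently, append the results to a sequential file on the filesystem, and only afterward batch-update the crawl database — can be sketched as below. This is an illustrative Python sketch of that write pattern, not Nutch's actual implementation; the `fetch` stub, the JSON segment format, and the dict-based crawldb are all simplifying assumptions.]

```python
import json
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP fetch (hypothetical; Nutch uses its own
    # fetcher threads and protocols). Returns the page "content".
    return f"<html>content of {url}</html>"

def run_fetch_cycle(urls, segment_path, crawldb):
    # 1. Fetch many pages concurrently -- the concurrency is in the
    #    fetching, not in connections to any backend database.
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(lambda u: (u, fetch(u)), urls))

    # 2. Append every fetched page to a sequential segment file.
    #    No per-page database writes happen during the fetch itself.
    with open(segment_path, "a", encoding="utf-8") as seg:
        for url, content in results:
            seg.write(json.dumps({"url": url, "content": content}) + "\n")

    # 3. After the cycle completes, batch-update the crawldb with the
    #    new state of each page.
    for url, _ in results:
        crawldb[url] = {"status": "fetched"}
    return crawldb
```

The point of the sketch is the separation of phases: hundreds of simultaneous fetches funnel into one sequential file write, and the crawldb update happens once per cycle rather than once per connection.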
