Hi Ken!! Thanks for the short conversation regarding the project. You've more or less confirmed some of our initial thoughts and research.
Regards,

-Bruce Douglas
[EMAIL PROTECTED]

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, August 28, 2006 12:31 PM
To: [email protected]
Subject: Re: processing parallel sites

>can nutch/lucene handle getting content from 100s-1000s of pages
>simultaneously....

If you mean "Can Nutch handle fetching 1000s of pages at a time", the answer is yes.

If you mean "Can Lucene, when used as the IR engine for Nutch, handle searching 1000s of pages at a time", then the answer is also yes.

>if it can, how does it write the content to the resulting/output db.
>
>does it actually perform 100s-1000s of simultaneous connections to a backend
>db.
>
>does it utilize writing the output files to a filesystem, which is then
>somehow inserted into a db....

The fetched pages get written to a sequential file. After a fetch cycle, additional data about the page state gets processed, and the results are used to update the crawldb, which is a kind of specialized database for web crawling.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
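
For anyone reading this thread later, the flow Ken describes (fetch many pages in parallel, append the results to one sequential file, then update the crawldb in a batch after the fetch cycle) can be sketched roughly as below. This is only an illustration and not Nutch code: `fake_fetch`, `fetch_segment`, `update_crawldb`, the JSON-lines file layout, and the dict standing in for the crawldb are all hypothetical names and choices made up for the example.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP fetch; returns (url, status, content)."""
    return url, 200, f"<html>content of {url}</html>"


def fetch_segment(urls, segment_path, workers=8):
    """Fetch URLs with a thread pool, appending each result to one file."""
    lock = Lock()
    with open(segment_path, "w", encoding="utf-8") as out, \
            ThreadPoolExecutor(max_workers=workers) as pool:

        def fetch_and_append(url):
            fetched_url, status, content = fake_fetch(url)
            record = json.dumps(
                {"url": fetched_url, "status": status, "content": content}
            )
            # Only one thread writes at a time, so records are appended
            # whole, in whatever order the fetches complete.
            with lock:
                out.write(record + "\n")

        # list() waits for every fetch and re-raises any worker exception.
        list(pool.map(fetch_and_append, urls))


def update_crawldb(crawldb, segment_path):
    """After the fetch cycle, read the file back and update page state."""
    with open(segment_path, encoding="utf-8") as segment:
        for line in segment:
            record = json.loads(line)
            ok = record["status"] == 200
            crawldb[record["url"]] = "fetched" if ok else "failed"
    return crawldb


# Usage: 1000 pages, fetched in parallel, then one batch update.
urls = [f"http://example.com/page{i}" for i in range(1000)]
crawldb = {url: "unfetched" for url in urls}
segment_path = os.path.join(tempfile.mkdtemp(), "segment.jsonl")
fetch_segment(urls, segment_path)
update_crawldb(crawldb, segment_path)
```

The point of the sketch is that the fetch threads in this toy version never talk to a database at all. They only append to a file, and the page-state store is updated once, afterwards, from that file.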
