On Tue, 2004-11-16 at 23:02 +0100, Andrzej Bialecki wrote:
> Peter A. Daly wrote:
> > "search" server. That means a 2b page index would need 100 search
> > servers, probably each with 4gig RAM and a "normal" sized hard disk. I
> > ...
> which would give you ~200 servers. For a whole web search engine
> this is not a very high number, if you compare it with 20,000+ servers
> at Google... ;-)
Google had 100K+ servers according to a press report a year or two ago.
Rumor has it that the number is larger now, although it would be
inconsiderate of me to share the larger rumored number.

> We won't know until someone tries this... I can't see for the moment an
> inherent limit on the number of search servers, except for the traffic
> overhead, and a possible CPU/memory bottleneck for the search front-end
> - which could be solved by introducing intermediate search nodes for
> merging partial results.

I haven't tried this, but my inspection of the code suggests that you
could set up the intermediate nodes without any code changes ---
DistributedSearch.Server wraps a NutchBean just as the JSP does. (There
is a rough sketch of the merging idea at the end of this mail.)

I think that a system with this large a number of parts will work badly
if it is not fault-tolerant --- both for small "faults," such as network
congestion requiring a packet retransmission, and for large faults like
the loss of a machine. (The second sketch below shows one way to fan out
a query without letting one slow or dead node stall the whole search.)

There are also some other theoretical difficulties of scaling that I
have thought of:

- Adding more machines lets you answer the same number of queries over a
  larger corpus, but there is no provision to divide up indices across
  machines so that not every query involves every machine. Fixing this
  could reduce fragility (fewer points of failure for any particular
  query), reduce network bandwidth, and reduce aggregate CPU usage ---
  so you could answer *more* queries over a larger corpus. (A routing
  sketch appears below.)

- The final phase of DistributedAnalysisTool, where the pagedb gets
  updated with the new scores, appears not to be parallel, and it
  streams through an awful lot of data, so I invoke Amdahl's law. (A
  small back-of-the-envelope calculation is appended below.)

- Other things that touch an awful lot of data in a non-parallel fashion
  include fetchlist generation and UpdateDatabaseTool.

Unfortunately, thinking about things theoretically might tell me where
some bottlenecks are, but it won't tell me where the most important
bottlenecks are, or how big your crawl has to get before you run out of
steam. More suggestive is that both Overture's defunct demo of a year or
two ago and Mike Cafarella's current projects have gone up to about 100
million pages and then scaled, as far as I know, no further. Maybe
bigger deployments have happened that I don't know about? But I'm
guessing these two groups ran into scaling limits, judging especially
from Mike's recent NDFS work.
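P.S. A few untested sketches to make the points above concrete. All of
them assume the 1.5 JDK, and every class and method name in them is made
up for illustration; none of this is the existing Nutch API. First, the
intermediate-node idea: an intermediate node just fans the query out to
its children (leaf search servers or further intermediate nodes) and
merges their partial top-k lists. The real thing would presumably reuse
DistributedSearch and NutchBean instead of these toy types.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy types for illustration only; not the Nutch classes.
interface SearchNode {
    // Return the top-k hits for a query, best first.
    List<Hit> search(String query, int k);
}

class Hit {
    final String docId;
    final float score;
    Hit(String docId, float score) { this.docId = docId; this.score = score; }
}

// An intermediate node: fans the query out to its children and merges
// the partial top-k lists into a single top-k list.
class MergeNode implements SearchNode {
    private final List<SearchNode> children;
    MergeNode(List<SearchNode> children) { this.children = children; }

    public List<Hit> search(String query, int k) {
        List<Hit> all = new ArrayList<Hit>();
        for (SearchNode child : children) {
            all.addAll(child.search(query, k));   // each child returns at most k hits
        }
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score);   // descending by score
            }
        });
        return all.subList(0, Math.min(k, all.size()));
    }
}

Stacking MergeNodes in a tree is what keeps the front-end from having to
merge thousands of partial lists itself.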
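Second, the fault-tolerance point. One simple policy is to give every
shard a deadline and merge whatever came back in time, so a dead machine
or a retransmission-delayed packet costs you some hits rather than the
whole query. This sketch uses the 1.5 java.util.concurrent classes and
the made-up SearchNode/Hit types from the previous sketch.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class FaultTolerantFrontEnd {
    private final List<SearchNode> shards;
    private final ExecutorService pool;
    private final long timeoutMillis;

    FaultTolerantFrontEnd(List<SearchNode> shards, long timeoutMillis) {
        this.shards = shards;
        this.pool = Executors.newFixedThreadPool(shards.size());
        this.timeoutMillis = timeoutMillis;
    }

    List<Hit> search(final String query, final int k) throws InterruptedException {
        List<Callable<List<Hit>>> calls = new ArrayList<Callable<List<Hit>>>();
        for (final SearchNode shard : shards) {
            calls.add(new Callable<List<Hit>>() {
                public List<Hit> call() { return shard.search(query, k); }
            });
        }
        // invokeAll cancels any task that has not finished by the deadline.
        List<Future<List<Hit>>> futures =
            pool.invokeAll(calls, timeoutMillis, TimeUnit.MILLISECONDS);

        List<Hit> merged = new ArrayList<Hit>();
        for (Future<List<Hit>> f : futures) {
            if (f.isCancelled()) continue;      // slow or dead shard: skip it
            try {
                merged.addAll(f.get());
            } catch (Exception e) {
                // shard threw (e.g. connection refused): also skip it
            }
        }
        return merged;   // sort/truncate as in the previous sketch; omitted here
    }
}

Whether returning partial results is acceptable is a policy question,
but it is much better than having one sick shard take down every query.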
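Third, one naive way to divide indices so that not every query touches
every machine: partition the postings by term, so the entries for a term
live only on hash(term) % numShards, and route each query only to the
shards that own its terms. (This trades the current per-query fan-out
for extra traffic when combining postings, so it is not obviously a win;
the sketch just shows the routing idea, and the TermRouter class is
entirely made up.)

import java.util.HashSet;
import java.util.Set;

class TermRouter {
    private final int numShards;
    TermRouter(int numShards) { this.numShards = numShards; }

    int shardFor(String term) {
        // Mask the hash rather than using Math.abs, which can overflow.
        return (term.hashCode() & 0x7fffffff) % numShards;
    }

    // The subset of shards a query actually has to contact.
    Set<Integer> shardsFor(String[] queryTerms) {
        Set<Integer> shards = new HashSet<Integer>();
        for (String term : queryTerms) {
            shards.add(Integer.valueOf(shardFor(term)));
        }
        return shards;
    }
}

With, say, 100 shards, new TermRouter(100).shardsFor(new String[] {
"nutch", "scaling" }) contacts at most two machines instead of all 100,
which is where the fragility, bandwidth, and CPU savings would come from.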
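Finally, the Amdahl's law point. The 20% serial fraction below is
invented purely to show the shape of the curve; I have not measured
DistributedAnalysisTool.

class AmdahlSketch {
    // Speedup of a job whose serial fraction is s when the parallel part
    // runs on n machines: 1 / (s + (1 - s) / n).
    static double speedup(double serialFraction, int machines) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / machines);
    }

    public static void main(String[] args) {
        double s = 0.20;   // assumed serial fraction (the pagedb update), made up
        int[] ns = { 2, 10, 100, 1000 };
        for (int n : ns) {
            System.out.println(n + " machines -> speedup " + speedup(s, n));
        }
        // Even with 1000 machines the speedup is capped near 1/s = 5x,
        // which is why the serial pagedb update (and likewise the
        // non-parallel fetchlist generation and UpdateDatabaseTool runs)
        // eventually dominates the whole crawl cycle.
    }
}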
