On Tue, 2004-11-16 at 23:02 +0100, Andrzej Bialecki wrote:
> Peter A. Daly wrote:
> > "search" server.  That means a 2b page index would need 100 search 
> > servers, probably each with 4gig RAM and a "normal" sized hard disk.  I 
> 
> ... which would give you ~200 servers. For a whole web search engine 
> this is not a very high number, if you compare it with 20,000+ servers 
> at Google... ;-)

Google had 100K+ servers according to a press report a year or two ago.
Rumor has it that the number is larger now, although it would be
indiscreet of me to share the larger rumored number.

> We won't know until someone tries this... I can't see for the moment an 
> inherent limit on the number of search servers, except for the traffic 
> overhead, and a possible CPU/memory bottleneck for the search front-end 
> - which could be solved by introducing intermediate search nodes for 
> merging partial results.

I haven't tried this, but my inspection of the code suggests that you
could set up the intermediate nodes without any code changes ---
DistributedSearch.Server wraps a NutchBean just as the JSP does.
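
To make that concrete, here is a rough sketch of the merging idea.  All
of the names below (SearchNode, PartialHit, IntermediateNode) are mine,
invented for illustration, not Nutch's; in the real code the leaf role
is played by DistributedSearch.Server wrapping a NutchBean, and the
intermediate would presumably be a Server whose searcher is itself a
distributed Client pointing at the leaves.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  // Hypothetical sketch only; none of these classes exist in Nutch.
  interface SearchNode {
    List<PartialHit> search(String query, int topK);
  }

  class PartialHit {
    final String docId;
    final float score;
    PartialHit(String docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  // An intermediate node fans the query out to its children (leaf
  // searchers or further intermediates), merges the partial hit lists
  // by score, and passes only the top-k upward, so the front-end never
  // has to merge results from hundreds of leaves itself.
  class IntermediateNode implements SearchNode {
    private final List<SearchNode> children;

    IntermediateNode(List<SearchNode> children) {
      this.children = children;
    }

    public List<PartialHit> search(String query, int topK) {
      List<PartialHit> merged = new ArrayList<PartialHit>();
      for (SearchNode child : children) {   // in practice, parallel RPCs
        merged.addAll(child.search(query, topK));
      }
      Collections.sort(merged, new Comparator<PartialHit>() {
        public int compare(PartialHit a, PartialHit b) {
          return Float.compare(b.score, a.score);   // best score first
        }
      });
      return new ArrayList<PartialHit>(
          merged.subList(0, Math.min(topK, merged.size())));
    }
  }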

I think a system with this many parts will work badly unless it is
fault-tolerant --- both against small "faults," such as network
congestion requiring a packet retransmission, and against large ones,
like the loss of a machine.
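
Again purely as a hypothetical sketch (reusing the SearchNode and
PartialHit names from above), the kind of thing I mean for the small
faults is a per-node deadline at each fan-out point, so that one slow
or dead leaf costs you a few hits rather than stalling the whole query:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.CancellationException;
  import java.util.concurrent.ExecutionException;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import java.util.concurrent.TimeUnit;

  class FaultTolerantFanout {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Query every child, but give up on stragglers after timeoutMs.
    // Hits from nodes that miss the deadline (or are down) are simply
    // dropped, so the user gets a slightly degraded answer instead of
    // an error page.
    List<PartialHit> search(List<SearchNode> nodes, final String query,
                            final int topK, long timeoutMs)
        throws InterruptedException {
      List<Callable<List<PartialHit>>> tasks =
          new ArrayList<Callable<List<PartialHit>>>();
      for (final SearchNode node : nodes) {
        tasks.add(new Callable<List<PartialHit>>() {
          public List<PartialHit> call() {
            return node.search(query, topK);
          }
        });
      }
      List<PartialHit> merged = new ArrayList<PartialHit>();
      List<Future<List<PartialHit>>> futures =
          pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
      for (Future<List<PartialHit>> f : futures) {
        try {
          merged.addAll(f.get());
        } catch (CancellationException e) {
          // node missed the deadline: skip it
        } catch (ExecutionException e) {
          // node threw (connection refused, etc.): skip it too
        }
      }
      return merged;
    }
  }

Covering the large faults probably also means replicating each piece of
the index on more than one machine, but that's beyond a sketch.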

There are also some other theoretical scaling difficulties that I have
thought of:

- adding more machines lets you answer the same number of queries over a
larger corpus, but there is no provision for dividing the index across
machines so that not every query involves every machine.  Fixing this
could reduce fragility (fewer points of failure for any particular
query), network bandwidth, and aggregate CPU usage --- so you could
answer *more* queries over a larger corpus.  There is a rough sketch of
one way to do this after this list.

- the final phase of DistributedAnalysisTool, where the pagedb gets
updated with the new scores, appears not to be parallel, and it streams
through an awful lot of data, so I invoke Amdahl's law: if that serial
phase is, say, 10% of the total work, no number of machines gets you
more than a 10x speedup on the whole job.

- other things that touch an awful lot of data in a non-parallel fashion
include fetchlist generation and UpdateDatabaseTool.
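
Here is the sketch promised under the first point above.  Everything in
it is hypothetical (my names, not Nutch's): one way to keep a query from
touching every machine is to partition the index by term rather than by
document, so that a shard only owns the postings for a slice of the
vocabulary and the front-end only fans out to the shards that own one of
the query's terms.

  import java.util.HashSet;
  import java.util.Set;

  // Hypothetical term-based routing: shard i holds the postings for
  // every term that hashes to i, so a three-word query on a 100-shard
  // cluster visits at most three machines instead of all 100.
  class TermRouter {
    private final int numShards;

    TermRouter(int numShards) {
      this.numShards = numShards;
    }

    int shardFor(String term) {
      // assumes a stable hash partitioning of the vocabulary
      return (term.hashCode() & 0x7fffffff) % numShards;
    }

    Set<Integer> shardsFor(String[] queryTerms) {
      Set<Integer> shards = new HashSet<Integer>();
      for (String term : queryTerms) {
        shards.add(shardFor(term));
      }
      return shards;
    }
  }

The obvious cost is that any scoring which needs per-document
information, or intersecting postings that live on different shards,
gets harder over the network, which is presumably part of why the
simple document-partitioned scheme is what exists today.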

Unfortunately, thinking about things theoretically might tell me where
some bottlenecks are, but it won't tell me where the most important
bottlenecks are or how big your crawl has to get before you run out of
steam.

More suggestive is that both Overture's defunct demo of a year or two
ago and Mike Cafarella's current projects got up to about 100 million
pages and then, as far as I know, scaled no further.  Maybe bigger
deployments have happened that I don't know about?  But my guess is
that both groups ran into scaling limits; Mike's recent NDFS work in
particular suggests as much.

