Jason Boss wrote:
Thanks for the reply.  Say for instance we want to index about 1/2 billion
pages, how many computers using the distributed search method would you
need?  And to get fast decent results of those 1/2 billion pages what is the
recommended hardware needed to make that happen?

A single search node can typically handle up to around 20M pages before it gets too slow. A system with ~20M pages per node can usually process a couple of searches per second. So, for a 500M-page system you'd need around 25 machines (500M / 20M). You might squeak by with as few as 10, at about 50M pages each, but some searches would be pretty slow.


These numbers are rough estimates.  Your mileage may vary.
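
The node count above is just a ceiling division, which can be sketched as follows. This is a back-of-the-envelope helper, not part of Nutch; the function names are made up for illustration, and the 20M pages-per-node figure is the rough estimate quoted above.

```python
def nodes_needed(total_pages: int, pages_per_node: int) -> int:
    """Smallest node count that keeps every node at or under pages_per_node."""
    if total_pages < 0 or pages_per_node <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_node > 0")
    # Integer ceiling division, so a partly filled last node still counts.
    return (total_pages + pages_per_node - 1) // pages_per_node


def pages_per_node(total_pages: int, nodes: int) -> int:
    """Pages each node holds if the index is split evenly (rounded up)."""
    if nodes <= 0:
        raise ValueError("nodes must be > 0")
    return (total_pages + nodes - 1) // nodes


TOTAL_PAGES = 500_000_000  # the half-billion pages from the question

print(nodes_needed(TOTAL_PAGES, 20_000_000))  # 25
print(pages_per_node(TOTAL_PAGES, 10))        # 50000000
```

With only 10 machines each node holds 50M pages, well past the ~20M comfort zone, which is why some searches would be slow.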

Is there a ratio of ram to pages to keep the index moving fast?

In general, a machine with more memory can handle a larger index. In particular, there's a knee in the curve at around 2 KB of RAM per page. If you have more RAM than that, the entire index fits in memory and things run much faster. If you have much less RAM than that, most queries have to do some disk I/O, which significantly reduces throughput.
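
To see where a given machine sits relative to that knee, here is a rough sketch. It assumes "2k" means 2 KiB (2048 bytes) per page and ignores memory used by the OS and the JVM, so treat the result as an upper bound; the names are illustrative only.

```python
BYTES_PER_PAGE = 2 * 1024  # assumption: "2k/page" read as 2 KiB per page
GIB = 1024 ** 3


def pages_fitting_in_ram(ram_bytes: int, bytes_per_page: int = BYTES_PER_PAGE) -> int:
    """Largest page count whose index would fit entirely in ram_bytes."""
    if ram_bytes < 0 or bytes_per_page <= 0:
        raise ValueError("ram_bytes must be >= 0 and bytes_per_page > 0")
    return ram_bytes // bytes_per_page


def index_fits_in_ram(pages: int, ram_bytes: int,
                      bytes_per_page: int = BYTES_PER_PAGE) -> bool:
    """True if a node with ram_bytes is on the fast side of the knee."""
    return pages <= pages_fitting_in_ram(ram_bytes, bytes_per_page)


print(pages_fitting_in_ram(4 * GIB))           # 2097152
print(index_fits_in_ram(2_000_000, 4 * GIB))   # True
print(index_fits_in_ram(20_000_000, 4 * GIB))  # False
```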


Also I noticed on Yahoo's test of Nutch the results seem fairly snappy.
Have they modified the search portion of Nutch or is this all from the
distributed?

This is distributed. The original demo distributed it over three boxes. I don't know how many it is distributed over now, perhaps a few more.


And last but not least, I have a few questions on the distributed search
part.  If you have 5 servers and want to distribute the index over 5
servers, do you load Nutch up on all 5 servers and create the
search-servers.txt for each box?  I was trying to figure out from the Wiki,
but it talks about using the port numbers.  Is there any other docs on the
server command or could you give me some quick points on that?

You don't need a search-servers.txt on the backend machines that run 'bin/nutch server'. It is needed only on the front-end web server machines that run Tomcat. Of course, these might be the same machines.
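
As a sketch of what the front end's file might look like for the five-server case, the snippet below writes one backend per line. It assumes the file format is a host name and a port separated by a space, one backend per line; check that against the wiki for your Nutch version. The host names and port 9999 are hypothetical placeholders.

```python
def search_servers_txt(backends):
    """Render (host, port) pairs as text, one "host port" line per backend.

    The "host port" layout is an assumption about the file format.
    """
    return "".join(f"{host} {port}\n" for host, port in backends)


# Hypothetical backends: five machines, each running 'bin/nutch server'.
backends = [(f"search{i}.example.com", 9999) for i in range(1, 6)]

# Only the front-end (Tomcat) machine needs this file.
print(search_servers_txt(backends), end="")
```

Each backend serves its own slice of the index, and the front end fans every query out to all of the listed servers.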


Doug


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
