Re: [Nutch-dev] Nutch deployment, support, high-availability

Byron Miller Sun, 18 Jul 2004 11:42:42 -0700

Grigory,

There are many ways to achieve what you are looking
for - although nothing is "automatic" out of the box.

For search servers, you would simply run the
distributed search server process on each of your
query servers. On your JSP/web server you would then
create the search hosts file, which lists the host and
port number of each of your query servers.
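
As a rough sketch of that hosts file, the snippet below
renders one "host port" line per query server. The
one-pair-per-line layout, the host names and the port
are assumptions for illustration only; check your Nutch
version for the exact file name and format it expects.

```python
def render_search_hosts(servers):
    """Return the text of a search hosts file: one 'host port' line per server."""
    lines = []
    for host, port in servers:
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"bad port for {host}: {port}")
        lines.append(f"{host} {port}")
    return "\n".join(lines) + "\n"


# Hypothetical query servers; replace with your own hosts and ports.
QUERY_SERVERS = [("query1.example.com", 9999), ("query2.example.com", 9999)]

if __name__ == "__main__":
    # Print the file body; redirect it to wherever your web server reads it from.
    print(render_search_hosts(QUERY_SERVERS), end="")
```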

If you want redundancy, the easiest way is to set up
round-robin DNS for your query servers.  When you
create your search servers.txt you would then list the
load-balanced host name for the query servers, and DNS
would decide which specific host each connection hits.
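
To check what a round-robin name actually hands out, a
small sketch like the one below can list every address
behind it. The name search.example.com and the port are
made up, and the resolver argument exists only so the
function can be exercised without real DNS.

```python
import socket


def resolve_all(hostname, port, resolver=socket.getaddrinfo):
    """Return the distinct IP addresses a DNS name resolves to, in resolver order."""
    seen = []
    for _family, _type, _proto, _canon, sockaddr in resolver(
        hostname, port, type=socket.SOCK_STREAM
    ):
        ip = sockaddr[0]  # sockaddr is (ip, port, ...) for both IPv4 and IPv6
        if ip not in seen:
            seen.append(ip)
    return seen


if __name__ == "__main__":
    # Hypothetical round-robin name; expect one address per query server.
    try:
        print(resolve_all("search.example.com", 9999))
    except socket.gaierror as err:
        print("lookup failed:", err)
```

Keep in mind that plain round-robin DNS keeps handing
out the address of a host that is down, so it spreads
load but does not by itself remove a dead server.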

Generally for creating indexes, I do about 1 million
pages at a time and merge up to whatever the hardware
I'm running on can handle (memory and CPU load).  So
typically I would say 4-10 million pages on a single
server.
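
The arithmetic behind that can be sketched as follows.
The function name and the default of 4 million pages
per server are only illustrative (the low end of the
range above); tune both numbers to your own hardware.

```python
def plan_merges(total_pages, batch_size=1_000_000, pages_per_server=4_000_000):
    """Split a crawl into index batches and group the batches by server.

    Returns one list per server; each list holds that server's batch sizes.
    """
    if batch_size <= 0 or pages_per_server < batch_size:
        raise ValueError("need 0 < batch_size <= pages_per_server")
    batches_per_server = pages_per_server // batch_size
    batches = [batch_size] * (total_pages // batch_size)
    if total_pages % batch_size:
        batches.append(total_pages % batch_size)  # final partial batch
    return [
        batches[i:i + batches_per_server]
        for i in range(0, len(batches), batches_per_server)
    ]


if __name__ == "__main__":
    for n, group in enumerate(plan_merges(9_500_000), start=1):
        print(f"server {n}: {len(group)} batches, {sum(group)} pages")
```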

As for updating while the system is up - currently you
have to bounce the web services when you add new
indexes, as they aren't aware of changes. However, with
Nutch/Lucene you can do almost all admin tasks
(index, fetch, generate segments) without worrying
about locks and such - if there is a lock issue you
will usually be alerted to it :)

To replicate your data, you can look at the NDFS code,
use rsync, or use one of the distributed file systems
such as Coda, GFS or even AFS if you want more of an
NFS-type system.
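
For the rsync route, a minimal sketch might build one
command per query server like this. The paths and host
names are hypothetical, and it assumes rsync over ssh
is already set up between the machines.

```python
import shlex


def rsync_commands(index_dir, hosts, remote_dir):
    """Build one rsync command per query server to mirror a local index directory."""
    # Trailing slash on the source: copy the directory's contents, not the directory.
    src = index_dir.rstrip("/") + "/"
    dest = remote_dir.rstrip("/") + "/"
    return [["rsync", "-a", "--delete", src, f"{host}:{dest}"] for host in hosts]


if __name__ == "__main__":
    # Hypothetical paths and hosts; print the commands instead of running them.
    for cmd in rsync_commands("/data/nutch/index", ["query1", "query2"], "/data/nutch/index"):
        print(shlex.join(cmd))
```

Each command can be handed to subprocess.run(cmd,
check=True) once you trust it. Note that --delete
removes files on the destination that are gone from the
source, so try it with rsync's -n (dry run) flag first.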

A lot of Nutch is left for you to scale based upon your
requirements, although with the recent JMX work and
the configuration work being done it will become easier
to manage such systems and integrate a distributed
architecture.

If you say more about what you're trying to accomplish
we may be able to assist more. I would also recommend
reading the lists.  If you want direct help, feel free
to email me directly.

-byron

--- gbeg <[EMAIL PROTECTED]> wrote:
> Hello all,
> 
> I am particularly interested in the following
> issues:
> 
> 1. Deployment of nutch
>       - how to establish the search system with
> several search servers, 
>       - how to divide the "data" between them, 
>       - how to perform scheduled refetchings, 
>       - how many fetchers should be, etc.
> 
> 2. High availability/redundancy of the system. 
>       - How to update the indexes / webdb while
> keeping the system alive, 
>       - how to replicate the data, 
>       - what happens when a search server goes
> offline, 
>       - how to make webdb redundant, etc.
> 
> Please point me to the resources if there are any,
> otherwise let's gather the knowledge and create the
> appropriate docs :)
> 

