List,
 
For my master's thesis I took on a project that requires mapping a number of
statically defined parallel jobs onto a more dynamic environment that allows
better scaling.
The situation described below led me to believe a cluster or distributed
queue (DrQueue?) solution is necessary. For details, see [situation] at the
end of this email.
 
Because I am new to this field, I would like to ask for a bit of your time to
help me get my bearings on current work and good documentation.
 
To see whether any suitable (or nearly suitable) environments already exist,
I made a list of capabilities such an environment should have:
* Dynamic load balancing, either by process migration or by stopping jobs and
restarting them elsewhere.
* A dynamic degree of parallelism, adjusted to the dataset that needs to be
processed (growing/shrinking).
* Failover of jobs when a node fails.
* The guarantee that a job runs only once in the cluster, even during node
failure.
* Limiting jobs to a class of nodes (a subset of all nodes).
 
Do you know of any projects that have these capabilities?
HPC clustering seems to come close, but I am unsure about the dynamic degree
of parallelism; isn't that fixed at job submission?
And when using process migration, do IP connections migrate as well? In other
words, will database connections stay intact during process migration?
 
I also hope you can share some favorite resources on the subject, especially
on methods that can be used to provide these capabilities.
 
 
[situation]
Over the years we have created a small number (about 40) of jobs that support
the main function of our business (an online social community). Typical jobs
include data aggregation, queue processing, automated email notifications, and
video/photo rendering. A common factor is that all of these scripts need
database connections.
 
For scalability reasons most of these jobs are parallelized; sometimes the
dataset is partitioned and processing is done in manageable chunks. Each job
basically runs in a while(true) loop with a bit of sleep after a chunk is
processed, so as not to overwhelm the machines once all data has been
processed. Some jobs, though, cannot be split and therefore cannot run in
parallel, since this would cause data corruption.
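
To illustrate the pattern, here is a rough Python sketch of such a worker
loop (only an illustration; the function names, chunk size, and sleep time
are made up and not our actual code):

import time

CHUNK_SIZE = 500          # illustrative value
IDLE_SLEEP_SECONDS = 30   # illustrative value

# Placeholder: in the real jobs this would fetch up to `limit` unprocessed
# rows from the database (the partition assigned to this instance).
def fetch_chunk(limit):
    return []

# Placeholder: the actual work (aggregation, notifications, rendering, ...).
def process_chunk(rows):
    for row in rows:
        pass

def worker_loop():
    # Endless loop: grab a chunk, process it, and sleep when nothing is
    # left, so the machines are not hammered once all data is processed.
    while True:
        rows = fetch_chunk(CHUNK_SIZE)
        if rows:
            process_chunk(rows)
        else:
            time.sleep(IDLE_SLEEP_SECONDS)

if __name__ == "__main__":
    worker_loop()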
 
To run these jobs we have about 10 nodes; configuration is done statically
through a configuration file. The configuration defines how many instances of
each job need to run, and sometimes even where to run them (crude load
balancing). Because of our growing volume of users we constantly need to
identify which jobs cannot keep up and adjust the configuration accordingly.
This is a cumbersome process that has grown out of habit and leads to
inefficient use of resources (both human and machine alike).
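
As a rough illustration of what the static configuration boils down to (an
assumed sketch in Python form; the job names, node names, and numbers are
invented, not our real configuration):

JOB_CONFIG = {
    # job name: number of instances, plus optional node pinning
    # (the node lists are our crude form of load balancing)
    "mail_notifications": {"instances": 4},
    "photo_rendering":    {"instances": 2, "nodes": ["node07", "node08"]},
    "feed_aggregation":   {"instances": 6},
    # jobs that must not run in parallel are limited to a single instance
    "billing_sync":       {"instances": 1, "nodes": ["node01"]},
}

When a job falls behind we bump its instance count by hand and restart it;
this is the manual adjustment step we would like a more dynamic environment
to take over.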
 
With regards,

Jos Houtman
System administrator Hyves.nl
email: [EMAIL PROTECTED]


