Alright, well I hacked up a copy of ClusterManagers and added an 
obtain_procs() function that actually obtains the available processes, 
then generates and stores the relevant WorkerConfigs on the 
ClusterManager.  This function does not require any locks to be held.
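In rough outline, the idea looks like this (a minimal sketch, not the 
actual patch: TwoPhaseManager, the pending field, and request_hosts() are 
illustrative names I'm making up here, and the imports assume the 
Distributed stdlib on Julia >= 0.7):

    using Distributed  # ClusterManager, WorkerConfig

    struct TwoPhaseManager <: ClusterManager
        pending::Vector{WorkerConfig}  # configs gathered ahead of addprocs()
    end
    TwoPhaseManager() = TwoPhaseManager(WorkerConfig[])

    function obtain_procs(mgr::TwoPhaseManager, np::Integer)
        # Talk to the queue/resource manager here (potentially slow).
        # No lock is needed: this runs entirely outside addprocs().
        for host in request_hosts(np)   # hypothetical resource query
            wc = WorkerConfig()
            wc.host = host
            push!(mgr.pending, wc)
        end
        return mgr.pending
    end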

Then addprocs() (specifically the new ClusterManagers.launch function) 
just appends the pre-generated WorkerConfigs to the instances_arr.  The 
locking around addprocs no longer causes problems, since launch is now 
very fast.
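
With the configs pre-gathered, launch() has almost nothing left to do, so 
holding the lock across it is cheap.  Something like this (again a 
sketch; the signature follows the current Distributed launch/manage API 
rather than the old instances_arr-style one):

    function Distributed.launch(mgr::TwoPhaseManager, params::Dict,
                                launched::Array, c::Condition)
        append!(launched, mgr.pending)  # fast: no external calls under the lock
        empty!(mgr.pending)
        notify(c)                       # let addprocs() start connecting
    end

    # Required stub so the manager is usable; real code would handle ops.
    Distributed.manage(::TwoPhaseManager, id::Integer,
                       config::WorkerConfig, op::Symbol) = nothing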

If anyone sees any problems with this, please let me know.  Or, if you 
know exactly why the lock must be held during addprocs, let me know.  I'm 
hoping it's really just a lock on the instances_arr as far as launch is 
concerned.  I couldn't see anything else in the old ClusterManagers.launch 
that might have concurrent-access problems, although I didn't dive into 
WorkerConfig().  If that needs a lock, I'll have to create one, but that 
seems odd to me.

In any case, a design change seems necessary here.  Either addprocs needs 
to be changed so that locks are only held around short critical sections, 
which would probably require cluster managers to implement their code 
with proper locking of shared resources, or a two-function architecture 
could be adopted, like the one I implemented, where addprocs is always 
fast (usage sketched below).
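
From the caller's side, the two-function pattern would look roughly like 
this (hypothetical usage of the sketch above):

    mgr = TwoPhaseManager()
    obtain_procs(mgr, 4)   # slow part: talk to the resource manager, no lock
    addprocs(mgr)          # fast part: launch() just hands over cached configs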

Thanks.
