Alright, well I hacked up a copy of ClusterManagers and added an obtain_procs() function that actually acquires the available processes and generates and stores the relevant WorkerConfigs with the ClusterManager. This function does not require any locks to be held.
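A minimal sketch of what such a resource-acquisition phase could look like, assuming roughly the modern Distributed API. The type name MyClusterManager, the pending field, and the obtain_procs! signature are all illustrative, not the actual patch:

```julia
# Hypothetical sketch; names and fields are illustrative.
using Distributed

struct MyClusterManager <: ClusterManager
    pending::Vector{WorkerConfig}   # configs gathered ahead of time
    lock::ReentrantLock             # guards `pending` only
end

MyClusterManager() = MyClusterManager(WorkerConfig[], ReentrantLock())

# Phase 1: slow resource acquisition, done *before* addprocs and
# without holding any cluster-wide lock.
function obtain_procs!(mgr::MyClusterManager, n::Integer)
    for _ in 1:n
        wc = WorkerConfig()      # talk to the resource manager here
        wc.host = "localhost"    # placeholder: fill in the real host/port
        lock(mgr.lock) do
            push!(mgr.pending, wc)
        end
    end
end
```

The only lock here is a small one local to the manager, protecting the pending vector; the expensive work of reaching the resource manager happens outside any critical section.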
Then addprocs() (specifically the new ClusterManagers.launch function) just adds the generated WorkerConfigs to the instances_arr. The locking around addprocs doesn't cause any problems now, since the function is very fast.

If anyone sees any problems with this, please let me know. Or, if you know exactly why the lock must be held during addprocs, let me know. I'm hoping it's really just a lock on the instances_arr as far as launch is concerned. I couldn't see anything else in the old ClusterManagers.launch that might have concurrent-access problems, although I didn't dive into WorkerConfig(). If that needs a lock, I'll have to create one, but that seems odd to me.

In any case, a design change seems necessary here. Either addprocs needs to be changed so that locks are held only for short critical sections (which would probably require cluster managers to do their own locking of resources), or a two-function architecture like the one I implemented could be adopted, where addprocs is always fast. Thanks.
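The fast second phase could then look roughly like this. This is a sketch against the modern Distributed.launch signature (params, launched, notify condition); the launched array here plays the role of the instances_arr mentioned above, and MyClusterManager with its pending field is the hypothetical manager from the description:

```julia
# Phase 2: `launch` just hands the pre-built configs to the caller,
# so the time spent under addprocs' lock stays short.
function Distributed.launch(mgr::MyClusterManager, params::Dict,
                            launched::Array, c::Condition)
    lock(mgr.lock) do
        append!(launched, mgr.pending)   # the `instances_arr` in older code
        empty!(mgr.pending)
    end
    notify(c)   # tell addprocs the configs are ready
end
```

Because every WorkerConfig was built beforehand, launch only moves pointers between two arrays, so whatever lock addprocs holds around it is held for microseconds rather than for the duration of a resource-manager round trip.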