On Aug 2, 2008, at 11:46, Terry Dontje <terry.don...@sun.com> wrote:

Jeff Squyres wrote:
On Aug 1, 2008, at 11:39 PM, Brian Barrett wrote:

My thought is that if add_procs fails, then that BTL should be removed (as if init failed) and things should continue on. If that BTL was the only way to reach another process, we'll catch that later and abort.

There are always going to be errors that can't be detected until the device is actually used, so I think that add_procs errors should be treated the same as init errors. The error_cb is a red herring, as that's supposed to be used in situations where an error can't directly be returned to the upper layers (like the progress function). In this case, we can directly return an error, so we should do so (and I believe we do; it's the BML/PML that's the problem).
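
As a rough illustration of the policy described above, here is a minimal sketch in C. It is not Open MPI's actual BML code: every `toy_` name, the struct layout, and the callback signature are hypothetical stand-ins chosen for the example. A module whose add_procs fails is dropped as if its init had failed, and only afterwards do we check whether every peer is still reachable through some surviving module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PEERS 8

/* Hypothetical stand-in for a BTL module; NOT Open MPI's real struct. */
typedef struct toy_btl {
    const char *name;
    bool active;   /* false once the module has been dropped */
    /* Returns 0 on success and marks the peers this module can reach. */
    int (*add_procs)(struct toy_btl *self, size_t npeers, bool *reachable);
} toy_btl_t;

/* Example callback: reaches every peer. */
int toy_add_procs_ok(toy_btl_t *self, size_t npeers, bool *reachable)
{
    (void)self;
    for (size_t p = 0; p < npeers; ++p) {
        reachable[p] = true;
    }
    return 0;
}

/* Example callback: fails outright, e.g. the device turned out to be unusable. */
int toy_add_procs_fail(toy_btl_t *self, size_t npeers, bool *reachable)
{
    (void)self;
    (void)npeers;
    (void)reachable;
    return -1;
}

/*
 * Call add_procs on every active module. A failing module is dropped
 * (treated like an init failure) and the loop continues. Returns 0 if
 * every peer is reachable through a surviving module, -1 otherwise.
 */
int toy_add_procs_all(toy_btl_t *btls, size_t nbtls, size_t npeers)
{
    bool covered[TOY_MAX_PEERS] = { false };

    assert(npeers <= TOY_MAX_PEERS);

    for (size_t b = 0; b < nbtls; ++b) {
        bool reachable[TOY_MAX_PEERS] = { false };

        if (!btls[b].active) {
            continue;
        }
        if (btls[b].add_procs(&btls[b], npeers, reachable) != 0) {
            btls[b].active = false;   /* drop it and carry on */
            continue;
        }
        for (size_t p = 0; p < npeers; ++p) {
            covered[p] = covered[p] || reachable[p];
        }
    }

    /* "We'll catch that later": the unreachable-peer check happens here. */
    for (size_t p = 0; p < npeers; ++p) {
        if (!covered[p]) {
            return -1;
        }
    }
    return 0;
}
```

With one failing and one working module, the call succeeds and only the failing module is marked inactive; with only the failing module, the call returns -1 because no peer is reachable.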

So if add_procs() fails, do you think that the BML/PML should finalize the module? That looks like an easy change to make.

Second, if there are no other successfully-add_proc()'ed modules from that component, should the BTL's progress function be removed from the list of progress functions? The real question is: if a module add_procs() fails, do we mandate that it still must be safe to call the component's progress function? I think you're saying "yes", but just wanted to be sure. I don't know offhand how a component's progress function is added to the list (can't check ATM), so I'd have to dig into that a bit.

I am curious how all of the above affects client/server or spawned jobs. If you finalize a BTL and then connect to a process that would use that BTL, would it reinitialize itself?

To deal with all the dynamics issues, I wouldn't finalize the BTL. The BML should handle the progress stuff, just as if the add_procs succeeded but returned no active peers. But I'd guess that's part of the bit that doesn't work today. I would further suspect that a BTL will need to have a working progress function in the face of add_procs failures to cope with all the dynamics options. I'm travelling this weekend, so I can't verify any of this at the moment.
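
To make the idea concrete, here is a small sketch in C of what "as if the add_procs succeeded but returned no active peers" could look like. It is only a toy model under stated assumptions: the `dyn_` names, the struct fields, and the `fail_next_add` knob are all hypothetical and do not correspond to the real BTL/BML interfaces. The module is never finalized, its progress function stays callable with zero peers, and a later connect can call add_procs again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy module; NOT the real BTL structure. */
typedef struct dyn_btl {
    size_t active_peers;   /* peers successfully added so far */
    int finalized;         /* stays 0 in this approach */
    int fail_next_add;     /* illustration knob: force the next add_procs to fail */
} dyn_btl_t;

/* Toy add_procs: fails once if the knob is set, otherwise adds the peers. */
int dyn_btl_add_procs(dyn_btl_t *btl, size_t npeers)
{
    if (btl->fail_next_add) {
        btl->fail_next_add = 0;
        return -1;
    }
    btl->active_peers += npeers;
    return 0;
}

/*
 * Toy progress function. A real BTL would poll its device here; this one
 * simply reports one event per active peer, so it returns 0 and does
 * nothing harmful when no peers were ever added.
 */
int dyn_btl_progress(dyn_btl_t *btl)
{
    assert(!btl->finalized);
    return (int)btl->active_peers;
}

/*
 * Toy BML-side wrapper: an add_procs failure is handled exactly like a
 * success that returned no active peers. The module is not finalized, so
 * its progress function remains registered and a later call may succeed.
 * Returns the number of peers that became active through this module.
 */
size_t dyn_bml_add_procs(dyn_btl_t *btl, size_t npeers)
{
    if (dyn_btl_add_procs(btl, npeers) != 0) {
        return 0;
    }
    return npeers;
}
```

In this model, a failed first call leaves the module alive with no peers, progress can still be called safely, and a second call (say, from a later connect) can add peers without any reinitialization.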

Brian
