Brian and I chatted about this on the phone today. Conclusions that we came to:

1. We need to add a few lines of code to ensure that the MCA base refuses to open components that have a different MCA version number (i.e., dlopen a DSO, dlsym to get the component struct, check the version number, if it's not the same MCA major.minor as our MCA major.minor, dlclose it). This is easy to do; I'll add it to the hg.

2. Let's set the precedent now that changing the MCA version does *not* force a change of all the framework version numbers. The framework version numbers refer only to the frameworks' own interfaces; it's the triple of (MCA, framework, component) version numbers that uniquely identifies a component.

3. The load-time issues of mixing multiple MCA versions are solved by points #1 and #2.

4. Leave the bump of all framework versions to 2.0 in place because a good number of them had to be bumped anyway. We're probably bumping a few that didn't actually need to be bumped (i.e., those that didn't actually change since the v1.2 series), but what the heck -- most of them have changed, and it's a bunch of work to roll all that out. So let's just bump them, but not because we bumped the MCA version number; rather, we bump them because we knew that most of them needed to be bumped, but were too lazy to check and see exactly which ones needed it (hey, let's be honest here...).
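
To make points #1 and #2 concrete, here is a minimal sketch of the check and of the "triple" identity. All of the names below (sketch_component_t, mca_version_ok, same_component, the SKETCH_* macros) are made up for illustration and are not the real MCA base symbols. The struct stands in for whatever dlsym() hands back after the dlopen(); the dlopen/dlsym/dlclose calls themselves are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed for this sketch: the MCA version that "we" were built with. */
#define SKETCH_MCA_MAJOR 2
#define SKETCH_MCA_MINOR 0

typedef struct { int major, minor, release; } version_t;

/* Hypothetical stand-in for the component struct found via dlsym(). */
typedef struct {
    version_t   mca;            /* MCA base the component was built against */
    const char *framework_name;
    version_t   framework;      /* version of the framework's interface */
    const char *component_name;
    version_t   component;      /* version of the component itself */
} sketch_component_t;

/* Point #1: accept a component only if its MCA major.minor matches ours.
 * The release number is deliberately ignored.  A caller would dlclose()
 * the DSO when this returns false. */
bool mca_version_ok(const sketch_component_t *c)
{
    assert(c != NULL);
    return c->mca.major == SKETCH_MCA_MAJOR &&
           c->mca.minor == SKETCH_MCA_MINOR;
}

static bool version_eq(version_t a, version_t b)
{
    return a.major == b.major && a.minor == b.minor &&
           a.release == b.release;
}

/* Point #2: two components are "the same" only if the whole
 * (MCA, framework, component) version triple matches, along with the
 * framework and component names. */
bool same_component(const sketch_component_t *a, const sketch_component_t *b)
{
    return strcmp(a->framework_name, b->framework_name) == 0 &&
           strcmp(a->component_name, b->component_name) == 0 &&
           version_eq(a->mca, b->mca) &&
           version_eq(a->framework, b->framework) &&
           version_eq(a->component, b->component);
}
```

With this, a coll v1.2.3 component built against MCA v1.0.0 fails mca_version_ok() in an MCA v2.0 build, and it does not compare equal to the same coll v1.2.3 component built against MCA v2.0.0, even though the framework version never changed.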

If no one has any objections to this, I'll bring this stuff into the trunk at the original timeout -- Friday COB (i.e., tomorrow).




On Jul 21, 2008, at 8:55 PM, Jeff Squyres wrote:

On Jul 21, 2008, at 6:57 PM, Brian W. Barrett wrote:

I guess I don't understand. I thought there were three versions in every component -- the MCA version, the framework version, and the component version. The first two should determine if the component can safely be loaded and the third is to identify the component. I agree that for this change (an MCA-level change), the MCA version *should* change. However, the framework interface didn't change (well, not as a result of this change), meaning that the framework version *should not* change. The MCA load infrastructure should see that the MCA versions don't match, and not load the component.


Josh and I wrestled with this question for a bit and probably fell down on the side of conservatism; that's where this came from. There were two reasons why we went this way:

1. You could (for example) have a coll framework v1.2.3 component built with MCA v1.0.0 and the same coll framework v1.2.3 component built against MCA v2.0.0, and they would be different. Worse, they won't be "equal". Specifically, MCA 2.0.0 supports some minor features that v1.0.0 doesn't -- so even though you have 2 of the "same" component, they're not really the same. (*more on this below)

2. Another issue seemed pretty icky to solve, which led us to fall down a little heavier on the side of bumping all the framework version numbers. Let's say you have some Foo framework DSOs, some of which are MCA v1.0.0 and some of which are v2.0.0. The Foo framework interface is the same between the two. The MCA base can find/open all of them easily enough; but how do we return all the components to the caller? I could think of 3 ways:

A. return multiple lists to the caller: a list of each of v1.0.0 and v2.0.0 components. This means that every framework will need to handle (or be able to reject or specify to the MCA base to reject before even accepting as available) both MCA v1.0.0 and v2.0.0 components.

B. return a single list to the caller with both MCA component versions in the list. Pretty much the same as A, but it scales better if we get in the business of changing the MCA version a lot (please God no); I mention it mainly for completeness.

C. return a single list to the caller with all components "upgraded" to MCA v2.0. This seems like a nice solution -- a la the experiment we tried with coll a long time ago to prove to ourselves that run-time versioning could work (for those of you who don't remember: we had some coll v1.0.0 and some v1.1.0 components; the coll base transparently handled everything at run-time). However, there's a problem with this idea: since all frameworks use the component struct as a "super" for their component structs, the MCA base does not know the total size of the component public struct. So it cannot "upgrade" the MCA v1.0.0 structure in memory to a v2.0.0, because the v2.0.0 struct is bigger than the v1.0.0 struct. So we can't just magically treat everything as v2.0.0 components at the MCA base level; we'd have to have the frameworks transmogrify their own components (although we might be able to have some MCA base helper function that does the heavy lifting, as long as the framework supplied the total struct length).

Note that all three of these solutions involve touching every framework in some way (although not every component).
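
For what it's worth, here is a rough sketch of what the helper function mentioned in option C might look like. Everything here is hypothetical: the header layouts, the feature_flags field, the Foo structs, and upgrade_to_v2() are invented for illustration and do not match the real MCA structs. It also assumes the framework's own fields start immediately after the header with no padding in either layout, which real structs (e.g., ones holding pointers) would not guarantee.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical MCA v1.0 component header (not the real layout). */
typedef struct {
    int  mca_major, mca_minor;
    char name[16];
} hdr_v1_t;

/* Hypothetical MCA v2.0 header: one extra field, so it is bigger. */
typedef struct {
    int  mca_major, mca_minor;
    char name[16];
    int  feature_flags;          /* assumed v2.0-only field */
} hdr_v2_t;

/* A Foo framework component uses the header as its "super". */
typedef struct { hdr_v1_t super; int priority; } foo_v1_t;
typedef struct { hdr_v2_t super; int priority; } foo_v2_t;

/* Build a v2.0-shaped copy of a v1.0 component.  The base cannot know
 * how large the framework's struct is, so the framework passes the
 * total length.  Returns malloc'ed memory (caller frees) or NULL. */
void *upgrade_to_v2(const void *v1_component, size_t v1_total_len)
{
    if (v1_component == NULL || v1_total_len < sizeof(hdr_v1_t)) {
        return NULL;
    }
    size_t tail_len = v1_total_len - sizeof(hdr_v1_t);
    unsigned char *out = calloc(1, sizeof(hdr_v2_t) + tail_len);
    if (out == NULL) {
        return NULL;
    }

    /* Translate the header; new v2.0 fields get a default of 0. */
    hdr_v1_t old_hdr;
    memcpy(&old_hdr, v1_component, sizeof old_hdr);
    hdr_v2_t new_hdr = { 2, 0, {0}, 0 };
    memcpy(new_hdr.name, old_hdr.name, sizeof new_hdr.name);
    memcpy(out, &new_hdr, sizeof new_hdr);

    /* Copy the framework-specific part verbatim, shifted past the
     * larger header. */
    memcpy(out + sizeof(hdr_v2_t),
           (const unsigned char *) v1_component + sizeof(hdr_v1_t),
           tail_len);
    return out;
}
```

A Foo framework would call something like upgrade_to_v2(&comp, sizeof(foo_v1_t)) and treat the result as a foo_v2_t. Note that the copy has to live in new memory; the v1.0.0 struct inside the DSO can't be grown in place, which is exactly the problem described in C.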

----

All that being said, I suppose there are two arguments against these kinds of issues:

- this situation probably won't happen in practice (component A compiled against MCA v1.0.0 and against MCA v2.0.0) because we only distribute components as part of full OMPI releases, and therefore they're fairly tightly bound to their MCA version. However, for components that didn't change between OMPI v1.2 and v1.3, you *will* have this scenario, but in different OMPI installation directories (and therefore it pretty much doesn't matter).

- I think the crux of Brian's argument is that the framework's version number identifies *the framework's* interface -- not the whole interface (i.e., not including the MCA base interface). From this perspective, it *is* independent of the MCA version number. Specifically: the version of the framework interface is independent of the binary compatibility and features issues surrounding the MCA base.

-----

So Josh and I thought we picked a solution that was clear, simple, and one-of-several sucky options. :-\ We could probably be convinced to go another way if someone has strong feelings.

--
Jeff Squyres
Cisco Systems
