On 11/14/13 1:13 PM, "Joshua Ladd" <josh...@mellanox.com> wrote:
>Let me try to summarize my understanding of the situation:
>
>1. Ralph made the OOB asynchronous.
>
>2. OOB cpcs don't work as a result of 1, and are thereby "deprecated",
>meaning: won't fix.
>
>3. Pasha moved the openib/connect to common/ofacm but excluded the rdmacm
>in that move. Never changed openib to use ofacm/common.
>
>4. UDCM is "functional" in the trunk, still sitting in openib/connect.
>But no one is entirely sure if it really works which is why it was
>disabled in 1.7. Nathan - is there a design doc you can share on this
>beyond the comments in the code?
>
>5. In order to satisfy the "grand plan":
>    a. UDCM still needs to be moved to common/ofacm.
>    b. OpenIB still needs to be changed to use common/ofacm.
>    c. RDMACM still needs to migrate to common/ofacm.
>    d. XRC support needs to be added to UDCM and put into
>common/ofacm.
>
>6. The "grand plan" being: move the BTLs into Opal - hence the need to
>scuttle the OOB cpcs thereby justifying the deprecation and not fixing
>cpcs after #1.
>
>So, that's a quick roundup of how we ended up here (as I understand it.)
>What needs to be done is:

That's my understanding as well.

>1. Somebody needs to certify/review/ that what Nathan has done is sound.
>From my perspective, this is a BIG change and needs a comprehensive
>architecture review. We've been using it in the trunk, and we've been
>testing it under MTT for some time - but have not deployed or tested at
>large-scale out in the field. Would be nice to see something on paper in
>terms of a design doc.
>
>2. Somebody then needs to move UDCM into common/ofacm.
>
>3. Somebody needs to change openib to use common/ofacm cpcs instead of
>openib/connect cpcs.
>
>4. Somebody needs to move RDMACM into common/ofacm and make sure RoCEE
>works.
>
>5. Somebody needs to add XRC support to UDCM - whatever that might mean.
>Given Nathan added UDCM back in 2011 and nobody is really sure it's ready
>for prime-time, and given Pasha's comments regarding the difference in
>state machine requirements between the two connection schemes, this
>doesn't seem like a trivial task.
>
>Given Nathan's comments a second ago about ORNL not supporting the IB
>Offload component, it barely makes sense to keep common/ofacm. And it
>sounds like the two cpcs presently contained therein are now unusable.
>
>All of this work is a result of the Grand Plan to move the BTLs into the
>Opal layer - which I have no idea what the motive is (I was not involved
>with OMPI when this was decided or debated.)
>
>Basically, without these five changes OpenIB is dead in 1.7.4 and beyond
>for RC, XRC, and RoCEE. These are blockers to 1.7.4 and I don't believe
>that the onus falls squarely on Mellanox to fix these. These were
>community decisions and, as such, it must be a community effort to
>resolve. We are happy to lend a hand, but we are not fixing all of this
>mess.

I think that the 5 steps above sound correct, and I agree that 1) this
means 1.7.4 is on hold until we fix this and 2) Mellanox shouldn't be
the only one to fix this for 1.7.4, given the amount of work involved.

Ralph, what, specifically, broke in the oob/xoob cpc mechanisms when the
oob was made asynchronous? That is, 1-5 are a huge amount of work; have
we done the analysis to say that updating the oob / xoob cpcs to work
with the new oob is actually more work than doing 1-5?

Obviously, there are long-term plans that make oob/xoob problematic. But
those aren't 1.7 / 1.8 plans.

Unfortunately, the cpcs were always out of my area of interest, so I'm
flying a bit more blind than I'd like here.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories