The Mellanox 2.33.5100 firmware upgrade that came out a few days ago did indeed fix the problem we were seeing with the mlx4 errors. Thanks for pointing us in that direction.
Dave Turner

On Thu, Jan 29, 2015 at 11:00 AM, <devel-requ...@open-mpi.org> wrote:

> Send devel mailing list submissions to
>     de...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     http://www.open-mpi.org/mailman/listinfo.cgi/devel
> or, via email, send a message with subject or body 'help' to
>     devel-requ...@open-mpi.org
>
> You can reach the person managing the list at
>     devel-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of devel digest..."
>
>
> Today's Topics:
>
>    1. Re: mlx4 QP operation err (Christopher Samuel)
>    2. Re: mlx4 QP operation err (Devendar Bureddy)
>    3. Re: MTL interfaces (Todd Kordenbrock)
>    4. Re: For discussion tomorrow: MTL issues (Friedley, Andrew)
>    5. Webex for this morning (Jeff Squyres (jsquyres))
>    6. Re: For discussion tomorrow: MTL issues (Jeff Squyres (jsquyres))
>    7. Re: Webex for this morning (Jeff Squyres (jsquyres))
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 29 Jan 2015 11:52:46 +1100
> From: Christopher Samuel <sam...@unimelb.edu.au>
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] mlx4 QP operation err
> Message-ID: <54c9845e.90...@unimelb.edu.au>
> Content-Type: text/plain; charset=windows-1252
>
> Hi Dave,
>
> On 29/01/15 11:31, Dave Turner wrote:
>
> > I've found some old references to similar mlx4 errors dating back to
> > 2009 that lead me to believe this may be a firmware error. I believe we're
> > running the most up to date version of the firmware.
>
> There was a new version released a few days ago, 2.33.5100:
>
> http://www.mellanox.com/page/firmware_table_ConnectX3ProEN
>
> Release notes are here:
>
> http://www.mellanox.com/pdf/firmware/ConnectX3Pro-FW-2_33_5100-release_notes.pdf
>
> Bug fixes start on page 23, looks like there are 29 fixes
> in this version, and fix 1 is for RoCE (though of course may
> not be relevant) - "The first Read response was not treated as
> implicit ACK" (discovered in 2.30.8000).
>
> All the best,
> Chris
> --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 29 Jan 2015 01:00:53 +0000
> From: Devendar Bureddy <deven...@mellanox.com>
> To: "drdavetur...@gmail.com" <drdavetur...@gmail.com>, Open MPI
>     Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] mlx4 QP operation err
> Message-ID:
>     <am2pr05mb0771310015102ed8ed29d1b1ae...@am2pr05mb0771.eurprd05.prod.outlook.com>
> Content-Type: text/plain; charset="utf-8"
>
> are you able to reproduce this error with ib verbs bw test? I hope, you
> are running on lossless Ethernet fabric setup and selecting correct VLAN .
>
> -Devendar
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Turner
> Sent: Wednesday, January 28, 2015 4:31 PM
> To: de...@open-mpi.org
> Subject: [OMPI devel] mlx4 QP operation err
>
>
> I'm testing RoCE on 40 Gbps Mellanox ethernet cards and am getting a
> mlx4 QP operation error every time it gets to testing 132 kB packets. These
> are aggregate tests in that 16 cores on one host are doing bi-directional
> ping-pongs to 16 cores on another host across the Mellanox cards.
>
> I've found some old references to similar mlx4 errors dating back to
> 2009 that lead me to believe this may be a firmware error.
> I believe we're running the most up to date version of the firmware.
>
> Could someone comment on whether these are firmware issues, and
> if so how to report them to Mellanox? I've attached some files with more
> detailed information on this problem.
>
> Dave Turner
>
> --
> Work: davetur...@ksu.edu<mailto:davetur...@ksu.edu> (785) 532-7791
>       118 Nichols Hall, Manhattan KS 66502
> Home: drdavetur...@gmail.com<mailto:drdavetur...@gmail.com>
>       cell: (785) 770-5929
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 3
> Date: Wed, 28 Jan 2015 22:45:02 -0600
> From: Todd Kordenbrock <thkgc...@gmail.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] MTL interfaces
> Message-ID:
>     <caegoymdbqszdqqpb2dwcth392eds6jkkxqba1ffxzayereh...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Jeff,
>
> I can attend at that time.
>
> todd
>
>
> On Wed, Jan 28, 2015 at 3:55 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
>
> > Ryan / Sandia (anyone else who cares about MTL interfaces):
> >
> > Can you attend a webex tomorrow at 1pm US Central to discuss adding
> > one-sided interfaces to the MTL?
> > (it must be before 2pm US Central)
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16831.php
> >
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 4
> Date: Thu, 29 Jan 2015 14:52:43 +0000
> From: "Friedley, Andrew" <andrew.fried...@intel.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] For discussion tomorrow: MTL issues
> Message-ID:
>     <0429c22ebdb44040b478f8f77ef14518ca3...@orsmsx114.amr.corp.intel.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Is there anything written up about recent Open MPI one-sided work?
> Looking for something beyond just the code that I can study up on.. papers,
> design docs, future plans, etc.
>
> Andrew
>
> > -----Original Message-----
> > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Wednesday, January 28, 2015 4:26 PM
> > To: Open MPI Developers List
> > Subject: [OMPI devel] For discussion tomorrow: MTL issues
> >
> > MTL authors --
> >
> > We had *some* discussion of MTL issues this afternoon in the room, but
> > need your input (since most of you are not here). Here's what we'd like to
> > talk about tomorrow (and we realize you might not have answers for this
> > tomorrow).
> >
> > Short version: based on Mellanox's experience, why not ditch the CM PML
> > and have all current MTLs move up to be PMLs?
> >
> > More detail:
> >
> > We all know that Mellanox moved their MXM MTL up to be a PML. The short
> > version of "why did they do this?" is because CM really added no value for
> > MXM.
> > Literally, all it did was add overhead:
> >
> > 1. translate some OMPI data structures to a neutral/CM data structure
> > 2. which was then translated into the MXM data structures
> > 3. then call MXM
> >
> > So why not chop out one of those layers:
> >
> > 1. translate OMPI data structures into MXM data structures
> > 2. then call MXM
> >
> > Taking a crass look at the existing MTLs, we wonder if it would be worthwhile
> > to do the same thing for all of them. It doesn't seem (to us) that it would be
> > a lot of work -- the PML and MTL interfaces are quite similar. And there could
> > be message rate improvements for those MTLs-turned-PMLs, just like it did
> > for MXM/yalla.
> >
> > *If* this is a good assumption -- that MTLs should all become PMLs -- then
> > MPI one-sided operations become the next logical question. I.e., what
> > happens when you call MPI_PUT / MPI_GET / etc.?
> >
> > Right now, you'll end up using the osc/pt2pt component, which will use PML
> > calls to effect MPI RMA functionality over the PML interface. Which is fine,
> > and will work correctly in all cases.
> >
> > However, MTL-turned-PML authors will then have the option of writing an
> > osc/YOUR_COMPONENT for doing optimized MPI-one-sided operations on
> > your network.
> >
> > This is what we would like to discuss with you tomorrow. Tell us that this idea
> > is crazy, or that it's ok, or that you need to think about it, ...etc. Let's chat.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16836.php
>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 29 Jan 2015 15:02:25 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: [OMPI devel] Webex for this morning
> Message-ID: <bc07da89-30cb-4dde-b8ee-efb46c2d8...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> We're just starting this morning. We've joined a running webex if anyone
> feels like joining. Here's what we'll be talking about this morning:
>
> https://github.com/open-mpi/ompi/wiki/Meeting-2015-01#topics-still-to-discuss
>
> The MTL discussion will be at 1pm US Central today. It'll *probably* be
> the same webex link. I'll send out whatever the correct webex link is --
> even if it's the same one -- slightly before the 1pm US Central start time
> today.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 29 Jan 2015 15:04:05 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: Re: [OMPI devel] For discussion tomorrow: MTL issues
> Message-ID: <422d01eb-a42b-4f8f-b33d-a5c577217...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> On Jan 29, 2015, at 8:52 AM, Friedley, Andrew <andrew.fried...@intel.com>
> wrote:
> >
> > Is there anything written up about recent Open MPI one-sided work?
> > Looking for something beyond just the code that I can study up on..
> > papers, design docs, future plans, etc.
>
> Doubtful.
>
> I think the main intent for the original discussion is that Nathan had
> some ideas about extending the MTL interface to include some one-sided
> functionality so MPI_PUT/MPI_GET/etc. could turn into nice native RMA
> support in MTLs as well.
>
> That discussion, however, combined with the MXM/Yalla discussion, turned
> into the ideas / email I sent yesterday.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 29 Jan 2015 15:06:35 +0000
> From: "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
> To: Open MPI Developers List <de...@open-mpi.org>
> Subject: Re: [OMPI devel] Webex for this morning
> Message-ID: <c9328216-ae7f-4572-82e5-6f03c4fde...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Sigh. That's the wiki link, not the webex link. :-)
>
> Here's the webex link for this morning:
>
> https://cisco.webex.com/cisco/e.php?MTID=m5da65867500cfd51e7a1ed895b2e2a8d
>
>
>
> On Jan 29, 2015, at 9:02 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> >
> > We're just starting this morning. We've joined a running webex if
> > anyone feels like joining. Here's what we'll be talking about this morning:
> >
> > https://github.com/open-mpi/ompi/wiki/Meeting-2015-01#topics-still-to-discuss
> >
> > The MTL discussion will be at 1pm US Central today. It'll *probably* be
> > the same webex link. I'll send out whatever the correct webex link is --
> > even if it's the same one -- slightly before the 1pm US Central start time
> > today.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/01/16842.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ------------------------------
>
> End of devel Digest, Vol 2905, Issue 1
> **************************************
>

--
Work: davetur...@ksu.edu (785) 532-7791
      118 Nichols Hall, Manhattan KS 66502
Home: drdavetur...@gmail.com
      cell: (785) 770-5929