On 10/06/2015 05:49 PM, Or Gerlitz wrote:
> On Wed, Oct 7, 2015 at 12:26 AM, Doug Ledford <[email protected]> wrote:
>
>> Nothing so simple unfortunately.  And it isn't an IB/RoCE cluster, it's
>> an IB/IB/OPA/RoCE/iWARP cluster.  Regardless though, that's not my
>> problem and what I'm chasing.
>
> To be precise, no two transports out of IB/RoCE/iWARP/OPA are
> inter-operable, so these are "just" different cards/transports under
> the same IB core on this cluster.
Except that some machines have links to as many as four of the different
fabrics, so a problem in one can affect testing of the others.

>> Yes, I know how to do DOA testing.
>
> So what's dead in your env after (say) 59m of examination?

It's not dead after 59m, it's DOA immediately.  And it's iSER.  But the
details are more complex than "iSER is DOA".  It was DOA when running a
rhel7 kernel (internal, for the next kernel, not a release kernel).
That kernel is pretty close to upstream.  When I went to put an upstream
kernel on there to see if it had the same issue, the upstream kernel on
that machine oopses on boot.  It oopses in list_add, but the backtrace
doesn't list any usable information about who called list_add with bogus
data.  However, reliably, right before the oops, the csiostor driver
fails to load properly, so I'm going with that being the likely culprit.

But each iteration is slow, because when the rhel7 kernel's iSER does
its thing, it causes a hung reboot, and it also crashes the iDRAC in the
machine (errant drivers crashing a baseboard management controller is
never a good sign), so the reboot must be done via a hard power cycle.
When the upstream kernel oopses on boot, at least the iDRAC is still
working.  As a result, each test iteration is pretty slow.

There we go, bootup on a 4.3-rc4 kernel with the cxgb4 FCoE driver
disabled succeeded.  A hurdle passed.  Now I can test upstream iSER.

With an upstream kernel, the drive is still read-only with iSER (it's
not configured that way to the best of my knowledge, but I'm using
auto-generated ACLs; I'm getting ready to switch the system to specific
ACLs instead), but the thread isn't stuck in D state, so that's an
improvement.  However, the machine is still crashing the iDRAC on
reboot.  I can't be certain whether it's the SRP target or the iSER
target causing this, as they were both brought up live at the same time,
and reboot cycles without either of them work fine.
So I have more investigation to go before I know exactly what's going
on.  And as I pointed out, each iteration is slow :-/

>>> What we do know that needs fixing for 4.3-rc:
>>> --> RoCE: you need the patch re-posted by Haggai a few hours ago,
>>> "IB/cma: Accept connection without a valid netdev on RoCE" -- without
>>> it, RoCE isn't working.
>
>> I have that already.  It's available on both github and k.o and just
>> waiting for a pull request.
>
> Maybe wait to get the fixes for the non-default pkey on mlx5 (see more
> below)?
>
> Did you actually note that before Haggai posted the patch?!

No.

> Once I realized how deep the breakage was, I became sort of very
> worried re your testing env not shouting hard at us that something was
> broken even before 4.3-rc1.

My test environment has been down for upgrades.  In the last little bit
we've brought a second rack online, added 10 new machines and 3 new
switches, and moved existing machines around between the two racks in
order to more evenly balance the need for each port type across the
switches in the two racks.  There's been more than that going on behind
the scenes here too, but it's not really worth getting into all of it.
Suffice it to say I've been working on A) expanding the cluster, B)
expanding the things the cluster is configured to do and therefore able
to test, and C) finding a way to get upstream code into this testing
framework, since it was previously all rhel/fedora centric.  And this
test infrastructure goes down by COB Thursday of this week and won't be
back for a week, because it's being used for NFSoRDMA testing at this
fall's Bake-a-thon.

>>> --> **mlx5** devices and non-default IB pkeys: Haggai and Co. are
>>> working on a fix, since this hasn't been working since 4.3-rc1.  I
>>> told them we need it by rc5.5 (i.e. a few days before rc6), and if
>>> not, we will have to revert some 4.3-rc1 bits.
>
>> I already have one patch related to this in my repo as well.  The 0day
>> testing just came back and it's all good.
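For reference, exercising the non-default-pkey path needs real IB hardware, so this is only a command sketch, not something runnable here.  The pkey (0x8002), the parent device (ib0), and the address are all example values; the pkey must also be configured in the subnet manager's partition setup:

```shell
# Hedged sketch: create an IPoIB child interface on a non-default pkey
# via the standard ipoib sysfs interface (0x8000 membership bit set).
echo 0x8002 > /sys/class/net/ib0/create_child
ip addr add 192.168.2.1/24 dev ib0.8002
ip link set ib0.8002 up

# On the server side (e.g. the mlx5 node):
rping -s -a 192.168.2.1 -v

# On the client side (e.g. the mlx4 node, to cross device generations):
rping -c -a 192.168.2.1 -v
```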
> I suspect that you don't...

I meant build tests passed, not run tests.

> Do you have rping up and running between mlx4 and mlx5 on a
> non-default pkey?  The breakage is a bit tricky and you might not see
> it if you run mlx5 against mlx5.  BTW, which patch is that?

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD