[ +Cc Hannes ]

On Thu, Aug 09, 2018 at 04:41:24PM +0200, ard wrote:
> Hi,
> 
> On Thu, Aug 09, 2018 at 12:01:30PM +0200, ard wrote:
> > This to determine if we have a single regression in just the
> > login handling or both.
> As a matter of fact, I think this will not work on vn2vn:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/libfc?id=386b97b43c0c9e0d878eec7ea1db16af22b036ae
> scsi: libfc: Rework PRLI handling
> <theorising>
> As it clearly rejects the PRLI and eventually sends a LOGO when
> no store is associated.
> But as san setup goes: it first attaches the network driver to
> the fcoe layer and hence is store less. Then it attaches stores
> to export. But at that moment all possible initiators already got
> a LOGO
> </theorising>
> 
> But looking back at the logs, it is not all that bad as what I
> said.  It's not erroring ad infinitum, it's: 
> 
> root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 
> 2018-08-08-kern.log|cut -d\  -f9-|uniq -c
>       1   2.577320] kmemleak: Kernel memory leak detector initialized
>       1   2.577350] kmemleak: Automatic memory scanning thread started
>       1 136.452894] kmemleak: 1 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>       1 host10: fip: vn_add rport 00c76e new state 0
>       1 host10: fip: vn_add rport 00c76e old state 0
>       8 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 2 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 4 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>       2 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 2 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      52 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 50 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 47 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 55 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      52 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 46 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      50 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 46 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>      36 host10: fip: vn_add rport 00c76e old state 4
>       1 kmemleak: 50 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>       1 kmemleak: 36 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>       1 kmemleak: 2 new suspected memory leaks (see 
> /sys/kernel/debug/kmemleak)
>       1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> 
> From the last few lines we can clearly see that de vn_add rport ...  old 
> state 4
> coincides with a kmemleak 10 minutes later. especially 50 and 36. Now the 
> exact
> dump follows for that.
> 
> Now to get back to what is the difference:
> This is a working login with a 4.9:
> 
> Aug  8 10:53:15 localhost kernel: [   14.929451] host10: fip: vn_add rport 
> 0004e0 old state 0
> Aug  8 10:53:21 localhost kernel: [   23.143274] host10: fip: beacon from 
> rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143275] host10: fip: beacon expired 
> for rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143276] host10: rport 0004e0: Login 
> to port
> Aug  8 10:53:21 localhost kernel: [   23.143277] host10: rport 0004e0: 
> Entered FLOGI state from Init state
> Aug  8 10:53:21 localhost kernel: [   23.143294] host10: fip: els_send op 7 
> d_id 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143301] host10: fip: beacon from 
> rport 4e0
> Aug  8 10:53:21 localhost kernel: [   23.143920] host10: rport 0004e0: 
> Received a FLOGI accept
> Aug  8 10:53:21 localhost kernel: [   23.143921] host10: rport 0004e0: Port 
> entered PLOGI state from FLOGI state
> Aug  8 10:53:21 localhost kernel: [   23.143956] host10: rport 0004e0: 
> Received a PLOGI accept
> Aug  8 10:53:21 localhost kernel: [   23.143958] host10: rport 0004e0: Port 
> entered PRLI state from PLOGI state
> Aug  8 10:53:21 localhost kernel: [   23.143982] host10: rport 0004e0: 
> Received a PRLI accept
> Aug  8 10:53:21 localhost kernel: [   23.143984] host10: rport 0004e0: PRLI 
> spp_flags = 0x0 spp_type 0x20
> Aug  8 10:53:21 localhost kernel: [   23.143985] host10: rport 0004e0: Error 
> -6 in state PRLI, retrying
> Aug  8 10:53:21 localhost kernel: [   23.144434] host10: rport 0004e0: 
> Received PRLI request while in state PRLI
> Aug  8 10:53:21 localhost kernel: [   23.144445] host10: rport 0004e0: PRLI 
> rspp type 8 active 1 passive 0
> Aug  8 10:53:21 localhost kernel: [   23.749492] host10: rport 0004e0: 
> Received RTV request
> Aug  8 10:53:23 localhost kernel: [   25.204559] host10: rport 0004e0: Port 
> timeout, state PRLI
> Aug  8 10:53:23 localhost kernel: [   25.347400] host10: rport 0004e0: Port 
> entered PRLI state from PRLI state
> Aug  8 10:53:23 localhost kernel: [   25.393058] host10: rport 0004e0: 
> Received a PRLI accept
> Aug  8 10:53:23 localhost kernel: [   25.398412] host10: rport 0004e0: PRLI 
> spp_flags = 0x21 spp_type 0x8
> Aug  8 10:53:23 localhost kernel: [   25.404853] host10: rport 0004e0: Port 
> entered RTV state from PRLI state
> Aug  8 10:53:23 localhost kernel: [   25.450291] host10: rport 0004e0: 
> Received a RTV reject
> Aug  8 10:53:23 localhost kernel: [   25.455620] host10: rport 0004e0: Port 
> is Ready
> Aug  8 10:53:23 localhost kernel: [   25.460185] host10: rport 0004e0: work 
> event 1
> Aug  8 10:53:23 localhost kernel: [   25.464672] host10: rport 0004e0: No 
> rport!
> Aug  8 10:53:23 localhost kernel: [   25.469577] host10: rport 0004e0: 
> callback ev 1
> Aug  8 10:53:23 localhost kernel: [   25.474152] host10: fip: 
> vn_rport_callback 4e0 event 1
> Aug  8 10:53:30 localhost kernel: [   31.850891] host10: fip: beacon from 
> rport 4e0
> 
> This is the error retry with a 4.16 unpatched:
> 
> Aug  8 11:33:50 localhost kernel: [ 2452.565524] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 
> 00c76e new state 0
> Aug  8 11:33:50 localhost kernel: [ 2452.576898] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 
> 00c76e old state 0
> Aug  8 11:33:57 localhost kernel: [ 2458.854081] host10: fip: beacon from 
> rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.858622] host10: fip: beacon expired 
> for rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.863760] host10: rport 00c76e: Login 
> to port
> Aug  8 11:33:57 localhost kernel: [ 2458.868455] host10: rport 00c76e: 
> Entered FLOGI state from Init state
> Aug  8 11:33:57 localhost kernel: [ 2458.875177] host10: fip: els_send op 7 
> d_id c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.886089] host10: fip: beacon from 
> rport c76e
> Aug  8 11:33:57 localhost kernel: [ 2458.896069] host10: rport 00c76e: 
> Received a FLOGI accept
> Aug  8 11:33:57 localhost kernel: [ 2458.901716] host10: rport 00c76e: Port 
> entered PLOGI state from FLOGI state
> Aug  8 11:33:57 localhost kernel: [ 2458.934161] host10: rport 00c76e: 
> Received a PLOGI accept
> Aug  8 11:33:57 localhost kernel: [ 2458.939651] host10: rport 00c76e: Port 
> entered PRLI state from PLOGI state
> Aug  8 11:33:57 localhost kernel: [ 2458.985427] host10: rport 00c76e: 
> Received a PRLI reject
> Aug  8 11:33:57 localhost kernel: [ 2458.990817] host10: rport 00c76e: PRLI 
> ELS rejected, reason 5 expl 0
> Aug  8 11:33:57 localhost kernel: [ 2458.997255] host10: rport 00c76e: Error 
> 5 in state PRLI, retrying
> Aug  8 11:33:59 localhost kernel: [ 2461.060167] host10: rport 00c76e: Port 
> timeout, state PRLI
> Aug  8 11:33:59 localhost kernel: [ 2461.065688] host10: rport 00c76e: Port 
> entered PRLI state from PRLI state
> Aug  8 11:33:59 localhost kernel: [ 2461.111048] host10: rport 00c76e: 
> Received a PRLI reject
> Aug  8 11:33:59 localhost kernel: [ 2461.116388] host10: rport 00c76e: PRLI 
> ELS rejected, reason 9 expl 1e
> Aug  8 11:33:59 localhost kernel: [ 2461.122927] host10: rport 00c76e: Error 
> 5 in state PRLI, retrying
> Aug  8 11:34:01 localhost kernel: [ 2463.172152] host10: rport 00c76e: Port 
> timeout, state PRLI
> Aug  8 11:34:01 localhost kernel: [ 2463.177696] host10: rport 00c76e: Port 
> entered PRLI state from PRLI state
> Aug  8 11:34:01 localhost kernel: [ 2463.224578] host10: rport 00c76e: 
> Received a PRLI reject
> Aug  8 11:34:01 localhost kernel: [ 2463.230023] host10: rport 00c76e: PRLI 
> ELS rejected, reason 9 expl 1e
> Aug  8 11:34:01 localhost kernel: [ 2463.236765] host10: rport 00c76e: Error 
> 5 in state PRLI, retrying
> Aug  8 11:34:03 localhost kernel: [ 2465.283937] host10: rport 00c76e: Port 
> timeout, state PRLI
> Aug  8 11:34:03 localhost kernel: [ 2465.289424] host10: rport 00c76e: Port 
> entered PRLI state from PRLI state
> Aug  8 11:34:03 localhost kernel: [ 2465.335150] host10: rport 00c76e: 
> Received a PRLI reject
> Aug  8 11:34:03 localhost kernel: [ 2465.340540] host10: rport 00c76e: PRLI 
> ELS rejected, reason 9 expl 1e
> Aug  8 11:34:03 localhost kernel: [ 2465.347068] host10: rport 00c76e: Error 
> -5 in state PRLI, retries 3
> Aug  8 11:34:03 localhost kernel: [ 2465.353411] host10: rport 00c76e: Port 
> sending LOGO from PRLI state
> Aug  8 11:34:03 localhost kernel: [ 2465.359760] host10: fip: els_send op 9 
> d_id c76e
> Aug  8 11:34:05 localhost kernel: [ 2467.046011] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:05 localhost kernel: [ 2467.050549] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.857393] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 
> 00c76e old state 4
> Aug  8 11:34:09 localhost kernel: [ 2470.868764] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 
> 00c76e old state 4
> Aug  8 11:34:13 localhost kernel: [ 2475.238121] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:13 localhost kernel: [ 2475.242660] host10: fip: beacon expired 
> for rport c76e
> Aug  8 11:34:13 localhost kernel: [ 2475.247804] host10: rport 00c76e: port 
> already started
> 
> Repeat:
> 
> Aug  8 11:34:13 localhost kernel: [ 2475.253076] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:21 localhost kernel: [ 2483.429903] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:21 localhost kernel: [ 2483.434500] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:29 localhost kernel: [ 2491.621938] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:29 localhost kernel: [ 2491.626478] host10: fip: beacon from 
> rport c76e
> 
> This part contains a memory leak after a timeout:
> Aug  8 11:34:33 localhost kernel: [ 2495.433102] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 
> 00c76e old state 4
> Aug  8 11:34:33 localhost kernel: [ 2495.444266] host10: fip: 
> vn_claim_notify: send reply to c76e
> Aug  8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 
> 00c76e old state 4
> Aug  8 11:34:38 localhost kernel: [ 2499.813722] host10: fip: beacon from 
> rport c76e
> Aug  8 11:34:38 localhost kernel: [ 2499.818256] host10: fip: beacon expired 
> for rport c76e
> Aug  8 11:34:38 localhost kernel: [ 2499.823402] host10: rport 00c76e: port 
> already started
> 
> You can find the raw kmemleak and kern.log here:
> https://github.com/hardkernel/linux/issue_comments#issuecomment-411772096
> 
> And the some filtered logs a bit more up.
> 
> The logs when it all went ok:
> https://github.com/hardkernel/linux/issues/360#issuecomment-411335647
> Especially:
> Aug  7 17:54:30 localhost kernel: [   15.540607] host10: fip: claim resp from 
> from rport c76e - state VNMP_UP
> Aug  7 17:54:30 localhost kernel: [   15.548710] host10: fip: vn_add rport 
> 00c76e new state 0
> Aug  7 17:54:30 localhost kernel: [   15.555463] host10: fip: claim resp from 
> from rport c76e - state VNMP_UP
> Aug  7 17:54:30 localhost kernel: [   15.563584] host10: fip: vn_add rport 
> 00c76e old state 0
> Aug  7 17:54:30 localhost kernel: [   15.696327] host10: fip: beacon expired 
> for rport c76e
> Aug  7 17:54:30 localhost kernel: [   15.702975] host10: rport 00c76e: Login 
> to port
> Aug  7 17:54:30 localhost kernel: [   15.707617] host10: rport 00c76e: 
> Entered FLOGI state from Init state
> Aug  7 17:54:30 localhost kernel: [   15.714224] host10: fip: els_send op 7 
> d_id c76e
> Aug  7 17:54:30 localhost kernel: [   15.736741] host10: rport 00c76e: 
> Received a FLOGI accept
> Aug  7 17:54:30 localhost kernel: [   15.742572] host10: rport 00c76e: Port 
> entered PLOGI state from FLOGI state
> Aug  7 17:54:30 localhost kernel: [   15.749695] host10: rport 00c76e: 
> Received a PLOGI accept
> Aug  7 17:54:30 localhost kernel: [   15.749696] host10: rport 00c76e: Port 
> entered PRLI state from PLOGI state
> Aug  7 17:54:30 localhost kernel: [   15.750112] host10: rport 00c76e: 
> Received a PRLI reject
> Aug  7 17:54:30 localhost kernel: [   15.750113] host10: rport 00c76e: PRLI 
> ELS rejected, reason 9 expl 1e
> Aug  7 17:54:30 localhost kernel: [   15.750114] host10: rport 00c76e: Error 
> 5 in state PRLI, retrying
> Aug  7 17:54:30 localhost kernel: [   15.750360] host10: rport 00c76e: 
> Received PRLI request while in state PRLI
> Aug  7 17:54:30 localhost kernel: [   15.750371] host10: rport 00c76e: PRLI 
> rspp type 8 active 1 passive 0
> Aug  7 17:54:30 localhost kernel: [   15.750570] host10: rport 00c76e: 
> Received RTV request
> Aug  7 17:54:30 localhost kernel: [   17.758897] host10: rport 00c76e: Port 
> timeout, state PRLI
> Aug  7 17:54:30 localhost kernel: [   17.764559] host10: rport 00c76e: Port 
> entered PRLI state from PRLI state
> Aug  7 17:54:30 localhost kernel: [   17.812081] host10: rport 00c76e: 
> Received a PRLI accept
> Aug  7 17:54:31 localhost kernel: [   17.818809] host10: rport 00c76e: PRLI 
> spp_flags = 0x21 spp_type 0x8
> Aug  7 17:54:31 localhost kernel: [   17.826634] host10: rport 00c76e: Port 
> entered RTV state from PRLI state
> Aug  7 17:54:31 localhost kernel: [   17.878821] host10: rport 00c76e: 
> Received a RTV accept
> Aug  7 17:54:31 localhost kernel: [   17.885478] host10: rport 00c76e: Port 
> is Ready
> Aug  7 17:54:31 localhost kernel: [   17.891513] host10: rport 00c76e: work 
> event 1
> Aug  7 17:54:31 localhost kernel: [   17.897386] host10: rport 00c76e: No 
> rport!
> Aug  7 17:54:31 localhost kernel: [   17.903698] host10: rport 00c76e: 
> callback ev 1
> Aug  7 17:54:31 localhost kernel: [   17.909081] host10: fip: 
> vn_rport_callback c76e event 1
> Aug  7 17:54:34 localhost kernel: [   20.992106] host10: rport 00c76e: 
> Received ADISC request
> 
> So in the "working" case I see that the other node (4.16 unpatched) accepts 
> our
> PRLI, if I interpret that right, and in the other case it
> doesn't. Same system, different boot order though, but I guess
> the boot order doesn't matter. 
> When it doesn't work, it gets or sends a mutual LOGO and gets into a weird 
> state.
> 
> Anyway: as far as I interpreted it, in the vn2vn situation alle
> enodes should know all other rports to prevent a double rport
> assignment. I have no experience with FC itself though, so I
> don't know what the normal state is but sending a LOGO due to a
> failure to set up, is that ok?
> Or is the vn2vn rport maintenance on another layer?
> 
> The 4.9 kernel is still to be baked...
> I expect it to yield minor kmemleaks, until I start toggling the
> FCoE.
> 
> Regards,
> Ard van Breemen
> 
> PS: I removed some cross spamming from the Cc:, I hope that's ok.

Full quote for reference. This looks like the PRLI rework from Hannes
introduced the memleaks here.

Hannes can you have a look at it?

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Reply via email to