Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Joe Eykholt Wed, 18 Mar 2009 12:24:05 -0700

Mike Christie wrote:
> Joe Eykholt wrote:
>> Joe Eykholt wrote:
>>> Sivaram Kannan wrote:
>>>> Hi all,
>>>>
>>>> I have created a patch for Open-Fcoe-Target which could compile
> against the latest kernel test
>>> branch(linux-2.6.29-rc8). The same patch will also compile against
> the latest stable version 2.6.28. I exposed the
>>> target and when I triggered the initiator, I could see sequence of
> SCSI packets at the wireshark. I did not understand
>>> the messages completely, I am decoding it. One problem that I noticed
> during the testing is with the initiator. For
>>> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I
> give the command "echo "ethX" >
>>> /sys/module/fcoe/parameters/create", I am getting a kernel opps.
> There is no trouble at the target end, wireshark was
>>> still printing SCSI packets. Would attaching the wireshark trace from
> target end would be helpful to identify what is
>>> the problem?
>>>
>>> I see a problem, too.  Maybe its the same one, or maybe another one.
>>>
>>> With no LUNs exported from the target, the latest initiator
> (fcoe-features.git as of 3/17/09
>>> just after the FIP patches were integrated) I get this crash:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
>>> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>>> PGD 7bc33067 PUD 7bc64067 PMD 0
>>> Oops: 0000 [#1] SMP
>>> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
>>> CPU 0
>>> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
>>> Pid: 4213, comm: fcoethread/0 Tainted: G        W  2.6.29-rc7-feat2
> #1 PowerEdge SC1425
>>> RIP: 0010:[<ffffffffa003b733>]  [<ffffffffa003b733>]
> fc_fcp_resp+0xe7/0x1fc [libfc]
>>> RSP: 0018:ffff88007249dd80  EFLAGS: 00010246
>>> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
>>> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
>>> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
>>> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
>>> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
>>> FS:  0000000000000000(0000) GS:ffffffff808b2080(0000)
> knlGS:0000000000000000
>>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task
> ffff8800724ea0c0)
>>> Stack:
>>>  ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
>>>  ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
>>>  ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
>>> Call Trace:
>>>  [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
>>>  [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
>>>  [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
>>>  [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
>>>  [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
>>>  [<ffffffff8024b7ef>] kthread+0x49/0x76
>>>  [<ffffffff8020c63a>] child_rip+0xa/0x20
>>>  [<ffffffff8020c03c>] ? restore_args+0x0/0x30
>>>  [<ffffffff8024b7a6>] ? kthread+0x0/0x76
>>>  [<ffffffff8020c630>] ? child_rip+0x0/0x20
>>> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04
> 41 b9 60 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
>>> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c
> 74 49 31 c0
>>> RIP  [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>>>  RSP <ffff88007249dd80>
>>> CR2: 0000000000000088
>>> ---[ end trace 0be88afb67690d86 ]---
>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>
>>>
>>> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
>>>
>>>                         if (flags & FCP_SNS_LEN_VAL) {
>>>                                 snsl = ntohl(rp_ex->fr_sns_len);
>>>                                 if (snsl > SCSI_SENSE_BUFFERSIZE)
>>>                                         snsl = SCSI_SENSE_BUFFERSIZE;
>>> ==> blew up here since fsp->cmd is NULL.
>>>                                 memcpy(fsp->cmd->sense_buffer,
>>>                                        (char *)fc_rp_info + respl, snsl);
>>>                         }
>>>
>>> For those that like assembly:
>>>
>>>     1ae3:       0f c8                   bswap  %eax
>>>     1ae5:       83 f8 60                cmp    $0x60,%eax
>>>     1ae8:       44 0f 46 c8             cmovbe %eax,%r9d
>>>     1aec:       48 8b 43 40             mov    0x40(%rbx),%rax
>>>     1af0:       44 89 c9                mov    %r9d,%ecx
>>> ----- blew up here fetching pointer to memcpy destination.
>>>     1af3:       48 8b b8 88 00 00 00    mov    0x88(%rax),%rdi
>>>     1afa:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
>>>
>>> The trace shows a check condition response to a Test Unit Ready.
>>> I haven't looked further.  I don't know why fsp->cmd is NULL.
>>>
>>>      Joe
>>
>> In case this helps, here's a trace.  There were some timeouts so maybe
> that
>> leads to the fsp->cmd being NULL for that final test-unit-ready.
>>
>> ./tshark -n -r ~/ws/traces/jre.cap
>> Running as user "root" and group "root". This could be dangerous.
>>
>>                    S_ID        D_ID         OX_ID  RX_ID
>>   1   0.000000     00.00.00 -> ff.ff.fe     0x67f 0xffff FC ELS FLOGI
>>   2   1.999918     00.00.00 -> ff.ff.fe     0x680 0xffff FC ELS FLOGI
>>   3   3.999852     00.00.00 -> ff.ff.fe     0x681 0xffff FC ELS FLOGI
>>   4   5.999783     00.00.00 -> ff.ff.fe     0x682 0xffff FC ELS FLOGI
>>   5   7.999716     00.00.00 -> ff.ff.fe     0x683 0xffff FC ELS FLOGI
>>   6   9.999650     00.00.00 -> ff.ff.fe     0x684 0xffff FC ELS FLOGI
>>   7  11.999582     00.00.00 -> ff.ff.fe     0x685 0xffff FC ELS FLOGI
>>   8  13.999516     00.00.00 -> ff.ff.fe     0x686 0xffff FC ELS FLOGI
>>   9  14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP
> Solicitation
>>  10  15.999447     00.00.00 -> ff.ff.fe     0x687 0xffff FC ELS FLOGI
>>  11  15.999761     ff.ff.fe -> 00.00.00     0x687 0x2 FC ELS ACC (FLOGI)
>>  12  16.003827     01.01.01 -> 01.01.02     0x3 0x688 FC ELS PLOGI
>>  13  16.003845     01.01.01 -> 01.01.02     0x4 0x689 FC ELS PLOGI
>>  14  16.003857     01.01.01 -> ff.ff.fc     0x5 0x68a dNS GPN_FT
>>
>> Why are we doing discovery in a point-to-point setup?  We should skip
> that.
>> Note that this is from the initiator.  The old code in the target
> correctly
>> doesn't do discovery.  This may be harmless, though.
>>
>>  15  16.010592     01.01.02 -> 01.01.01     0x3 0x688 FC ELS ACC (PLOGI)
>>  16  16.010617     01.01.02 -> 01.01.01     0x4 0x689 FC ELS ACC (PLOGI)
>>  17  16.010813     01.01.01 -> 01.01.02     0x6 0xffff FC ELS PRLI
>>  18  16.011348     01.01.02 -> 01.01.01     0x6 0x68b FC ELS ACC (PRLI)
>>  19  16.011495     01.01.01 -> 01.01.02     0x7 0xffff FC ELS RTV
>>  20  16.011606     01.01.02 -> 01.01.01     0x7 0x68c FC ELS LS_RJT (RTV)
>>  21  16.013032     01.01.01 -> 01.01.02     0x8 0xffff FCP SCSI:
> Inquiry LUN: 0x00
>>  22  16.021367     01.01.02 -> 01.01.01     0x8 0x68d FCP SCSI:
> Response LUN: 0x00 (Inquiry) (Check Condition) LUN:0x00
>>  23  16.028854     01.01.01 -> 01.01.02     0x9 0xffff FC ELS REC
>>
>> REC is asking about OX_ID 8.  Not sure why we sent an REC here,
>> the response should be the end of that exchange.
>>
>> We reject the incoming REC.  Why?  It says 'command not supported'
>> but should say invalid OX_ID or something indicating the XID is complete.
>>
>>  24  16.028894     01.01.02 -> 01.01.01     0x9 0x68e FC ELS LS_RJT (REC)
>>  25  17.999900     01.01.01 -> ff.ff.fc     0x5 0xffff BLS ABTS
>>
>> discovery timed out.
>>
>>  26  22.256120     01.01.01 -> 01.01.02     0x8 0x68d BLS ABTS
>>
>> The above ABTS seems like a bug ... we should've seen the response
> above.  Why was the REC sent?
>> 10 seconds later, the Test Unit Ready goes out and we die after the
> response.  This seems pretty
>> normal and unrelated to the previous stuff.
>>
>>  27  32.256311     01.01.01 -> 01.01.02     0xa 0xffff FCP [FCP
> LU_RESET] SCSI: Test Unit Ready LUN: 0x00
> 
> I did not grep this output.


You mean 'grok'? :-)   This is tshark/wireshark output.
I see what you mean, it's ambiguous.

I guess it's really a LU reset.

The task management flags in the FCP header are 0x10, LU reset.
The opcode was 0 which I guess is irrelevant, but tshark interprets
it as normal: test unit ready.  So does the old openfc target code,
it turns out.  It's a very bare bones implementation which doesn't
handle this correctly.

> Is this is a lu reset TMF or a TUR?
> 
>>  28  32.264601     01.01.02 -> 01.01.01     0xa 0x68f FCP SCSI:
> Response LUN: 0x00 (Test Unit Ready) (Check Condition)
>> LUN:0x00
>>
> 
> if it is a lu reset TMF then the fsp would not have a scsi command set.
> Then if we did get a check condition on a TMF then we would hit that
> null ptr since we do not expect to ever get a check condition with sense
> for a tmf.

Three bugs revealed here:
        - fcp crashes from the sense.
        - tshark interprets irrelevant opcode in both request and reply.
        - target code sends sense on LU reset.

I'm only planning to fix the first of these for now.  I'll generate a patch.

> It would be weird to be sending a TUR when we have no devices. I do not
> think scsi_scan.c will do this. Well, it would be weird to be sedning a
> lu reset when we have no device too, so something is screwy.
> 
> I think we can get to the lu reset case though if one of the scsi_scan.c
> IOs timesout. It looks like a device might be on the host __devices list
>   (scsi_alloc_sdev will call scsi_sysfs_device_initialize and that adds
> it to the __devices list a little early (it adds it before we scan and
> not when the device is actually known to be there). So when
> scsi_eh_bus_device_reset runs we would try to reset it even though it is
> not setup/known.

Thanks a lot, Mike.  I figured you'd be able to shed light on this.

        Joe

> 
> 
>> I don't see how fsp->cmd can be NULL in this case.
>> Maybe someone with more recent familiarity with fc_fcp.c will see the
> problem.
>>
>>       Cheers,
>>       Joe
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> http://www.open-fcoe.org/mailman/listinfo/devel
> 

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Reply via email to