Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Mike Christie Wed, 18 Mar 2009 11:05:23 -0700

Joe Eykholt wrote:
> Joe Eykholt wrote:
>> Sivaram Kannan wrote:
>>> Hi all,
>>>
>>> I have created a patch for Open-Fcoe-Target which could compile against the 
>>> latest kernel test
>> branch(linux-2.6.29-rc8). The same patch will also compile against the 
>> latest stable version 2.6.28. I exposed the
>> target and when I triggered the initiator, I could see sequence of SCSI 
>> packets at the wireshark. I did not understand
>> the messages completely, I am decoding it. One problem that I noticed during 
>> the testing is with the initiator. For
>> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I give the 
>> command "echo "ethX" >
>> /sys/module/fcoe/parameters/create", I am getting a kernel opps. There is no 
>> trouble at the target end, wireshark was
>> still printing SCSI packets. Would attaching the wireshark trace from target 
>> end would be helpful to identify what is
>> the problem?
>>
>> I see a problem, too.  Maybe its the same one, or maybe another one.
>>
>> With no LUNs exported from the target, the latest initiator 
>> (fcoe-features.git as of 3/17/09
>> just after the FIP patches were integrated) I get this crash:
>>
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
>> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>> PGD 7bc33067 PUD 7bc64067 PMD 0
>> Oops: 0000 [#1] SMP
>> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
>> CPU 0
>> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
>> Pid: 4213, comm: fcoethread/0 Tainted: G        W  2.6.29-rc7-feat2 #1 
>> PowerEdge SC1425
>> RIP: 0010:[<ffffffffa003b733>]  [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc 
>> [libfc]
>> RSP: 0018:ffff88007249dd80  EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
>> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
>> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
>> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
>> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
>> FS:  0000000000000000(0000) GS:ffffffff808b2080(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task 
>> ffff8800724ea0c0)
>> Stack:
>>  ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
>>  ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
>>  ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
>> Call Trace:
>>  [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
>>  [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
>>  [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
>>  [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
>>  [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
>>  [<ffffffff8024b7ef>] kthread+0x49/0x76
>>  [<ffffffff8020c63a>] child_rip+0xa/0x20
>>  [<ffffffff8020c03c>] ? restore_args+0x0/0x30
>>  [<ffffffff8024b7a6>] ? kthread+0x0/0x76
>>  [<ffffffff8020c630>] ? child_rip+0x0/0x20
>> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04 41 b9 
>> 60 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
>> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c 74 49 31 
>> c0
>> RIP  [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>>  RSP <ffff88007249dd80>
>> CR2: 0000000000000088
>> ---[ end trace 0be88afb67690d86 ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>>
>>
>> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
>>
>>                         if (flags & FCP_SNS_LEN_VAL) {
>>                                 snsl = ntohl(rp_ex->fr_sns_len);
>>                                 if (snsl > SCSI_SENSE_BUFFERSIZE)
>>                                         snsl = SCSI_SENSE_BUFFERSIZE;
>> ==> blew up here since fsp->cmd is NULL.
>>                                 memcpy(fsp->cmd->sense_buffer,
>>                                        (char *)fc_rp_info + respl, snsl);
>>                         }
>>
>> For those that like assembly:
>>
>>     1ae3:       0f c8                   bswap  %eax
>>     1ae5:       83 f8 60                cmp    $0x60,%eax
>>     1ae8:       44 0f 46 c8             cmovbe %eax,%r9d
>>     1aec:       48 8b 43 40             mov    0x40(%rbx),%rax
>>     1af0:       44 89 c9                mov    %r9d,%ecx
>> ----- blew up here fetching pointer to memcpy destination.
>>     1af3:       48 8b b8 88 00 00 00    mov    0x88(%rax),%rdi
>>     1afa:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
>>
>> The trace shows a check condition response to a Test Unit Ready.
>> I haven't looked further.  I don't know why fsp->cmd is NULL.
>>
>>      Joe
> 
> In case this helps, here's a trace.  There were some timeouts so maybe that
> leads to the fsp->cmd being NULL for that final test-unit-ready.
> 
> ./tshark -n -r ~/ws/traces/jre.cap
> Running as user "root" and group "root". This could be dangerous.
> 
>                    S_ID        D_ID         OX_ID  RX_ID
>   1   0.000000     00.00.00 -> ff.ff.fe     0x67f 0xffff FC ELS FLOGI
>   2   1.999918     00.00.00 -> ff.ff.fe     0x680 0xffff FC ELS FLOGI
>   3   3.999852     00.00.00 -> ff.ff.fe     0x681 0xffff FC ELS FLOGI
>   4   5.999783     00.00.00 -> ff.ff.fe     0x682 0xffff FC ELS FLOGI
>   5   7.999716     00.00.00 -> ff.ff.fe     0x683 0xffff FC ELS FLOGI
>   6   9.999650     00.00.00 -> ff.ff.fe     0x684 0xffff FC ELS FLOGI
>   7  11.999582     00.00.00 -> ff.ff.fe     0x685 0xffff FC ELS FLOGI
>   8  13.999516     00.00.00 -> ff.ff.fe     0x686 0xffff FC ELS FLOGI
>   9  14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP Solicitation
>  10  15.999447     00.00.00 -> ff.ff.fe     0x687 0xffff FC ELS FLOGI
>  11  15.999761     ff.ff.fe -> 00.00.00     0x687 0x2 FC ELS ACC (FLOGI)
>  12  16.003827     01.01.01 -> 01.01.02     0x3 0x688 FC ELS PLOGI
>  13  16.003845     01.01.01 -> 01.01.02     0x4 0x689 FC ELS PLOGI
>  14  16.003857     01.01.01 -> ff.ff.fc     0x5 0x68a dNS GPN_FT
> 
> Why are we doing discovery in a point-to-point setup?  We should skip that.
> Note that this is from the initiator.  The old code in the target correctly
> doesn't do discovery.  This may be harmless, though.
> 
>  15  16.010592     01.01.02 -> 01.01.01     0x3 0x688 FC ELS ACC (PLOGI)
>  16  16.010617     01.01.02 -> 01.01.01     0x4 0x689 FC ELS ACC (PLOGI)
>  17  16.010813     01.01.01 -> 01.01.02     0x6 0xffff FC ELS PRLI
>  18  16.011348     01.01.02 -> 01.01.01     0x6 0x68b FC ELS ACC (PRLI)
>  19  16.011495     01.01.01 -> 01.01.02     0x7 0xffff FC ELS RTV
>  20  16.011606     01.01.02 -> 01.01.01     0x7 0x68c FC ELS LS_RJT (RTV)
>  21  16.013032     01.01.01 -> 01.01.02     0x8 0xffff FCP SCSI: Inquiry LUN: 
> 0x00
>  22  16.021367     01.01.02 -> 01.01.01     0x8 0x68d FCP SCSI: Response LUN: 
> 0x00 (Inquiry) (Check Condition) LUN:0x00
>  23  16.028854     01.01.01 -> 01.01.02     0x9 0xffff FC ELS REC
> 
> REC is asking about OX_ID 8.  Not sure why we sent an REC here,
> the response should be the end of that exchange.
> 
> We reject the incoming REC.  Why?  It says 'command not supported'
> but should say invalid OX_ID or something indicating the XID is complete.
> 
>  24  16.028894     01.01.02 -> 01.01.01     0x9 0x68e FC ELS LS_RJT (REC)
>  25  17.999900     01.01.01 -> ff.ff.fc     0x5 0xffff BLS ABTS
> 
> discovery timed out.
> 
>  26  22.256120     01.01.01 -> 01.01.02     0x8 0x68d BLS ABTS
> 
> The above ABTS seems like a bug ... we should've seen the response above.  
> Why was the REC sent?
> 10 seconds later, the Test Unit Ready goes out and we die after the response. 
>  This seems pretty
> normal and unrelated to the previous stuff.
> 
>  27  32.256311     01.01.01 -> 01.01.02     0xa 0xffff FCP [FCP LU_RESET] 
> SCSI: Test Unit Ready LUN: 0x00


I did not grep this output.

Is this is a lu reset TMF or a TUR?

>  28  32.264601     01.01.02 -> 01.01.01     0xa 0x68f FCP SCSI: Response LUN: 
> 0x00 (Test Unit Ready) (Check Condition)
> LUN:0x00
> 

if it is a lu reset TMF then the fsp would not have a scsi command set. 
Then if we did get a check condition on a TMF then we would hit that 
null ptr since we do not expect to ever get a check condition with sense 
for a tmf.


It would be weird to be sending a TUR when we have no devices. I do not 
think scsi_scan.c will do this. Well, it would be weird to be sedning a 
lu reset when we have no device too, so something is screwy.

I think we can get to the lu reset case though if one of the scsi_scan.c 
IOs timesout. It looks like a device might be on the host __devices list 
  (scsi_alloc_sdev will call scsi_sysfs_device_initialize and that adds 
it to the __devices list a little early (it adds it before we scan and 
not when the device is actually known to be there). So when 
scsi_eh_bus_device_reset runs we would try to reset it even though it is 
not setup/known.


> I don't see how fsp->cmd can be NULL in this case.
> Maybe someone with more recent familiarity with fc_fcp.c will see the problem.
> 
>       Cheers,
>       Joe
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-fcoe.org/mailman/listinfo/devel

_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Reply via email to