Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Joe Eykholt Tue, 17 Mar 2009 16:04:54 -0700

Joe Eykholt wrote:
> Sivaram Kannan wrote:
>> Hi all,
>>
>> I have created a patch for Open-Fcoe-Target which could compile against the 
>> latest kernel test
> branch(linux-2.6.29-rc8). The same patch will also compile against the latest 
> stable version 2.6.28. I exposed the
> target and when I triggered the initiator, I could see sequence of SCSI 
> packets at the wireshark. I did not understand
> the messages completely, I am decoding it. One problem that I noticed during 
> the testing is with the initiator. For
> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I give the 
> command "echo "ethX" >
> /sys/module/fcoe/parameters/create", I am getting a kernel opps. There is no 
> trouble at the target end, wireshark was
> still printing SCSI packets. Would attaching the wireshark trace from target 
> end would be helpful to identify what is
> the problem?
> 
> I see a problem, too.  Maybe its the same one, or maybe another one.
> 
> With no LUNs exported from the target, the latest initiator 
> (fcoe-features.git as of 3/17/09
> just after the FIP patches were integrated) I get this crash:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
> PGD 7bc33067 PUD 7bc64067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
> CPU 0
> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
> Pid: 4213, comm: fcoethread/0 Tainted: G        W  2.6.29-rc7-feat2 #1 
> PowerEdge SC1425
> RIP: 0010:[<ffffffffa003b733>]  [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc 
> [libfc]
> RSP: 0018:ffff88007249dd80  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffffffff808b2080(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task 
> ffff8800724ea0c0)
> Stack:
>  ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
>  ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
>  ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
> Call Trace:
>  [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
>  [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
>  [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
>  [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
>  [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
>  [<ffffffff8024b7ef>] kthread+0x49/0x76
>  [<ffffffff8020c63a>] child_rip+0xa/0x20
>  [<ffffffff8020c03c>] ? restore_args+0x0/0x30
>  [<ffffffff8024b7a6>] ? kthread+0x0/0x76
>  [<ffffffff8020c630>] ? child_rip+0x0/0x20
> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04 41 b9 60 
> 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c 74 49 31 
> c0
> RIP  [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>  RSP <ffff88007249dd80>
> CR2: 0000000000000088
> ---[ end trace 0be88afb67690d86 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
> 
>                         if (flags & FCP_SNS_LEN_VAL) {
>                                 snsl = ntohl(rp_ex->fr_sns_len);
>                                 if (snsl > SCSI_SENSE_BUFFERSIZE)
>                                         snsl = SCSI_SENSE_BUFFERSIZE;
> ==> blew up here since fsp->cmd is NULL.
>                                 memcpy(fsp->cmd->sense_buffer,
>                                        (char *)fc_rp_info + respl, snsl);
>                         }
> 
> For those that like assembly:
> 
>     1ae3:       0f c8                   bswap  %eax
>     1ae5:       83 f8 60                cmp    $0x60,%eax
>     1ae8:       44 0f 46 c8             cmovbe %eax,%r9d
>     1aec:       48 8b 43 40             mov    0x40(%rbx),%rax
>     1af0:       44 89 c9                mov    %r9d,%ecx
> ----- blew up here fetching pointer to memcpy destination.
>     1af3:       48 8b b8 88 00 00 00    mov    0x88(%rax),%rdi
>     1afa:       f3 a4                   rep movsb %ds:(%rsi),%es:(%rdi)
> 
> The trace shows a check condition response to a Test Unit Ready.
> I haven't looked further.  I don't know why fsp->cmd is NULL.
> 
>       Joe


In case this helps, here's a trace.  There were some timeouts so maybe that
leads to the fsp->cmd being NULL for that final test-unit-ready.

./tshark -n -r ~/ws/traces/jre.cap
Running as user "root" and group "root". This could be dangerous.

                   S_ID        D_ID         OX_ID  RX_ID
  1   0.000000     00.00.00 -> ff.ff.fe     0x67f 0xffff FC ELS FLOGI
  2   1.999918     00.00.00 -> ff.ff.fe     0x680 0xffff FC ELS FLOGI
  3   3.999852     00.00.00 -> ff.ff.fe     0x681 0xffff FC ELS FLOGI
  4   5.999783     00.00.00 -> ff.ff.fe     0x682 0xffff FC ELS FLOGI
  5   7.999716     00.00.00 -> ff.ff.fe     0x683 0xffff FC ELS FLOGI
  6   9.999650     00.00.00 -> ff.ff.fe     0x684 0xffff FC ELS FLOGI
  7  11.999582     00.00.00 -> ff.ff.fe     0x685 0xffff FC ELS FLOGI
  8  13.999516     00.00.00 -> ff.ff.fe     0x686 0xffff FC ELS FLOGI
  9  14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP Solicitation
 10  15.999447     00.00.00 -> ff.ff.fe     0x687 0xffff FC ELS FLOGI
 11  15.999761     ff.ff.fe -> 00.00.00     0x687 0x2 FC ELS ACC (FLOGI)
 12  16.003827     01.01.01 -> 01.01.02     0x3 0x688 FC ELS PLOGI
 13  16.003845     01.01.01 -> 01.01.02     0x4 0x689 FC ELS PLOGI
 14  16.003857     01.01.01 -> ff.ff.fc     0x5 0x68a dNS GPN_FT

Why are we doing discovery in a point-to-point setup?  We should skip that.
Note that this is from the initiator.  The old code in the target correctly
doesn't do discovery.  This may be harmless, though.

 15  16.010592     01.01.02 -> 01.01.01     0x3 0x688 FC ELS ACC (PLOGI)
 16  16.010617     01.01.02 -> 01.01.01     0x4 0x689 FC ELS ACC (PLOGI)
 17  16.010813     01.01.01 -> 01.01.02     0x6 0xffff FC ELS PRLI
 18  16.011348     01.01.02 -> 01.01.01     0x6 0x68b FC ELS ACC (PRLI)
 19  16.011495     01.01.01 -> 01.01.02     0x7 0xffff FC ELS RTV
 20  16.011606     01.01.02 -> 01.01.01     0x7 0x68c FC ELS LS_RJT (RTV)
 21  16.013032     01.01.01 -> 01.01.02     0x8 0xffff FCP SCSI: Inquiry LUN: 
0x00
 22  16.021367     01.01.02 -> 01.01.01     0x8 0x68d FCP SCSI: Response LUN: 
0x00 (Inquiry) (Check Condition) LUN:0x00
 23  16.028854     01.01.01 -> 01.01.02     0x9 0xffff FC ELS REC

REC is asking about OX_ID 8.  Not sure why we sent an REC here,
the response should be the end of that exchange.

We reject the incoming REC.  Why?  It says 'command not supported'
but should say invalid OX_ID or something indicating the XID is complete.

 24  16.028894     01.01.02 -> 01.01.01     0x9 0x68e FC ELS LS_RJT (REC)
 25  17.999900     01.01.01 -> ff.ff.fc     0x5 0xffff BLS ABTS

discovery timed out.

 26  22.256120     01.01.01 -> 01.01.02     0x8 0x68d BLS ABTS

The above ABTS seems like a bug ... we should've seen the response above.  Why 
was the REC sent?
10 seconds later, the Test Unit Ready goes out and we die after the response.  
This seems pretty
normal and unrelated to the previous stuff.

 27  32.256311     01.01.01 -> 01.01.02     0xa 0xffff FCP [FCP LU_RESET] SCSI: 
Test Unit Ready LUN: 0x00
 28  32.264601     01.01.02 -> 01.01.01     0xa 0x68f FCP SCSI: Response LUN: 
0x00 (Test Unit Ready) (Check Condition)
LUN:0x00

I don't see how fsp->cmd can be NULL in this case.
Maybe someone with more recent familiarity with fc_fcp.c will see the problem.

        Cheers,
        Joe
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Re: [Open-FCoE] [Patch] open-fcoe-target patch for latest kernel branch

Reply via email to