Joe Eykholt wrote:
> Sivaram Kannan wrote:
>> Hi all,
>>
>> I have created a patch for Open-Fcoe-Target which could compile against the
>> latest kernel test
> branch(linux-2.6.29-rc8). The same patch will also compile against the latest
> stable version 2.6.28. I exposed the
> target and when I triggered the initiator, I could see sequence of SCSI
> packets at the wireshark. I did not understand
> the messages completely, I am decoding it. One problem that I noticed during
> the testing is with the initiator. For
> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I give the
> command "echo "ethX" >
> /sys/module/fcoe/parameters/create", I am getting a kernel opps. There is no
> trouble at the target end, wireshark was
> still printing SCSI packets. Would attaching the wireshark trace from target
> end would be helpful to identify what is
> the problem?
>
> I see a problem, too. Maybe its the same one, or maybe another one.
>
> With no LUNs exported from the target, the latest initiator
> (fcoe-features.git as of 3/17/09
> just after the FIP patches were integrated) I get this crash:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
> PGD 7bc33067 PUD 7bc64067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
> CPU 0
> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
> Pid: 4213, comm: fcoethread/0 Tainted: G W 2.6.29-rc7-feat2 #1
> PowerEdge SC1425
> RIP: 0010:[<ffffffffa003b733>] [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc
> [libfc]
> RSP: 0018:ffff88007249dd80 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffffffff808b2080(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task
> ffff8800724ea0c0)
> Stack:
> ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
> ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
> ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
> Call Trace:
> [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
> [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
> [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
> [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
> [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
> [<ffffffff8024b7ef>] kthread+0x49/0x76
> [<ffffffff8020c63a>] child_rip+0xa/0x20
> [<ffffffff8020c03c>] ? restore_args+0x0/0x30
> [<ffffffff8024b7a6>] ? kthread+0x0/0x76
> [<ffffffff8020c630>] ? child_rip+0x0/0x20
> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04 41 b9 60
> 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c 74 49 31
> c0
> RIP [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
> RSP <ffff88007249dd80>
> CR2: 0000000000000088
> ---[ end trace 0be88afb67690d86 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
>
>
> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
>
> if (flags & FCP_SNS_LEN_VAL) {
> snsl = ntohl(rp_ex->fr_sns_len);
> if (snsl > SCSI_SENSE_BUFFERSIZE)
> snsl = SCSI_SENSE_BUFFERSIZE;
> ==> blew up here since fsp->cmd is NULL.
> memcpy(fsp->cmd->sense_buffer,
> (char *)fc_rp_info + respl, snsl);
> }
>
> For those that like assembly:
>
> 1ae3: 0f c8 bswap %eax
> 1ae5: 83 f8 60 cmp $0x60,%eax
> 1ae8: 44 0f 46 c8 cmovbe %eax,%r9d
> 1aec: 48 8b 43 40 mov 0x40(%rbx),%rax
> 1af0: 44 89 c9 mov %r9d,%ecx
> ----- blew up here fetching pointer to memcpy destination.
> 1af3: 48 8b b8 88 00 00 00 mov 0x88(%rax),%rdi
> 1afa: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
>
> The trace shows a check condition response to a Test Unit Ready.
> I haven't looked further. I don't know why fsp->cmd is NULL.
>
> Joe
In case this helps, here's a trace. There were some timeouts so maybe that
leads to the fsp->cmd being NULL for that final test-unit-ready.
./tshark -n -r ~/ws/traces/jre.cap
Running as user "root" and group "root". This could be dangerous.
S_ID D_ID OX_ID RX_ID
1 0.000000 00.00.00 -> ff.ff.fe 0x67f 0xffff FC ELS FLOGI
2 1.999918 00.00.00 -> ff.ff.fe 0x680 0xffff FC ELS FLOGI
3 3.999852 00.00.00 -> ff.ff.fe 0x681 0xffff FC ELS FLOGI
4 5.999783 00.00.00 -> ff.ff.fe 0x682 0xffff FC ELS FLOGI
5 7.999716 00.00.00 -> ff.ff.fe 0x683 0xffff FC ELS FLOGI
6 9.999650 00.00.00 -> ff.ff.fe 0x684 0xffff FC ELS FLOGI
7 11.999582 00.00.00 -> ff.ff.fe 0x685 0xffff FC ELS FLOGI
8 13.999516 00.00.00 -> ff.ff.fe 0x686 0xffff FC ELS FLOGI
9 14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP Solicitation
10 15.999447 00.00.00 -> ff.ff.fe 0x687 0xffff FC ELS FLOGI
11 15.999761 ff.ff.fe -> 00.00.00 0x687 0x2 FC ELS ACC (FLOGI)
12 16.003827 01.01.01 -> 01.01.02 0x3 0x688 FC ELS PLOGI
13 16.003845 01.01.01 -> 01.01.02 0x4 0x689 FC ELS PLOGI
14 16.003857 01.01.01 -> ff.ff.fc 0x5 0x68a dNS GPN_FT
Why are we doing discovery in a point-to-point setup? We should skip that.
Note that this is from the initiator. The old code in the target correctly
doesn't do discovery. This may be harmless, though.
15 16.010592 01.01.02 -> 01.01.01 0x3 0x688 FC ELS ACC (PLOGI)
16 16.010617 01.01.02 -> 01.01.01 0x4 0x689 FC ELS ACC (PLOGI)
17 16.010813 01.01.01 -> 01.01.02 0x6 0xffff FC ELS PRLI
18 16.011348 01.01.02 -> 01.01.01 0x6 0x68b FC ELS ACC (PRLI)
19 16.011495 01.01.01 -> 01.01.02 0x7 0xffff FC ELS RTV
20 16.011606 01.01.02 -> 01.01.01 0x7 0x68c FC ELS LS_RJT (RTV)
21 16.013032 01.01.01 -> 01.01.02 0x8 0xffff FCP SCSI: Inquiry LUN:
0x00
22 16.021367 01.01.02 -> 01.01.01 0x8 0x68d FCP SCSI: Response LUN:
0x00 (Inquiry) (Check Condition) LUN:0x00
23 16.028854 01.01.01 -> 01.01.02 0x9 0xffff FC ELS REC
REC is asking about OX_ID 8. Not sure why we sent an REC here,
the response should be the end of that exchange.
We reject the incoming REC. Why? It says 'command not supported'
but should say invalid OX_ID or something indicating the XID is complete.
24 16.028894 01.01.02 -> 01.01.01 0x9 0x68e FC ELS LS_RJT (REC)
25 17.999900 01.01.01 -> ff.ff.fc 0x5 0xffff BLS ABTS
discovery timed out.
26 22.256120 01.01.01 -> 01.01.02 0x8 0x68d BLS ABTS
The above ABTS seems like a bug ... we should've seen the response above. Why
was the REC sent?
10 seconds later, the Test Unit Ready goes out and we die after the response.
This seems pretty
normal and unrelated to the previous stuff.
27 32.256311 01.01.01 -> 01.01.02 0xa 0xffff FCP [FCP LU_RESET] SCSI:
Test Unit Ready LUN: 0x00
28 32.264601 01.01.02 -> 01.01.01 0xa 0x68f FCP SCSI: Response LUN:
0x00 (Test Unit Ready) (Check Condition)
LUN:0x00
I don't see how fsp->cmd can be NULL in this case.
Maybe someone with more recent familiarity with fc_fcp.c will see the problem.
Cheers,
Joe
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel