Joe Eykholt wrote:
> Joe Eykholt wrote:
>> Sivaram Kannan wrote:
>>> Hi all,
>>>
>>> I have created a patch for Open-Fcoe-Target which could compile against the
>>> latest kernel test
>> branch(linux-2.6.29-rc8). The same patch will also compile against the
>> latest stable version 2.6.28. I exposed the
>> target and when I triggered the initiator, I could see sequence of SCSI
>> packets at the wireshark. I did not understand
>> the messages completely, I am decoding it. One problem that I noticed during
>> the testing is with the initiator. For
>> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I give the
>> command "echo "ethX" >
>> /sys/module/fcoe/parameters/create", I am getting a kernel opps. There is no
>> trouble at the target end, wireshark was
>> still printing SCSI packets. Would attaching the wireshark trace from target
>> end would be helpful to identify what is
>> the problem?
>>
>> I see a problem, too. Maybe its the same one, or maybe another one.
>>
>> With no LUNs exported from the target, the latest initiator
>> (fcoe-features.git as of 3/17/09
>> just after the FIP patches were integrated) I get this crash:
>>
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
>> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>> PGD 7bc33067 PUD 7bc64067 PMD 0
>> Oops: 0000 [#1] SMP
>> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
>> CPU 0
>> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
>> Pid: 4213, comm: fcoethread/0 Tainted: G W 2.6.29-rc7-feat2 #1
>> PowerEdge SC1425
>> RIP: 0010:[<ffffffffa003b733>] [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc
>> [libfc]
>> RSP: 0018:ffff88007249dd80 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
>> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
>> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
>> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
>> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
>> FS: 0000000000000000(0000) GS:ffffffff808b2080(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task
>> ffff8800724ea0c0)
>> Stack:
>> ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
>> ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
>> ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
>> Call Trace:
>> [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
>> [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
>> [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
>> [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
>> [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
>> [<ffffffff8024b7ef>] kthread+0x49/0x76
>> [<ffffffff8020c63a>] child_rip+0xa/0x20
>> [<ffffffff8020c03c>] ? restore_args+0x0/0x30
>> [<ffffffff8024b7a6>] ? kthread+0x0/0x76
>> [<ffffffff8020c630>] ? child_rip+0x0/0x20
>> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04 41 b9
>> 60 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
>> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c 74 49 31
>> c0
>> RIP [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>> RSP <ffff88007249dd80>
>> CR2: 0000000000000088
>> ---[ end trace 0be88afb67690d86 ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>>
>>
>> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
>>
>> if (flags & FCP_SNS_LEN_VAL) {
>> snsl = ntohl(rp_ex->fr_sns_len);
>> if (snsl > SCSI_SENSE_BUFFERSIZE)
>> snsl = SCSI_SENSE_BUFFERSIZE;
>> ==> blew up here since fsp->cmd is NULL.
>> memcpy(fsp->cmd->sense_buffer,
>> (char *)fc_rp_info + respl, snsl);
>> }
>>
>> For those that like assembly:
>>
>> 1ae3: 0f c8 bswap %eax
>> 1ae5: 83 f8 60 cmp $0x60,%eax
>> 1ae8: 44 0f 46 c8 cmovbe %eax,%r9d
>> 1aec: 48 8b 43 40 mov 0x40(%rbx),%rax
>> 1af0: 44 89 c9 mov %r9d,%ecx
>> ----- blew up here fetching pointer to memcpy destination.
>> 1af3: 48 8b b8 88 00 00 00 mov 0x88(%rax),%rdi
>> 1afa: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
>>
>> The trace shows a check condition response to a Test Unit Ready.
>> I haven't looked further. I don't know why fsp->cmd is NULL.
>>
>> Joe
>
> In case this helps, here's a trace. There were some timeouts so maybe that
> leads to the fsp->cmd being NULL for that final test-unit-ready.
>
> ./tshark -n -r ~/ws/traces/jre.cap
> Running as user "root" and group "root". This could be dangerous.
>
> S_ID D_ID OX_ID RX_ID
> 1 0.000000 00.00.00 -> ff.ff.fe 0x67f 0xffff FC ELS FLOGI
> 2 1.999918 00.00.00 -> ff.ff.fe 0x680 0xffff FC ELS FLOGI
> 3 3.999852 00.00.00 -> ff.ff.fe 0x681 0xffff FC ELS FLOGI
> 4 5.999783 00.00.00 -> ff.ff.fe 0x682 0xffff FC ELS FLOGI
> 5 7.999716 00.00.00 -> ff.ff.fe 0x683 0xffff FC ELS FLOGI
> 6 9.999650 00.00.00 -> ff.ff.fe 0x684 0xffff FC ELS FLOGI
> 7 11.999582 00.00.00 -> ff.ff.fe 0x685 0xffff FC ELS FLOGI
> 8 13.999516 00.00.00 -> ff.ff.fe 0x686 0xffff FC ELS FLOGI
> 9 14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP Solicitation
> 10 15.999447 00.00.00 -> ff.ff.fe 0x687 0xffff FC ELS FLOGI
> 11 15.999761 ff.ff.fe -> 00.00.00 0x687 0x2 FC ELS ACC (FLOGI)
> 12 16.003827 01.01.01 -> 01.01.02 0x3 0x688 FC ELS PLOGI
> 13 16.003845 01.01.01 -> 01.01.02 0x4 0x689 FC ELS PLOGI
> 14 16.003857 01.01.01 -> ff.ff.fc 0x5 0x68a dNS GPN_FT
>
> Why are we doing discovery in a point-to-point setup? We should skip that.
> Note that this is from the initiator. The old code in the target correctly
> doesn't do discovery. This may be harmless, though.
>
> 15 16.010592 01.01.02 -> 01.01.01 0x3 0x688 FC ELS ACC (PLOGI)
> 16 16.010617 01.01.02 -> 01.01.01 0x4 0x689 FC ELS ACC (PLOGI)
> 17 16.010813 01.01.01 -> 01.01.02 0x6 0xffff FC ELS PRLI
> 18 16.011348 01.01.02 -> 01.01.01 0x6 0x68b FC ELS ACC (PRLI)
> 19 16.011495 01.01.01 -> 01.01.02 0x7 0xffff FC ELS RTV
> 20 16.011606 01.01.02 -> 01.01.01 0x7 0x68c FC ELS LS_RJT (RTV)
> 21 16.013032 01.01.01 -> 01.01.02 0x8 0xffff FCP SCSI: Inquiry LUN:
> 0x00
> 22 16.021367 01.01.02 -> 01.01.01 0x8 0x68d FCP SCSI: Response LUN:
> 0x00 (Inquiry) (Check Condition) LUN:0x00
> 23 16.028854 01.01.01 -> 01.01.02 0x9 0xffff FC ELS REC
>
> REC is asking about OX_ID 8. Not sure why we sent an REC here,
> the response should be the end of that exchange.
>
> We reject the incoming REC. Why? It says 'command not supported'
> but should say invalid OX_ID or something indicating the XID is complete.
>
> 24 16.028894 01.01.02 -> 01.01.01 0x9 0x68e FC ELS LS_RJT (REC)
> 25 17.999900 01.01.01 -> ff.ff.fc 0x5 0xffff BLS ABTS
>
> discovery timed out.
>
> 26 22.256120 01.01.01 -> 01.01.02 0x8 0x68d BLS ABTS
>
> The above ABTS seems like a bug ... we should've seen the response above.
> Why was the REC sent?
> 10 seconds later, the Test Unit Ready goes out and we die after the response.
> This seems pretty
> normal and unrelated to the previous stuff.
>
> 27 32.256311 01.01.01 -> 01.01.02 0xa 0xffff FCP [FCP LU_RESET]
> SCSI: Test Unit Ready LUN: 0x00
I did not grep this output.
Is this is a lu reset TMF or a TUR?
> 28 32.264601 01.01.02 -> 01.01.01 0xa 0x68f FCP SCSI: Response LUN:
> 0x00 (Test Unit Ready) (Check Condition)
> LUN:0x00
>
if it is a lu reset TMF then the fsp would not have a scsi command set.
Then if we did get a check condition on a TMF then we would hit that
null ptr since we do not expect to ever get a check condition with sense
for a tmf.
It would be weird to be sending a TUR when we have no devices. I do not
think scsi_scan.c will do this. Well, it would be weird to be sedning a
lu reset when we have no device too, so something is screwy.
I think we can get to the lu reset case though if one of the scsi_scan.c
IOs timesout. It looks like a device might be on the host __devices list
(scsi_alloc_sdev will call scsi_sysfs_device_initialize and that adds
it to the __devices list a little early (it adds it before we scan and
not when the device is actually known to be there). So when
scsi_eh_bus_device_reset runs we would try to reset it even though it is
not setup/known.
> I don't see how fsp->cmd can be NULL in this case.
> Maybe someone with more recent familiarity with fc_fcp.c will see the problem.
>
> Cheers,
> Joe
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-fcoe.org/mailman/listinfo/devel
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel