Mike Christie wrote:
> Joe Eykholt wrote:
>> Joe Eykholt wrote:
>>> Sivaram Kannan wrote:
>>>> Hi all,
>>>>
>>>> I have created a patch for Open-Fcoe-Target which could compile
> against the latest kernel test
>>> branch(linux-2.6.29-rc8). The same patch will also compile against
> the latest stable version 2.6.28. I exposed the
>>> target and when I triggered the initiator, I could see sequence of
> SCSI packets at the wireshark. I did not understand
>>> the messages completely, I am decoding it. One problem that I noticed
> during the testing is with the initiator. For
>>> initiator I compiled the latest 2.6.29-rc8 in Fedora 10, and when I
> give the command "echo "ethX" >
>>> /sys/module/fcoe/parameters/create", I am getting a kernel opps.
> There is no trouble at the target end, wireshark was
>>> still printing SCSI packets. Would attaching the wireshark trace from
> target end would be helpful to identify what is
>>> the problem?
>>>
>>> I see a problem, too. Maybe its the same one, or maybe another one.
>>>
>>> With no LUNs exported from the target, the latest initiator
> (fcoe-features.git as of 3/17/09
>>> just after the FIP patches were integrated) I get this crash:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
>>> IP: [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>>> PGD 7bc33067 PUD 7bc64067 PMD 0
>>> Oops: 0000 [#1] SMP
>>> last sysfs file: /sys/class/fc_remote_ports/rport-6:0-0/roles
>>> CPU 0
>>> Modules linked in: fcoe libfcoe libfc ata_generic ata_piix e1000
>>> Pid: 4213, comm: fcoethread/0 Tainted: G W 2.6.29-rc7-feat2
> #1 PowerEdge SC1425
>>> RIP: 0010:[<ffffffffa003b733>] [<ffffffffa003b733>]
> fc_fcp_resp+0xe7/0x1fc [libfc]
>>> RSP: 0018:ffff88007249dd80 EFLAGS: 00010246
>>> RAX: 0000000000000000 RBX: ffff88007dd7c060 RCX: 0000000000000060
>>> RDX: 0000000000000002 RSI: ffff880075c5d316 RDI: ffff88007242b748
>>> RBP: ffff88007249ddc0 R08: 000000000000006c R09: 0000000000000060
>>> R10: 0000000000000000 R11: 0000000000000046 R12: ffff880075c5d2fe
>>> R13: ffff88007dd7c002 R14: 0000000000000084 R15: 0000000000000000
>>> FS: 0000000000000000(0000) GS:ffffffff808b2080(0000)
> knlGS:0000000000000000
>>> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>> CR2: 0000000000000088 CR3: 000000007d1c7000 CR4: 00000000000006e0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process fcoethread/0 (pid: 4213, threadinfo ffff88007249c000, task
> ffff8800724ea0c0)
>>> Stack:
>>> ffff88007dd7c070 ffff88007dc1ce00 ffff88007dc1ce00 ffff88007dd7c060
>>> ffff88007dc1ce00 ffff88007dd7c070 ffff88007dd802c8 ffffffffa003b848
>>> ffff88007249ddf0 ffffffffa003b8d6 0000000000980000 ffff88007dd801b8
>>> Call Trace:
>>> [<ffffffffa003b848>] ? fc_tm_done+0x0/0xc6 [libfc]
>>> [<ffffffffa003b8d6>] fc_tm_done+0x8e/0xc6 [libfc]
>>> [<ffffffffa00344f7>] fc_exch_recv+0x89a/0xdcf [libfc]
>>> [<ffffffffa004ea92>] fcoe_percpu_receive_thread+0x382/0x3a0 [fcoe]
>>> [<ffffffffa004e710>] ? fcoe_percpu_receive_thread+0x0/0x3a0 [fcoe]
>>> [<ffffffff8024b7ef>] kthread+0x49/0x76
>>> [<ffffffff8020c63a>] child_rip+0xa/0x20
>>> [<ffffffff8020c03c>] ? restore_args+0x0/0x30
>>> [<ffffffff8024b7a6>] ? kthread+0x0/0x76
>>> [<ffffffff8020c630>] ? child_rip+0x0/0x20
>>> Code: 01 00 00 e8 c0 14 1f e0 e9 24 01 00 00 f6 c2 02 74 24 8b 40 04
> 41 b9 60 00 00 00 0f c8 83 f8 60 44 0f 46 c8 48 8b
>>> 43 40 44 89 c9 <48> 8b b8 88 00 00 00 f3 a4 eb 03 45 31 c9 f6 c2 0c
> 74 49 31 c0
>>> RIP [<ffffffffa003b733>] fc_fcp_resp+0xe7/0x1fc [libfc]
>>> RSP <ffff88007249dd80>
>>> CR2: 0000000000000088
>>> ---[ end trace 0be88afb67690d86 ]---
>>> Kernel panic - not syncing: Fatal exception in interrupt
>>>
>>>
>>> The fsp has a NULL cmd pointer in fc_fcp_resp() in this code:
>>>
>>> if (flags & FCP_SNS_LEN_VAL) {
>>> snsl = ntohl(rp_ex->fr_sns_len);
>>> if (snsl > SCSI_SENSE_BUFFERSIZE)
>>> snsl = SCSI_SENSE_BUFFERSIZE;
>>> ==> blew up here since fsp->cmd is NULL.
>>> memcpy(fsp->cmd->sense_buffer,
>>> (char *)fc_rp_info + respl, snsl);
>>> }
>>>
>>> For those that like assembly:
>>>
>>> 1ae3: 0f c8 bswap %eax
>>> 1ae5: 83 f8 60 cmp $0x60,%eax
>>> 1ae8: 44 0f 46 c8 cmovbe %eax,%r9d
>>> 1aec: 48 8b 43 40 mov 0x40(%rbx),%rax
>>> 1af0: 44 89 c9 mov %r9d,%ecx
>>> ----- blew up here fetching pointer to memcpy destination.
>>> 1af3: 48 8b b8 88 00 00 00 mov 0x88(%rax),%rdi
>>> 1afa: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
>>>
>>> The trace shows a check condition response to a Test Unit Ready.
>>> I haven't looked further. I don't know why fsp->cmd is NULL.
>>>
>>> Joe
>>
>> In case this helps, here's a trace. There were some timeouts so maybe
> that
>> leads to the fsp->cmd being NULL for that final test-unit-ready.
>>
>> ./tshark -n -r ~/ws/traces/jre.cap
>> Running as user "root" and group "root". This could be dangerous.
>>
>> S_ID D_ID OX_ID RX_ID
>> 1 0.000000 00.00.00 -> ff.ff.fe 0x67f 0xffff FC ELS FLOGI
>> 2 1.999918 00.00.00 -> ff.ff.fe 0x680 0xffff FC ELS FLOGI
>> 3 3.999852 00.00.00 -> ff.ff.fe 0x681 0xffff FC ELS FLOGI
>> 4 5.999783 00.00.00 -> ff.ff.fe 0x682 0xffff FC ELS FLOGI
>> 5 7.999716 00.00.00 -> ff.ff.fe 0x683 0xffff FC ELS FLOGI
>> 6 9.999650 00.00.00 -> ff.ff.fe 0x684 0xffff FC ELS FLOGI
>> 7 11.999582 00.00.00 -> ff.ff.fe 0x685 0xffff FC ELS FLOGI
>> 8 13.999516 00.00.00 -> ff.ff.fe 0x686 0xffff FC ELS FLOGI
>> 9 14.753934 00:14:22:72:67:eb -> 01:10:18:01:00:02 0x0 0x0 FIP
> Solicitation
>> 10 15.999447 00.00.00 -> ff.ff.fe 0x687 0xffff FC ELS FLOGI
>> 11 15.999761 ff.ff.fe -> 00.00.00 0x687 0x2 FC ELS ACC (FLOGI)
>> 12 16.003827 01.01.01 -> 01.01.02 0x3 0x688 FC ELS PLOGI
>> 13 16.003845 01.01.01 -> 01.01.02 0x4 0x689 FC ELS PLOGI
>> 14 16.003857 01.01.01 -> ff.ff.fc 0x5 0x68a dNS GPN_FT
>>
>> Why are we doing discovery in a point-to-point setup? We should skip
> that.
>> Note that this is from the initiator. The old code in the target
> correctly
>> doesn't do discovery. This may be harmless, though.
>>
>> 15 16.010592 01.01.02 -> 01.01.01 0x3 0x688 FC ELS ACC (PLOGI)
>> 16 16.010617 01.01.02 -> 01.01.01 0x4 0x689 FC ELS ACC (PLOGI)
>> 17 16.010813 01.01.01 -> 01.01.02 0x6 0xffff FC ELS PRLI
>> 18 16.011348 01.01.02 -> 01.01.01 0x6 0x68b FC ELS ACC (PRLI)
>> 19 16.011495 01.01.01 -> 01.01.02 0x7 0xffff FC ELS RTV
>> 20 16.011606 01.01.02 -> 01.01.01 0x7 0x68c FC ELS LS_RJT (RTV)
>> 21 16.013032 01.01.01 -> 01.01.02 0x8 0xffff FCP SCSI:
> Inquiry LUN: 0x00
>> 22 16.021367 01.01.02 -> 01.01.01 0x8 0x68d FCP SCSI:
> Response LUN: 0x00 (Inquiry) (Check Condition) LUN:0x00
>> 23 16.028854 01.01.01 -> 01.01.02 0x9 0xffff FC ELS REC
>>
>> REC is asking about OX_ID 8. Not sure why we sent an REC here,
>> the response should be the end of that exchange.
>>
>> We reject the incoming REC. Why? It says 'command not supported'
>> but should say invalid OX_ID or something indicating the XID is complete.
>>
>> 24 16.028894 01.01.02 -> 01.01.01 0x9 0x68e FC ELS LS_RJT (REC)
>> 25 17.999900 01.01.01 -> ff.ff.fc 0x5 0xffff BLS ABTS
>>
>> discovery timed out.
>>
>> 26 22.256120 01.01.01 -> 01.01.02 0x8 0x68d BLS ABTS
>>
>> The above ABTS seems like a bug ... we should've seen the response
> above. Why was the REC sent?
>> 10 seconds later, the Test Unit Ready goes out and we die after the
> response. This seems pretty
>> normal and unrelated to the previous stuff.
>>
>> 27 32.256311 01.01.01 -> 01.01.02 0xa 0xffff FCP [FCP
> LU_RESET] SCSI: Test Unit Ready LUN: 0x00
>
> I did not grep this output.
You mean 'grok'? :-) This is tshark/wireshark output.
I see what you mean, it's ambiguous.
I guess it's really a LU reset.
The task management flags in the FCP header are 0x10, LU reset.
The opcode was 0 which I guess is irrelevant, but tshark interprets
it as normal: test unit ready. So does the old openfc target code,
it turns out. It's a very bare bones implementation which doesn't
handle this correctly.
> Is this is a lu reset TMF or a TUR?
>
>> 28 32.264601 01.01.02 -> 01.01.01 0xa 0x68f FCP SCSI:
> Response LUN: 0x00 (Test Unit Ready) (Check Condition)
>> LUN:0x00
>>
>
> if it is a lu reset TMF then the fsp would not have a scsi command set.
> Then if we did get a check condition on a TMF then we would hit that
> null ptr since we do not expect to ever get a check condition with sense
> for a tmf.
Three bugs revealed here:
- fcp crashes from the sense.
- tshark interprets irrelevant opcode in both request and reply.
- target code sends sense on LU reset.
I'm only planning to fix the first of these for now. I'll generate a patch.
> It would be weird to be sending a TUR when we have no devices. I do not
> think scsi_scan.c will do this. Well, it would be weird to be sedning a
> lu reset when we have no device too, so something is screwy.
>
> I think we can get to the lu reset case though if one of the scsi_scan.c
> IOs timesout. It looks like a device might be on the host __devices list
> (scsi_alloc_sdev will call scsi_sysfs_device_initialize and that adds
> it to the __devices list a little early (it adds it before we scan and
> not when the device is actually known to be there). So when
> scsi_eh_bus_device_reset runs we would try to reset it even though it is
> not setup/known.
Thanks a lot, Mike. I figured you'd be able to shed light on this.
Joe
>
>
>> I don't see how fsp->cmd can be NULL in this case.
>> Maybe someone with more recent familiarity with fc_fcp.c will see the
> problem.
>>
>> Cheers,
>> Joe
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> http://www.open-fcoe.org/mailman/listinfo/devel
>
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel