Hello,
While testing some new SAS hardware, I have encountered an issue that results
in an "Unable to handle kernel NULL pointer dereference" message from the
kernel. The stack trace taken from syslog output is attached.
The problem occurs after an external cable between two JBOD disk boxes is
disconnected, reconnected and then disconnected again. It does not seem to
occur when connecting and disconnecting a single disk box directly to the HBA.
To reproduce:
1. Boot with the hardware connected as pictured below. All 32 external disks
are found and no problems are noticed.
2. Disconnect cable B. The 16 disks and enclosure target from disk box 2 are
removed with no errors noticed. There are some 'failed to synchronize
cache' messages if the disks are not removed through /sys first, but the
error occurs either way.
3. Reconnect cable B. The OS gives no indication that anything has happened. I
have tried waiting for over 2 minutes after connecting the cable.
4. Disconnect cable B again and the attached messages are logged. A hard reset
is then required to recover.
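
For reference, removing the disks through /sys before step 2 can be sketched
roughly as follows. This is a minimal sketch, assuming the standard SCSI sysfs
"delete" attribute; the function name remove_disks, the SYSFS_ROOT override and
the disk names in the usage line are illustrative only and are not taken from
the system described here.

```shell
#!/bin/sh
# Sketch: detach SCSI disks through sysfs before pulling the cable.
# SYSFS_ROOT is a hypothetical override so the function can be dry-run
# against a scratch directory; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

remove_disks() {
    for disk in "$@"; do
        node="$SYSFS_ROOT/block/$disk/device/delete"
        if [ -w "$node" ]; then
            # Writing 1 asks the SCSI layer to remove the device.
            echo 1 > "$node"
            echo "removed $disk"
        else
            echo "skipped $disk (no writable $node)" >&2
        fi
    done
}
```

Usage would be along the lines of "remove_disks sdq sdr", run as root, with the
names of the disks that sit behind cable B.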
+---Host w/LSI3801E HBA------------+
| LSI1068E                         |
+-####-####------------------------+
  ||||  < Cable A
+-####--Disk box 1-----------------+
| ||||                             |
| LSISASx12A                       |
| |||| ||`== LSISASx12A < 8 HDDs   |
| |||| ||                          |
| |||| `==== LSISASx12A < 8 HDDs   |
+-####-----------------------------+
  ||||  < Cable B
+-####--Disk box 2-----------------+
| ||||                             |
| LSISASx12A                       |
| |||| ||`== LSISASx12A < 8 HDDs   |
| |||| ||                          |
| |||| `==== LSISASx12A < 8 HDDs   |
+-####-----------------------------+
For the attached error, the disk boxes were full of SATA disks and the system
was running the Debian backports.org 2.6.21-1-amd64 (2.6.21-4~bpo.1) kernel.
The problem also seems to exist with the Debian etch 2.6.18-4-amd64 kernel.
I am happy to try any kernel versions or configs that would be useful.
The diagram represents my current understanding of the expander setup in the
disk boxes but I could be mistaken. I can provide further details of the view
of the hardware from the host if it is of interest.
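
One way the host's view could be summarised is sketched below. It is a rough
sketch only, assuming the sysfs class directories registered by
scsi_transport_sas (sas_host, sas_expander, sas_port, sas_phy, sas_end_device)
are present on the running kernel; the function name sas_view and the
SYSFS_ROOT override are illustrative.

```shell
#!/bin/sh
# Sketch: count the SAS objects of each class that the host currently sees.
# SYSFS_ROOT is a hypothetical override for trying this against a copy of
# the sysfs tree; it defaults to the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

sas_view() {
    for class in sas_host sas_expander sas_port sas_phy sas_end_device; do
        dir="$SYSFS_ROOT/class/$class"
        if [ -d "$dir" ]; then
            # One directory entry per object of this class.
            count=$(ls "$dir" | wc -l | tr -d ' ')
            echo "$class: $count"
        else
            echo "$class: not present"
        fi
    done
}
```

Running sas_view before and after each cable change would show whether
expanders, ports and phys appear or disappear at each step.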
The server also has an on-board LSI1064 connected to 4 internal SAS HDDs:
$ cat /proc/mpt/summary
ioc0: LSISAS1068E, FwRev=01120000h, Ports=1, MaxQ=511, IRQ=19
ioc1: LSISAS1064, FwRev=01102800h, Ports=1, MaxQ=511, IRQ=58
I will continue to investigate and will report any findings, but any help in
resolving the issue would be greatly appreciated.
--
Alex Winawer, Unix Systems Programmer
Systems Development & Support, Oxford University Computing Services
Jul 6 09:46:05 just-read-the-instructions kernel: mptbase: ioc0:
LogInfo(0x30050000): Originator={IOP}, Code={Task Terminated}, SubCode(0x0000)
Jul 6 09:48:09 just-read-the-instructions kernel: Unable to handle kernel NULL
pointer dereference at 00000000000002c0 RIP:
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff8025a762>]
mutex_lock+0x0/0xb
Jul 6 09:48:09 just-read-the-instructions kernel: PGD 10ce07067 PUD 10ea3e067
PMD 0
Jul 6 09:48:09 just-read-the-instructions kernel: Oops: 0002 [1] SMP
Jul 6 09:48:09 just-read-the-instructions kernel: CPU 0
Jul 6 09:48:09 just-read-the-instructions kernel: Modules linked in: raid456
xor ipv6 iptable_mangle iptable_nat nf_nat xt_tcpudp nf_conntrack_ipv4 xt_state
nf_conntrack nfnetlink ipt_owner ipt_REJECT xt_limit ipt_LOG xt_hashlimit
ip6_tables ipt_addrtype iptable_filter ip_tables x_tables 8021q serio_raw
psmouse i2c_nforce2 shpchp i2c_core pci_hotplug pcspkr k8temp sg sr_mod cdrom
joydev evdev ext3 jbd mbcache dm_mirror dm_snapshot dm_mod raid1 md_mod
ide_generic ata_generic sata_nv libata sd_mod generic usb_storage usbhid hid
mptsas mptscsih mptbase scsi_transport_sas amd74xx e1000 scsi_mod forcedeth
ide_core ohci_hcd ehci_hcd thermal processor fan
Jul 6 09:48:09 just-read-the-instructions kernel: Pid: 14, comm: events/0 Not
tainted 2.6.21-1-amd64 #1
Jul 6 09:48:09 just-read-the-instructions kernel: RIP:
0010:[<ffffffff8025a762>] [<ffffffff8025a762>] mutex_lock+0x0/0xb
Jul 6 09:48:09 just-read-the-instructions kernel: RSP: 0018:ffff810120201c88
EFLAGS: 00010246
Jul 6 09:48:09 just-read-the-instructions kernel: RAX: 0000000000000000 RBX:
ffff81011b1f3000 RCX: 0000000000000000
Jul 6 09:48:09 just-read-the-instructions kernel: RDX: ffff81011a9784c0 RSI:
ffff81011b1f3000 RDI: 00000000000002c0
Jul 6 09:48:09 just-read-the-instructions kernel: RBP: 0000000000000004 R08:
000000000000000c R09: ffff81011b3392a0
Jul 6 09:48:09 just-read-the-instructions kernel: R10: 00000000fffffff4 R11:
ffff810120201ca8 R12: 0000000000000000
Jul 6 09:48:09 just-read-the-instructions kernel: R13: 00000000000002c0 R14:
0000000000000000 R15: 00000000000005b0
Jul 6 09:48:09 just-read-the-instructions kernel: FS: 00002b86508c56d0(0000)
GS:ffffffff804d9000(0000) knlGS:0000000000000000
Jul 6 09:48:09 just-read-the-instructions kernel: CS: 0010 DS: 0018 ES: 0018
CR0: 000000008005003b
Jul 6 09:48:09 just-read-the-instructions kernel: CR2: 00000000000002c0 CR3:
0000000101c9e000 CR4: 00000000000006e0
Jul 6 09:48:09 just-read-the-instructions kernel: Process events/0 (pid: 14,
threadinfo ffff810120200000, task ffff81011c0d2100)
Jul 6 09:48:09 just-read-the-instructions kernel: Stack: ffffffff880b8cca
ffff81011bcb29c0 ffff81011bc8a000 ffff81011ad99d80
Jul 6 09:48:09 just-read-the-instructions kernel: ffffffff880dc789
ffff81011bc8a5e8 ffff810120201cb8 ffff810120201cb8
Jul 6 09:48:09 just-read-the-instructions kernel: 0000000000000000
0000000000000000 ffff81011bc8a000 ffff81011ad99d80
Jul 6 09:48:09 just-read-the-instructions kernel: Call Trace:
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880b8cca>]
:scsi_transport_sas:sas_port_delete_phy+0x1a/0x5e
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880dc789>]
:mptsas:mptsas_setup_wide_ports+0x72/0x20d
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880dd097>]
:mptsas:mptsas_probe_expander_phys+0x3d0/0x427
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880c7265>]
:mptbase:mpt_timer_expired+0x0/0x24
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880dd969>]
:mptsas:__mptsas_discovery_work+0x16f/0x18a
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880dd984>]
:mptsas:mptsas_discovery_work+0x0/0x39
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff880dd9a8>]
:mptsas:mptsas_discovery_work+0x24/0x39
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff80246fa3>]
run_workqueue+0x8f/0x137
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff80243bcf>]
worker_thread+0x0/0x14a
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff80243ce3>]
worker_thread+0x114/0x14a
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff8027a990>]
default_wake_function+0x0/0xe
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff8022f236>]
kthread+0xd1/0x100
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff80255f38>]
child_rip+0xa/0x12
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff8022f165>]
kthread+0x0/0x100
Jul 6 09:48:09 just-read-the-instructions kernel: [<ffffffff80255f2e>]
child_rip+0x0/0x12
Jul 6 09:48:09 just-read-the-instructions kernel:
Jul 6 09:48:09 just-read-the-instructions kernel:
Jul 6 09:48:09 just-read-the-instructions kernel: Code: f0 ff 0f 79 05 e8 27
01 00 00 c3 f0 ff 07 7f 05 e8 e9 00 00
Jul 6 09:48:09 just-read-the-instructions kernel: RIP [<ffffffff8025a762>]
mutex_lock+0x0/0xb
Jul 6 09:48:09 just-read-the-instructions kernel: RSP <ffff810120201c88>
Jul 6 09:48:09 just-read-the-instructions kernel: CR2: 00000000000002c0