I have a large JBOD attached to my server via an LSI SAS2308 PCI card (mpt2sas
driver). I have about 40 drives right now, assembled into 4 Linux software
RAID sets, and I am using those RAID volumes as back-end devices for GPFS.
Everything was working fine about a week ago, when I had 20 drives and 2 RAID
volumes. Then I added 20 new disks, all the same model, and now I frequently
see every device behind the SAS card report device_blocked, immediately
followed by device_unblocked. These events coincide with periods of many
seconds with no data throughput, and they happen often enough to cause major
throughput problems. I have seen similar problems in the past, but they were
accompanied by some kind of disk-specific error and I could fix the situation
by removing the disk. In this case there are no other errors in any log besides
the device_blocked and device_unblocked messages on every single device.
This system is not in production yet so I can blow it all away if I need to, 
but I really want to understand what is causing this so that if it does come 
back once we go into production I'll be able to fix it without major 
disruptions. I suspect there is a misbehaving drive, but there is nothing 
pointing to a single drive and I could be completely wrong about that. Does 
anybody have any clue where to look?

Here is what the error logs look like:

Jun 11 19:29:17 storage003 kernel: sd 6:0:0:0: device_blocked, handle(0x0016)
Jun 11 19:29:17 storage003 kernel: sd 6:0:1:0: device_blocked, handle(0x000b)
Jun 11 19:29:17 storage003 kernel: sd 6:0:2:0: device_blocked, handle(0x000c)
Jun 11 19:29:17 storage003 kernel: ses 6:0:3:0: device_blocked, handle(0x000e)
Jun 11 19:29:17 storage003 kernel: sd 6:0:4:0: device_blocked, handle(0x000f)
Jun 11 19:29:17 storage003 kernel: sd 6:0:5:0: device_blocked, handle(0x0010)
... Same thing for the rest of the devices on host6
Jun 11 19:29:18 storage003 kernel: sd 6:0:0:0: device_unblocked and set to running, handle(0x0016)
Jun 11 19:29:18 storage003 kernel: sd 6:0:1:0: device_unblocked and set to running, handle(0x000b)
Jun 11 19:29:18 storage003 kernel: sd 6:0:2:0: device_unblocked and set to running, handle(0x000c)
Jun 11 19:29:18 storage003 kernel: ses 6:0:3:0: device_unblocked and set to running, handle(0x000e)
Jun 11 19:29:18 storage003 kernel: sd 6:0:4:0: device_unblocked and set to running, handle(0x000f)
Jun 11 19:29:18 storage003 kernel: sd 6:0:5:0: device_unblocked and set to running, handle(0x0010)
... Same thing for the rest of the devices again.
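
To line these events up against the throughput stalls, it may help to
summarize them per second rather than read them device by device. Below is a
minimal sketch in Python, assuming each kernel message sits on a single line
in the log file and follows the format shown above. The names EVENT_RE,
summarize and SAMPLE are made up for illustration, and SAMPLE only repeats a
few of the messages quoted in this mail.

```python
import re
from collections import defaultdict

# Matches the block/unblock kernel messages in the syslog format quoted above.
# Assumption: each message is on one line in the actual log file.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<kind>sd|ses)\s+(?P<addr>\d+:\d+:\d+:\d+):\s+"
    r"(?P<event>device_blocked|device_unblocked)\b"
    r".*?handle\((?P<handle>0x[0-9a-fA-F]+)\)"
)


def summarize(lines):
    """Group events by (timestamp, event) and collect the SCSI addresses seen."""
    groups = defaultdict(set)
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            groups[(m["ts"], m["event"])].add(m["addr"])
    return dict(groups)


# Illustrative input only: three of the messages from the excerpt above.
SAMPLE = [
    "Jun 11 19:29:17 storage003 kernel: sd 6:0:0:0: device_blocked, handle(0x0016)",
    "Jun 11 19:29:17 storage003 kernel: ses 6:0:3:0: device_blocked, handle(0x000e)",
    "Jun 11 19:29:18 storage003 kernel: sd 6:0:0:0: device_unblocked and set to running, handle(0x0016)",
]

if __name__ == "__main__":
    for (ts, event), addrs in sorted(summarize(SAMPLE).items()):
        print(f"{ts}  {event:<17} {len(addrs)} device(s)")
```

Feeding the full log through summarize() instead of SAMPLE gives one row per
second and event type, which can then be compared with the timestamps of the
stalls.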

Thanks,
Mike Robbert
