There has been discussion recently on the linux-scsi list about devices
being put offline incorrectly. There is such a problem with error
handling in v2.4.0 (and v2.2.18) for low-level drivers that use the new
error handling. The SCSI mid-level new error handling incorrectly puts
devices offline after a reset when they should stay online. (This doesn't
apply to the sym53c8xx driver, which has been discussed most recently,
since it appears to use the old error handling in v2.4.0.)
Here is a list of drivers in v2.4.0 that use the new error handling (only 10
of about 60):
3w_xxx, advansys, aha1542, dc390, eata, gdth, imm, ips, sim710, u14-34f
As well as keeping devices online after a reset if they are still working,
the patch also prints a message to the console when devices are taken
offline. Previously the SCSI mid-level issued no message when taking
devices offline, which made it hard for an administrator to notice. Some
log files and the patch are below.
Eric, let me know if this conflicts with what you have. Doug said that it
needs to be merged with James's reset and reservation conflict patch. Will
that ever get into the mainstream kernel? I would like to see it included.
This doesn't fix all of the problems with the new error handling (based on
the number of remaining FIXMEs), but for me it takes care of an important one.
Change Explanation:
1. The important change is in scsi_error.c:scsi_eh_completed_normally().
Previously this function returned FAILED if scsi_check_sense()
returned NEEDS_RETRY. This isn't right because after a SCSI bus
reset it is normal for a device to return Check Condition status
with Unit Attention. Returning FAILED results in scsi_test_unit_ready()
returning FAILED which causes scsi_unjam_host() to put the device
offline.
scsi_send_eh_cmnd() is the only function that calls
scsi_eh_completed_normally(). It is OK to return NEEDS_RETRY to
that function because it already has code to handle this return value.
2. The other notable change is in scsi_send_eh_cmnd(). For the
(!host->can_queue) case the following test was used, which is bogus:
> if (scsi_eh_completed_normally(SCpnt)) {
scsi_eh_completed_normally() always returns non-zero (FAILED and
SUCCESS are both non-zero), so the test always succeeds.
Instead, eh_state is now set to SUCCESS and the status is examined
by the code below it.
3. The other changes add logging messages to make it clear
what happens in the error handling when logging is turned on.
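
To illustrate points 1 and 2, here is a small standalone sketch. It is not
kernel code: the EX_* constants, their values, and the three function names
are assumptions made up for this example. The only properties it relies on
are the ones stated above, namely that SUCCESS and FAILED are both non-zero
and that the old code turned NEEDS_RETRY into FAILED.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative values only; what matters is that all of them are non-zero. */
enum { EX_NEEDS_RETRY = 0x2001, EX_SUCCESS = 0x2002, EX_FAILED = 0x2003 };

/* Old Check Condition handling: NEEDS_RETRY is turned into FAILED. */
int old_check_condition(int sense_result)
{
    if (sense_result == EX_NEEDS_RETRY)
        return EX_FAILED;
    return sense_result;
}

/* Patched handling: the sense result is passed through unchanged. */
int new_check_condition(int sense_result)
{
    return sense_result;
}

/* Old (!host->can_queue) test: any non-zero status is taken as success. */
int old_truthiness_test(int status)
{
    return status ? EX_SUCCESS : EX_FAILED;
}

#ifdef SKETCH_MAIN
int main(void)
{
    /* A Unit Attention after reset asks for a retry. */
    printf("old: %#x\n", (unsigned)old_check_condition(EX_NEEDS_RETRY));
    printf("new: %#x\n", (unsigned)new_check_condition(EX_NEEDS_RETRY));

    /* The truthiness test cannot tell FAILED from SUCCESS. */
    assert(old_truthiness_test(EX_FAILED) == old_truthiness_test(EX_SUCCESS));
    return 0;
}
#endif
```

Compile with -DSKETCH_MAIN to run the small demonstration. The patch keeps
the switch on the returned value, which is what lets the caller tell the
three results apart.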
---------------------------------------------------------------------------
BEFORE:
This uses a debugging advansys driver. Writing to the /proc/scsi/advansys/0
file causes "commands-dropping" to be toggled.
Script started on Thu Jan 18 11:12:36 2001
# Device works.
$ dd if=/dev/sda of=/dev/null count=1
1+0 records in
1+0 records out
# Start dropping commands.
$ echo 1 > /proc/scsi/advansys/0
# Send 1 command and stop dropping commands. After the bus reset the
# command retry is never attempted because the TUR fails first and the
# SCSI mid-level puts the device offline.
$ dd if=/dev/sda of=/dev/null count=1 & echo 1 > /proc/scsi/advansys/0
<console log> advansys: advansys_reset: board 0: SCSI bus reset started...
<console log> advansys: advansys_reset: board 0: SCSI bus reset successful
dd: /dev/sda: Input/output error
0+0 records in
0+0 records out
$ dd if=/dev/sda of=/dev/null count=1
dd: /dev/sda: Device not configured
# Device has to manually be brought back online.
Script done on Thu Jan 18 11:14:56 2001
---------------------------------------------------------------------------
AFTER:
Script started on Thu Jan 18 11:03:38 2001
Test #1:
# Device works.
$ dd if=/dev/sda of=/dev/null count=1
1+0 records in
1+0 records out
# Start dropping commands.
$ echo 1 > /proc/scsi/advansys/0
# Send 1 command and stop dropping commands. After the bus reset the
# TUR is successful after a retry, and the command is retried and
# completes successfully.
$ dd if=/dev/sda of=/dev/null count=1 & echo 1 > /proc/scsi/advansys/0
<console log> advansys: advansys_reset: board 0: SCSI bus reset started...
<console log> advansys: advansys_reset: board 0: SCSI bus reset successful
1+0 records in
1+0 records out
# Device continues to work.
$ dd if=/dev/sda of=/dev/null count=1
1+0 records in
1+0 records out
Test #2:
# Start dropping commands.
$ echo 1 > /proc/scsi/advansys/0
# Send 1 command. After the bus reset the command is retried and fails
# because the device doesn't respond. The device is set offline.
$ dd if=/dev/sda of=/dev/null count=1
<console log> advansys: advansys_reset: board 0: SCSI bus reset started...
<console log> advansys: advansys_reset: board 0: SCSI bus reset successful
<console log> scsi: device set offline - not ready or command retry failed after bus reset: host 0 channel 0 id 1 lun 0
dd: /dev/sda: Input/output error
0+0 records in
0+0 records out
# Device is offline.
$ dd if=/dev/sda of=/dev/null count=1
dd: /dev/sda: Device not configured
# Stop dropping commands.
$ echo 1 > /proc/scsi/advansys/0
# Manually restore device by removing and adding it.
$ echo "scsi remove-single-device 0 0 1 0" > /proc/scsi/scsi
$ echo "scsi add-single-device 0 0 1 0" > /proc/scsi/scsi
# Device starts working again.
$ dd if=/dev/sda of=/dev/null count=1
1+0 records in
1+0 records out
Script done on Thu Jan 18 11:10:11 2001
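
The difference between the BEFORE run and the two AFTER tests can be
modelled with a few lines of C. This is a simplified, hypothetical model and
not the mid-level code: device_stays_online(), its parameters, and the EX_*
values are assumptions for illustration. Each array entry stands for the
result of one TUR attempt after the reset.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative values only, not taken from the kernel headers. */
enum { EX_NEEDS_RETRY = 0x2001, EX_SUCCESS = 0x2002, EX_FAILED = 0x2003 };

/*
 * tur_status[i] is what the i-th TUR attempt reports. An attempt that
 * reports EX_NEEDS_RETRY is retried until max_tries attempts are used up.
 * With patched == 0, EX_NEEDS_RETRY is treated as EX_FAILED, which is the
 * behaviour before the patch.
 * Returns 1 if the device stays online, 0 if it is set offline.
 */
int device_stays_online(const int *tur_status, int max_tries, int patched)
{
    for (int i = 0; i < max_tries; i++) {
        int status = tur_status[i];

        if (!patched && status == EX_NEEDS_RETRY)
            status = EX_FAILED;
        if (status == EX_SUCCESS)
            return 1;
        if (status == EX_FAILED)
            break;
    }
    return 0;
}

#ifdef SKETCH_MAIN
int main(void)
{
    /* Unit Attention on the first TUR, good status on the retry. */
    const int working[] = { EX_NEEDS_RETRY, EX_SUCCESS };
    /* Device never answers. */
    const int dead[] = { EX_FAILED };

    printf("before patch, working device online: %d\n",
           device_stays_online(working, 2, 0));
    printf("after patch, working device online:  %d\n",
           device_stays_online(working, 2, 1));
    assert(device_stays_online(dead, 1, 1) == 0);
    return 0;
}
#endif
```

Compile with -DSKETCH_MAIN to run it. A device that never answers is still
set offline in this model, which matches Test #2.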
---------------------------------------------------------------------------
--- scsi_error.c-ORIG Wed Jan 17 14:01:24 2001
+++ scsi_error.c Thu Jan 18 11:02:03 2001
@@ -549,6 +549,9 @@
/*
* Hey, we are done. Let's look to see what happened.
*/
+ SCSI_LOG_ERROR_RECOVERY(3,
+ printk("scsi_test_unit_ready: SCpnt %p eh_state %x\n",
+ SCpnt, SCpnt->eh_state));
return SCpnt->eh_state;
}
@@ -671,11 +674,8 @@
spin_unlock_irqrestore(&io_request_lock, flags);
SCpnt->result = temp;
- if (scsi_eh_completed_normally(SCpnt)) {
- SCpnt->eh_state = SUCCESS;
- } else {
- SCpnt->eh_state = FAILED;
- }
+ /* Fall through to code below to examine status. */
+ SCpnt->eh_state = SUCCESS;
}
/*
@@ -683,7 +683,10 @@
* did complete normally.
*/
if (SCpnt->eh_state == SUCCESS) {
- switch (scsi_eh_completed_normally(SCpnt)) {
+ int ret = scsi_eh_completed_normally(SCpnt);
+ SCSI_LOG_ERROR_RECOVERY(3,
+ printk("scsi_send_eh_cmnd: scsi_eh_completed_normally %x\n", ret));
+ switch (ret) {
case SUCCESS:
SCpnt->eh_state = SUCCESS;
break;
@@ -1104,7 +1107,6 @@
*/
STATIC int scsi_eh_completed_normally(Scsi_Cmnd * SCpnt)
{
- int rtn;
/*
* First check the host byte, to see if there is anything in there
* that would indicate what we need to do.
@@ -1144,11 +1146,7 @@
case COMMAND_TERMINATED:
return SUCCESS;
case CHECK_CONDITION:
- rtn = scsi_check_sense(SCpnt);
- if (rtn == NEEDS_RETRY) {
- return FAILED;
- }
- return rtn;
+ return scsi_check_sense(SCpnt);
case CONDITION_GOOD:
case INTERMEDIATE_GOOD:
case INTERMEDIATE_C_GOOD:
@@ -1634,8 +1632,10 @@
* FIXME(eric) - is this really the correct thing to do?
*/
if (rtn != SUCCESS) {
- SCloop->device->online = FALSE;
- SCloop->host->host_failed--;
+ printk(KERN_INFO "scsi: device set offline - not ready or command retry failed after bus reset: host %d channel %d id %d lun %d\n", SDloop->host->host_no, SDloop->channel, SDloop->id, SDloop->lun);
+
+ SDloop->online = FALSE;
+ SDloop->host->host_failed--;
scsi_eh_finish_command(&SCdone, SCloop);
}
}
@@ -1725,8 +1725,9 @@
}
}
if (rtn != SUCCESS) {
- SCloop->device->online = FALSE;
- SCloop->host->host_failed--;
+ printk(KERN_INFO "scsi: device set offline - not ready or command retry failed after host reset: host %d channel %d id %d lun %d\n", SDloop->host->host_no, SDloop->channel, SDloop->id, SDloop->lun);
+ SDloop->online = FALSE;
+ SDloop->host->host_failed--;
scsi_eh_finish_command(&SCdone, SCloop);
}
}
@@ -1753,7 +1754,11 @@
for (SDpnt = host->host_queue; SDpnt; SDpnt = SDpnt->next) {
for (SCloop = SDpnt->device_queue; SCloop; SCloop = SCloop->next) {
if (SCloop->state == SCSI_STATE_FAILED || SCloop->state == SCSI_STATE_TIMEOUT) {
- SCloop->device->online = FALSE;
+ SDloop = SCloop->device;
+ if (SDloop->online == TRUE) {
+ printk(KERN_INFO "scsi: device set offline - command error recover failed: host %d channel %d id %d lun %d\n", SDloop->host->host_no, SDloop->channel, SDloop->id, SDloop->lun);
+ SDloop->online = FALSE;
+ }
/*
* This should pass the failure up to the top level driver, and
@@ -1765,7 +1770,7 @@
SCloop->result |= (DRIVER_TIMEOUT << 24);
}
SCSI_LOG_ERROR_RECOVERY(3, printk("Finishing command for device %d %x\n",
- SCloop->device->id, SCloop->result));
+ SDloop->id, SCloop->result));
scsi_eh_finish_command(&SCdone, SCloop);
}