storage failure

rysanek Wed, 10 Sep 2003 02:13:23 -0700

Dear Mr. Long,

thanks a lot for taking the time to respond,
especially given that you're on vacation and
that it's almost 2 a.m. your time.


I apologize for the vague wording in my previous mail.
Also, I may have assigned the wrong meanings to some of
the terms that occur in the driver's code.
Specifically:

> 2. What is a zero-padded FIB?  I concede that the AIF handling in the driver
> is sub-par and needs to be revisited, so I'd like to know what you are seeing.
>
I was referring to this:

aac_dequeue_fib: called
aac0: aac_host_command: FIB @ 0xe1984000
aac0:   XferState 0
aac0:   Command       0
aac0:   StructType    0
aac0:   Flags         0x0
aac0:   Size          0
aac0:   SenderSize    0
aac0:   SenderAddress 0x0
aac0:   RcvrAddress   0x0
aac0:   SenderData    0x0
aac0:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
aac0:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
aac0: unknown command from controller

The size is zero and the data dump at the end contains all
zeroes. That's why I called it a "zero-padded FIB".
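
As a minimal sketch (not the driver's code), the test for such a
FIB could look like this in C. The struct is a hypothetical,
simplified stand-in: the field names follow the log output above,
while the field widths, their order and the 32-byte data area are
assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified FIB header. Field names mirror the log
 * output; widths and ordering are assumptions, not the real layout. */
struct sketch_fib_header {
    uint32_t xfer_state;
    uint16_t command;
    uint8_t  struct_type;
    uint8_t  flags;
    uint16_t size;
    uint16_t sender_size;
    uint32_t sender_address;
    uint32_t rcvr_address;
    uint32_t sender_data;
};

struct sketch_fib {
    struct sketch_fib_header header;
    uint8_t data[32];   /* only the 32 bytes shown in the dump */
};

/* Returns true when every header field and every data byte is zero. */
static bool sketch_fib_is_all_zero(const struct sketch_fib *fib)
{
    const struct sketch_fib_header *h = &fib->header;

    if (h->xfer_state || h->command || h->struct_type || h->flags ||
        h->size || h->sender_size || h->sender_address ||
        h->rcvr_address || h->sender_data)
        return false;

    for (size_t i = 0; i < sizeof(fib->data); i++) {
        if (fib->data[i] != 0)
            return false;
    }
    return true;
}
```

The fields are compared one by one rather than scanning the raw
bytes of the struct, so that any compiler padding cannot affect
the result.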

This only occurs when an "unhandled array failure" arrives,
i.e. when the machine is about to hang, either upon runtime array
degradation or at boot from a degraded array. Normally that only
happens with SMP+APIC_IO enabled, not with a UP kernel.

All the other FIB listings that I've seen contain some
non-zero data and claim non-zero length...

I attached a tarball with some logs to my last message.
To see what I'm talking about, please take a look at this:

- runtime array degradation - compare the two logs:
 - SMP, unrecoverable failure:
    logs/DEBUG_CAM_AAC_L2/SMP-2_disk_failed (line 23)
 - UP, system keeps going just fine:
    logs/DEBUG_CAM_AAC_L2/NOSMP-2_disk_failed (line 76)

- boot from a degraded array - compare the two logs:
 - SMP, unrecoverable failure
    logs/DEBUG_AAC_L4/SMP-3_boot_with_degraded_array_failed (line 296)
 - UP, system boots just fine:
    logs/DEBUG_AAC_L4/NOSMP-3_boot_with_degraded_array_OK (line 273)

> 3. The split and corrupted messages on the console were likely due to
> kernel printfs happening from different contexts at the same time.  The
> printf facility has no serializing ability, unfortunately.
>
OK.

> 4. I'm unclear on what you mean by there being a problem in the
> asynchronous handling of device printfs and host command fibs.  I'd be
> very interested in more information on this.
>
I didn't mean to say that the cause of my problem lies in that area.

I meant to say that I have a problem understanding what's going on.
I'm not a skilled coder and I have a hard time understanding how the
driver's code works. I am able to see where a function is called
with some arguments and returns with a result. However, when a SCSI
command is issued to the controller, at the driver level the
request and the response don't happen within a single function.
One function queues the command to the controller via the MMIO (?)
region, and the response from the controller eventually comes
back in an interrupt, which invokes an interrupt handler.
The response may be a valid SCSI response to the SCSI command,
or a "something went wrong" Monitor event.
I am vaguely aware that the SCSI controller can reorder commands in the
queue or process them out of order.
Combine this with the unserialized logging and I'm lost :)
Sorry.
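
The split between the submit path and the interrupt path described
above can be sketched in plain user-space C. This is only an
illustration of the general pattern, not how the aac driver
implements it: every name here (sketch_softc, sketch_submit,
sketch_intr) is hypothetical, and matching a response to its
command by a slot number ("tag") is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_CMDS 8

/* What the (simulated) controller hands back in an interrupt. */
enum sketch_kind { SKETCH_RESPONSE, SKETCH_EVENT };

struct sketch_cmd {
    bool in_flight;   /* submitted, no response seen yet */
    bool done;        /* response has arrived */
    int  status;      /* status reported with the response */
};

struct sketch_softc {
    struct sketch_cmd cmds[SKETCH_MAX_CMDS];
    unsigned events_seen;    /* "something went wrong" notifications */
    unsigned unknown_seen;   /* responses that match no command */
};

/* Submit path: claims a free slot and returns its tag at once.
 * It does not wait for the response. Returns -1 if all slots are busy. */
static int sketch_submit(struct sketch_softc *sc)
{
    for (int tag = 0; tag < SKETCH_MAX_CMDS; tag++) {
        if (!sc->cmds[tag].in_flight) {
            sc->cmds[tag].in_flight = true;
            sc->cmds[tag].done = false;
            return tag;
        }
    }
    return -1;
}

/* Interrupt path: runs later, possibly in a different order than
 * the commands were submitted, and completes the matching slot. */
static void sketch_intr(struct sketch_softc *sc, enum sketch_kind kind,
                        int tag, int status)
{
    if (kind == SKETCH_EVENT) {
        sc->events_seen++;       /* not tied to any command */
        return;
    }
    if (tag < 0 || tag >= SKETCH_MAX_CMDS || !sc->cmds[tag].in_flight) {
        sc->unknown_seen++;      /* nothing was waiting for this */
        return;
    }
    sc->cmds[tag].status = status;
    sc->cmds[tag].done = true;
    sc->cmds[tag].in_flight = false;
}
```

Because sketch_submit returns before the response exists, a log
line printed at submit time and the log line for the matching
completion can be separated by output belonging to other commands.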

If there's something specific I should check for, please let me know.
Thanks for being patient with my hasty descriptions :)

Frank Rysanek

Re: FreeBSD 4.8, ASR2120, SMP, degraded RAID1/mirror => storage failure
