Bob Friesenhahn wrote:
> On Mon, 28 Jul 2008, Richard Elling wrote:
>>
>> But ZFS can do better.  I filed CR6674679 which basically says
>> that if redundant copies of data have the same, wrong checksum,
>> then ZFS should issue an e-report to that effect.  This will allow
>> you to move suspicion away from the disks as a root cause towards
>> a common cause, like memory, shared HBA or bus, etc. It won't
>> be able to recover the data, but it can help debug the system.
>
> A rather obvious thing to do is to have a low-priority task running 
> which validates checksums of memory in the ZFS ARC.  That way memory 
> content which is somehow altered (due to memory glitch or kernel bug) 
> will be detected so someone can fix the problem.  Even ECC memory will 
> not fix the problem when an adaptor card writes to the wrong location, 
> or a device driver does something wrong.

We already have memory scrubbers which check memory.  Actually,
we've had these for about 10 years, but they only work for ECC
memory... if you have only parity memory, then you can't fix anything
at the hardware level, and the best you can hope for is that FMA will
do the right thing.
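
For what it's worth, the kind of low-priority cache check Bob describes
could look roughly like the toy sketch below.  This is Python, not ZFS
code: ToyCache, insert, raw and scrub are hypothetical names made up for
illustration, and CRC32 is only a stand-in for whatever checksum the
real cache would use.

```python
import zlib


class ToyCache:
    """Hypothetical stand-in for a block cache; this is not the real ARC."""

    def __init__(self):
        # key -> (live buffer, checksum recorded when the buffer was inserted)
        self._bufs = {}

    def insert(self, key, data):
        buf = bytearray(data)
        self._bufs[key] = (buf, zlib.crc32(buf))

    def raw(self, key):
        # Expose the live buffer so the example can simulate a stray write.
        return self._bufs[key][0]

    def scrub(self):
        """Recompute every checksum; return the keys whose contents changed."""
        return sorted(
            key
            for key, (buf, csum) in self._bufs.items()
            if zlib.crc32(buf) != csum
        )


cache = ToyCache()
cache.insert("blk0", b"hello")
cache.insert("blk1", b"world")

# Simulate a bad DMA or driver bug flipping one bit in a cached buffer.
cache.raw("blk1")[0] ^= 0x01

damaged = cache.scrub()  # keys that no longer match their recorded checksum
```

In a real system the scrub pass would presumably run from a low-priority
thread and report through FMA rather than return a list; the sketch only
shows the detection step.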

It is not clear to me where ARC validation occurs.  Perhaps someone
who deals with the ARC code could shed some light.
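
To make the CR6674679 idea a bit more concrete, here is a rough sketch
of the classification step, again in Python and not taken from ZFS.
The function name classify_copies, its return labels, and the use of
SHA-256 are all assumptions for illustration.

```python
import hashlib


def classify_copies(expected, copies):
    """Classify redundant copies of one block against its expected checksum.

    Returns "ok" if at least one copy matches, "common-cause" if every
    copy fails with the same wrong checksum, and "independent" otherwise.
    """
    sums = [hashlib.sha256(c).hexdigest() for c in copies]
    if expected in sums:
        return "ok"  # a good copy exists, so the data can be recovered
    if len(sums) > 1 and len(set(sums)) == 1:
        # All copies are wrong in exactly the same way: suspect something
        # shared (memory, HBA, bus) rather than the individual disks.
        return "common-cause"
    return "independent"


good = b"payload"
expected = hashlib.sha256(good).hexdigest()

verdict = classify_copies(expected, [b"garbage", b"garbage"])
```

A "common-cause" verdict is where the e-report would be issued; the data
is still lost, but the suspicion moves away from the disks.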
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
