I know about the "issue" with quorum and ZFS when given an entire disk. The quorum was added *after* pool was created. Then even if it was not the case (it was) then it would not cause these additional checksum errors within the pool.
When it comes to SCSI-2 reservations - the cluster is connected to Sun 6540 disk arrays which have other clusters connected to them as well, working fine. Because the corruption happened within the pool as well, I'm pretty sure there is some issue with a driver and/or firmware, or possibly an FC switch (less likely).

Re the cmm buffer - I actually intercepted writes to it while this was happening, and it was complaining about a wrong checksum, as you can see in my original email. Unfortunately, after the last changes (adding a different quorum device and then re-adding the original one, etc.) the old entries are already gone...

Thanks for looking into it.

ps. and no, the pool has not been imported simultaneously on both nodes

On 04/02/2010 18:20, Ellard Roush wrote:
> Hi Robert,
>
> There is a common problem that affects people with
> your configuration.
>
> ZFS recommends that you place an entire disk in the zpool.
> When that happens ZFS formats the disk.
> The format operation destroys any quorum information on the disk.
> If you configured the disk as a quorum device and then added
> that disk to a ZFS zpool, this sequence of operations would
> cause the quorum information to be destroyed.
>
> You can use a disk in a zpool as a quorum device.
> The Sun Cluster documentation states that you must
> add the disk to the ZFS zpool first and then
> configure the disk as a quorum device.
>
> There are other possible issues. Sometimes the vendor of the storage
> device does not properly test the SCSI-2 Reservations or SCSI-3 PGR.
> Sun Cluster has an audit trail of all changes to information on the
> quorum device (plus other membership subsystem operations).
> Use mdb to look at the following memory-resident debug print buffer:
>
> mdb> *cmm_dbg_buf/s
>
> This will provide more information.
>
> Regards,
> Ellard
>
> On 02/04/10 08:33, Robert Milkowski wrote:
>> Hi,
>>
>> S10, SC3.2 + patches, Generic_142900-03, 2x T5220 with QLE2462
>> connected to 6540s.
>>
>> We started to observe the below messages yesterday at both nodes at the
>> same time, after several weeks of running:
>>
>> <pre>
>> XXX cl_runtime: [ID 856360 kern.warning] WARNING: QUORUM_GENERIC:
>> quorum_read_keys error: Reading the registration keys failed on
>> quorum device /dev/did/rdsk/d7s2 with error 22.
>> XXX cl_runtime: [ID 868277 kern.warning] WARNING: CMM: Erstwhile
>> online quorum device /dev/did/rdsk/d7s2 (qid 1) is inaccessible now.
>>
>> d7 is a quorum device and it was marked by the cluster as offline:
>>
>> # clq status
>>
>> === Cluster Quorum ===
>>
>> --- Quorum Votes Summary from latest node reconfiguration ---
>>
>>             Needed   Present   Possible
>>             ------   -------   --------
>>             2        3         3
>>
>>
>> --- Quorum Votes by Node (current status) ---
>>
>> Node Name            Present   Possible   Status
>> ---------            -------   --------   ------
>> XXXXXXXXXXXXXXX      1         1          Online
>> YYYYYYYYYYYYYYY      1         1          Online
>>
>>
>> --- Quorum Votes by Device (current status) ---
>>
>> Device Name   Present   Possible   Status
>> -----------   -------   --------   ------
>> d7            0         1          Offline
>>
>>
>>
>> By looking at the source code I found that the above message is
>> printed from within quorum_device_generic_impl::quorum_read_keys()
>> and it will only happen if quorum_pgre_key_read() returns with a return
>> code of 22 (actually any code other than 0 or EACCES, but we already
>> know that the rc is 22 from the syslog message).
>>
>> Now quorum_pgre_key_read() calls quorum_scsi_sector_read() and passes
>> its return code on as its own.
>> quorum_scsi_sector_read() can return an error if
>> quorum_ioctl_with_retries() returns an error or if there is a
>> checksum mismatch.
>>
>> This is the relevant source code:
>>
>> 406 int
>> 407 quorum_scsi_sector_read(
>> [...]
>> 449         error = quorum_ioctl_with_retries(vnode_ptr, USCSICMD, (intptr_t)&ucmd,
>> 450             &retval);
>> 451         if (error != 0) {
>> 452                 CMM_TRACE(("quorum_scsi_sector_read: ioctl USCSICMD "
>> 453                     "returned error (%d).\n", error));
>> 454                 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>> 455                 return (error);
>> 456         }
>> 457
>> 458         //
>> 459         // Calculate and compare the checksum if check_data is true.
>> 460         // Also, validate the pgres_id string at the beg of the sector.
>> 461         //
>> 462         if (check_data) {
>> 463                 PGRE_CALCCHKSUM(chksum, sector, iptr);
>> 464
>> 465                 // Compare the checksum.
>> 466                 if (PGRE_GETCHKSUM(sector) != chksum) {
>> 467                         CMM_TRACE(("quorum_scsi_sector_read: "
>> 468                             "checksum mismatch.\n"));
>> 469                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>> 470                         return (EINVAL);
>> 471                 }
>> 472
>> 473                 //
>> 474                 // Validate the PGRE string at the beg of the sector.
>> 475                 // It should contain PGRE_ID_LEAD_STRING[1|2].
>> 476                 //
>> 477                 if ((os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING1,
>> 478                     strlen(PGRE_ID_LEAD_STRING1)) != 0) &&
>> 479                     (os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING2,
>> 480                     strlen(PGRE_ID_LEAD_STRING2)) != 0)) {
>> 481                         CMM_TRACE(("quorum_scsi_sector_read: pgre id "
>> 482                             "mismatch. The sector id is %s.\n",
>> 483                             sector->pgres_id));
>> 484                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>> 485                         return (EINVAL);
>> 486                 }
>> 487
>> 488         }
>> 489         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>> 490
>> 491         return (error);
>> 492 }
>>
>>
>>
>> 56 -> __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555744942019 enter
>> 56   -> __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555744957176 enter
>> 56   <- __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555745089857 rc: 0
>> 56   -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745108310 enter
>> 56     -> __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745120941 enter
>> 56       -> __1cCosHsprintf6FpcpkcE_v_ 6308555745134231 enter
>> 56       <- __1cCosHsprintf6FpcpkcE_v_ 6308555745148729 rc: 2890607504684
>> 56     <- __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745162898 rc: 1886718112
>> 56   <- __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745175529 rc: 1886718112
>> 56 <- __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555745188599 rc: 22
>>
>> From the above output we know that quorum_ioctl_with_retries()
>> returns 0, so it must be a checksum mismatch!
>> As CMM_TRACE() is called above, and there are two calls to it in the
>> code, let's check which one it is:
>>
>> 21 -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6309628794339298 CMM_DEBUG: quorum_scsi_sector_read: checksum mismatch.
>>
>>
>> So this is where it fails:
>>
>> 462         if (check_data) {
>> 463                 PGRE_CALCCHKSUM(chksum, sector, iptr);
>> 464
>> 465                 // Compare the checksum.
>> 466                 if (PGRE_GETCHKSUM(sector) != chksum) {
>> 467                         CMM_TRACE(("quorum_scsi_sector_read: "
>> 468                             "checksum mismatch.\n"));
>> 469                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
>> 470                         return (EINVAL);
>> 471                 }
>>
>>
>>
>> By adding another quorum device, then removing d7 and adding it again
>> (and removing the extra one), everything came back to normal. However,
>> I wonder how we ended up there? HBA? Firmware? The 6540's firmware? An
>> SC bug?
>>
>> # fcinfo hba-port -l
>> HBA Port WWN: 2100001b3291014c
>>         OS Device Name: /dev/cfg/c2
>>         Manufacturer: QLogic Corp.
>>         Model: 375-3356-02
>>         Firmware Version: 05.01.00
>>         FCode/BIOS Version: BIOS: 2.10; fcode: 2.4; EFI: 2.4;
>>         Serial Number: 0402R00-0927731201
>>         Driver Name: qlc
>>         Driver Version: 20090519-2.31
>>         Type: N-port
>>         State: online
>>         Supported Speeds: 1Gb 2Gb 4Gb
>>         Current Speed: 4Gb
>>         Node WWN: 2000001b3291014c
>>         Link Error Statistics:
>>                 Link Failure Count: 0
>>                 Loss of Sync Count: 0
>>                 Loss of Signal Count: 0
>>                 Primitive Seq Protocol Error Count: 0
>>                 Invalid Tx Word Count: 0
>>                 Invalid CRC Count: 0
>> HBA Port WWN: 2101001b32b1014c
>>         OS Device Name: /dev/cfg/c3
>>         Manufacturer: QLogic Corp.
>>         Model: 375-3356-02
>>         Firmware Version: 05.01.00
>>         FCode/BIOS Version: BIOS: 2.10; fcode: 2.4; EFI: 2.4;
>>         Serial Number: 0402R00-0927731201
>>         Driver Name: qlc
>>         Driver Version: 20090519-2.31
>>         Type: N-port
>>         State: online
>>         Supported Speeds: 1Gb 2Gb 4Gb
>>         Current Speed: 4Gb
>>         Node WWN: 2001001b32b1014c
>>         Link Error Statistics:
>>                 Link Failure Count: 0
>>                 Loss of Sync Count: 0
>>                 Loss of Signal Count: 0
>>                 Primitive Seq Protocol Error Count: 0
>>                 Invalid Tx Word Count: 0
>>                 Invalid CRC Count: 0
>>
>>
>> 142084-02 is applied, and at a quick glance I can't see anything
>> related to the above which might be addressed by 142084-03.
>>
>> Each 6540 presents one 2TB LUN and we are using ZFS to mirror between
>> them. One of the LUNs is used as the quorum device as well.
>> Since it looks like data was corrupted for the quorum device, the pool
>> itself might be affected as well, so I ran a scrub; after a couple of
>> hours I have got this far:
>>
>> # zpool status -v XXXX
>>   pool: XXXX
>>  state: DEGRADED
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: scrub in progress for 2h29m, 56.94% done, 1h52m to go
>> config:
>>
>>         NAME                                       STATE     READ WRITE CKSUM
>>         XXXX                                       DEGRADED     0     0    14
>>           mirror                                   DEGRADED     0     0    28
>>             c4t600A0B800029AF0000006CD4486B3B05d0  DEGRADED     0     0    28  too many errors
>>             c4t600A0B800029B74600004255486B6A4Fd0  DEGRADED     0     0    28  too many errors
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         /XXXX/XXXX/XXXXXXXX/YYYYYY.dbf
>>
>>
>> I can't see any other errors in the system, in the logs, or from FMA.
>> The HBA firmware seems to be the latest version as well.
>>
>> Because of the corruption within the ZFS pool, I think that while the
>> issue first manifested itself as a problem with the quorum device, it
>> likely has nothing to do with SC itself; the data corruption is
>> happening somewhere else. The other interesting thing is that, so far,
>> all the corrupted blocks detected by ZFS were corrupted on both sides
>> of the mirror. Since each side is a separate disk array, I think the
>> corruption most probably originated on the server itself rather than
>> on the SAN or the disk arrays. The HBA is a dual-ported card and both
>> paths are used (MPxIO). The issue is also unlikely to be caused by ZFS
>> itself, as ZFS shouldn't have affected the SC keys on the quorum device.
>>
>>
>> Any ideas?
>> </pre>
>>
> _______________________________________________
> ha-clusters-discuss mailing list
> ha-clusters-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss