Hi,

S10, SC3.2 + patches, Generic_142900-03, 2x T5220 with QLE2462 connected to 
6540s.

We started to observe the messages below yesterday, on both nodes at the same time, after several weeks of running:

XXX cl_runtime: [ID 856360 kern.warning] WARNING: QUORUM_GENERIC: quorum_read_keys error: Reading the registration keys failed on quorum device /dev/did/rdsk/d7s2 with error 22.
XXX cl_runtime: [ID 868277 kern.warning] WARNING: CMM: Erstwhile online quorum device /dev/did/rdsk/d7s2 (qid 1) is inaccessible now.

d7 is a quorum device and the cluster marked it as offline:

# clq status

=== Cluster Quorum ===

--- Quorum Votes Summary from latest node reconfiguration ---

            Needed   Present   Possible
            ------   -------   --------
            2        3         3


--- Quorum Votes by Node (current status) ---

Node Name             Present     Possible     Status
---------             -------     --------     ------
XXXXXXXXXXXXXXX     1           1            Online
YYYYYYYYYYYYYYY     1           1            Online


--- Quorum Votes by Device (current status) ---

Device Name       Present      Possible      Status
-----------       -------      --------      ------
d7                0            1             Offline



By looking at the source code I found that the above message is printed from 
within quorum_device_generic_impl::quorum_read_keys(), and only when 
quorum_pgre_key_read() returns something other than 0 or EACCES. We already 
know from the syslog message that the return code is 22, i.e. EINVAL.

Now quorum_pgre_key_read() calls quorum_scsi_sector_read() and passes its 
return code up as its own. quorum_scsi_sector_read() in turn can fail if 
quorum_ioctl_with_retries() fails, if the sector checksum doesn't match, or 
if the pgres_id string at the beginning of the sector is wrong.
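
(A quick way to confirm where the 22 first shows up is to watch the return 
values of both functions directly. The wildcards below are only there to cope 
with the C++ name mangling; this is a sketch, not something verified on this 
exact build:)

# dtrace -n '
fbt::*quorum_pgre_key_read*:return,
fbt::*quorum_scsi_sector_read*:return
/arg1 != 0/
{
        /* arg1 of an fbt return probe is the return value */
        printf("%s returned %d", probefunc, (int)arg1);
}'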

This is the relevant source code:
    406 int
    407 quorum_scsi_sector_read(
[...]
    449         error = quorum_ioctl_with_retries(vnode_ptr, USCSICMD, (intptr_t)&ucmd,
    450             &retval);
    451         if (error != 0) {
    452                 CMM_TRACE(("quorum_scsi_sector_read: ioctl USCSICMD "
    453                     "returned error (%d).\n", error));
    454                 kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
    455                 return (error);
    456         }
    457 
    458         //
    459         // Calculate and compare the checksum if check_data is true.
    460         // Also, validate the pgres_id string at the beg of the sector.
    461         //
    462         if (check_data) {
    463                 PGRE_CALCCHKSUM(chksum, sector, iptr);
    464 
    465                 // Compare the checksum.
    466                 if (PGRE_GETCHKSUM(sector) != chksum) {
    467                         CMM_TRACE(("quorum_scsi_sector_read: "
    468                             "checksum mismatch.\n"));
    469                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
    470                         return (EINVAL);
    471                 }
    472 
    473                 //
    474                 // Validate the PGRE string at the beg of the sector.
    475                 // It should contain PGRE_ID_LEAD_STRING[1|2].
    476                 //
    477                 if ((os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING1,
    478                     strlen(PGRE_ID_LEAD_STRING1)) != 0) &&
    479                     (os::strncmp((char *)sector->pgres_id, PGRE_ID_LEAD_STRING2,
    480                     strlen(PGRE_ID_LEAD_STRING2)) != 0)) {
    481                         CMM_TRACE(("quorum_scsi_sector_read: pgre id "
    482                             "mismatch. The sector id is %s.\n",
    483                             sector->pgres_id));
    484                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
    485                         return (EINVAL);
    486                 }
    487 
    488         }
    489         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
    490 
    491         return (error);
    492 }



Tracing the function flow with DTrace (fbt) while the failure was happening shows:

 56  -> __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555744942019 enter
 56    -> __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555744957176 enter
 56    <- __1cZquorum_ioctl_with_retries6FpnFvnode_ilpi_i_ 6308555745089857 rc: 0
 56    -> __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745108310 enter
 56      -> __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745120941 enter
 56        -> __1cCosHsprintf6FpcpkcE_v_      6308555745134231 enter
 56        <- __1cCosHsprintf6FpcpkcE_v_      6308555745148729 rc: 2890607504684
 56      <- __1cNdbg_print_bufLdbprintf_va6Mbpcrpv_v_ 6308555745162898 rc: 1886718112
 56    <- __1cNdbg_print_bufIdbprintf6MpcE_v_ 6308555745175529 rc: 1886718112
 56  <- __1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_ 6308555745188599 rc: 22
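
There is nothing fancy behind that trace: just fbt probes on the (C++-mangled) 
kernel symbols plus flowindent. Something roughly like the script below 
reproduces it (a sketch only; enabling every fbt entry/return probe is 
heavyweight, so it should run only while reproducing the problem):

#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* start following once we enter quorum_scsi_sector_read() */
fbt::__1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_:entry
{
        self->follow = 1;
}

fbt:::entry
/self->follow/
{
        printf("%d enter", timestamp);
}

fbt:::return
/self->follow/
{
        /* arg1 is the raw return value (meaningless for void functions) */
        printf("%d rc: %d", timestamp, (int)arg1);
}

/* stop following when quorum_scsi_sector_read() returns */
fbt::__1cXquorum_scsi_sector_read6FpnFvnode_LpnLpgre_sector_b_i_:return
{
        self->follow = 0;
}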

From the trace we know that quorum_ioctl_with_retries() returned 0, so the 
EINVAL must come from one of the two data checks: the checksum comparison or 
the pgres_id validation. Both of them log through CMM_TRACE(), and CMM_TRACE() 
was indeed called, so let's check which message it was:

 21  -> __1cNdbg_print_bufIdbprintf6MpcE_v_   6309628794339298 CMM_DEBUG: quorum_scsi_sector_read: checksum mismatch.
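
(The text handed to CMM_TRACE() can be picked up at dbg_print_buf::dbprintf() 
entry; since it is a C++ member function, arg0 is `this` and arg1 should be 
the format string. Again just a sketch:)

# dtrace -qn '
fbt::__1cNdbg_print_bufIdbprintf6MpcE_v_:entry
{
        printf("%s\n", stringof((char *)arg1));
}'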


So this is where it fails:

    462         if (check_data) {
    463                 PGRE_CALCCHKSUM(chksum, sector, iptr);
    464 
    465                 // Compare the checksum.
    466                 if (PGRE_GETCHKSUM(sector) != chksum) {
    467                         CMM_TRACE(("quorum_scsi_sector_read: "
    468                             "checksum mismatch.\n"));
    469                         kmem_free(ucmd.uscsi_rqbuf, (size_t)SENSE_LENGTH);
    470                         return (EINVAL);
    471                 }



By adding another quorum device, then removing d7 and adding it back (and 
finally removing the extra device), everything came back to normal. But I 
wonder how we ended up here in the first place: the HBA? its firmware? the 
6540s' firmware? an SC bug?
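
For reference, the shuffle was done with the usual clquorum steps, roughly as 
below (d9 is just a placeholder for whatever spare DID device gets added 
temporarily):

# clquorum add d9        # temporary second quorum device (placeholder name)
# clquorum remove d7     # drop the flaky quorum device
# clquorum add d7        # re-add d7; after this the keys read fine again
# clquorum remove d9     # drop the temporary device
# clquorum status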

# fcinfo hba-port -l
HBA Port WWN: 2100001b3291014c
        OS Device Name: /dev/cfg/c2
        Manufacturer: QLogic Corp.
        Model: 375-3356-02
        Firmware Version: 05.01.00
        FCode/BIOS Version:  BIOS: 2.10; fcode: 2.4; EFI: 2.4;
        Serial Number: 0402R00-0927731201
        Driver Name: qlc
        Driver Version: 20090519-2.31
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb 
        Current Speed: 4Gb 
        Node WWN: 2000001b3291014c
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0
HBA Port WWN: 2101001b32b1014c
        OS Device Name: /dev/cfg/c3
        Manufacturer: QLogic Corp.
        Model: 375-3356-02
        Firmware Version: 05.01.00
        FCode/BIOS Version:  BIOS: 2.10; fcode: 2.4; EFI: 2.4;
        Serial Number: 0402R00-0927731201
        Driver Name: qlc
        Driver Version: 20090519-2.31
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb 
        Current Speed: 4Gb 
        Node WWN: 2001001b32b1014c
        Link Error Statistics:
                Link Failure Count: 0
                Loss of Sync Count: 0
                Loss of Signal Count: 0
                Primitive Seq Protocol Error Count: 0
                Invalid Tx Word Count: 0
                Invalid CRC Count: 0


142084-02 is applied, and at a quick glance I can't see anything related to 
the above that might be addressed by 142084-03.

Each 6540 presents one 2TB LUN and we use ZFS to mirror between them. One of 
the LUNs also serves as the quorum device.
Since it looks like the quorum data was corrupted, the pool itself might be 
affected as well, so I started a scrub; after a couple of hours it has found 
this so far:

# zpool status -v XXXX
  pool: XXXX
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 2h29m, 56.94% done, 1h52m to go
config:

        NAME                                       STATE     READ WRITE CKSUM
        XXXX                                       DEGRADED     0     0    14
          mirror                                   DEGRADED     0     0    28
            c4t600A0B800029AF0000006CD4486B3B05d0  DEGRADED     0     0    28  too many errors
            c4t600A0B800029B74600004255486B6A4Fd0  DEGRADED     0     0    28  too many errors

errors: Permanent errors have been detected in the following files:

        /XXXX/XXXX/XXXXXXXX/YYYYYY.dbf


I can't see any other errors on the system, in the logs, or from FMA. The HBA 
firmware also seems to be the latest version.

Because of the corruption within the ZFS pool, I think that although the issue 
first manifested itself as a problem with the quorum device, it really has 
nothing to do with SC itself; data is being corrupted somewhere else. The other 
interesting thing is that, so far, every corrupted block detected by ZFS is 
corrupted on both sides of the mirror. Since each side is a separate disk 
array, the corruption most probably originated on the server itself rather 
than on the SAN or the arrays. The HBA is a dual-ported card and both paths 
are in use (MPxIO). The issue is also unlikely to be caused by ZFS itself, as 
ZFS shouldn't touch the SC keys on the quorum device at all.


Any ideas?

-- 
Robert Milkowski
http://milek.blogspot.com
-- 
This message posted from opensolaris.org
