I have 7 OmniOSce r151022 systems that are experiencing corruption on
thousands of files.   The files' content is replaced with all NULLs.

   - Files are being corrupted at rest.  No change is happening to any file
   stats when this happens.
   - Some files are still correct on the replicated copy.   Replication is
   run every 10 minutes with incremental zfs send/receive.
   - Most files have a clean version in the snapshot history.   The date of
   the corruption is almost after the latest date in the file stats.
   - All file work happens through NFS v4 from CentOS 6 & 7 clients
   - The paths are consolidated into one path via NFS referral links
   - I have not run scrub on the pools yet.   They are all large and being
   hit hard as we hunt down all files with this problem.   I have a resilver
   running on one pool.

Here's an analysis of one file.  I chose a log file since the content is
human-readable:

The system with the primary pool:
Null filled:

# head /HCPintraDB00/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29

# stat /HCPintraDB00/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 1          IO Block: 131072 regular file
Device: 4350019h/70582297d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

Found clean in the 2019-11-07 snapshot.  This is the newest snapshot with
this file correct.
# head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow2.nrg.mir:8080

# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070          Blocks: 35         IO Block: 281088 regular file
Device: 435011fh/70582559d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

On the replicated pool:

# head /HCPintraDB00/
xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 - Refreshing Alias
for http://intradb-shadow2.nrg.mir:8080

Can anyone help identify this ZFS bug?

It has been suggested by Tobi Oetiker to turn on zfs debug messages with
'mdb -ke ::zfs_dbgmsg'.   These are heavily loaded systems, can this cause
performance problems?

Any help appreciated.

-Chip

------------------------------------------
illumos: illumos-discuss
Permalink: 
https://illumos.topicbox.com/groups/discuss/Tf31d6ac3bf9b5d05-M62eeb322b4f2561729a06460
Delivery options: https://illumos.topicbox.com/groups/discuss/subscription

Reply via email to