I have 7 OmniOSce r151022 systems that are experiencing corruption on thousands of files. The files' content is replaced with all NULLs.
- Files are being corrupted at rest. No change is happening to any file stats when this happens. - Some files are still correct on the replicated copy. Replication is run every 10 minutes with incremental zfs send/receive. - Most files have a clean version in the snapshot history. The date of the corruption is almost after the latest date in the file stats. - All file work happens through NFS v4 from CentOS 6 & 7 clients - The paths are consolidated into one path via NFS referral links - I have not run scrub on the pools yet. They are all large and being hit hard as we hunt down all files with this problem. I have a resilver running on one pool. Here's an analysis of one file. I chose a log file since the content is human-readable: The system with the primary pool: Null filled: # head /HCPintraDB00/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 # stat /HCPintraDB00/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 File: /HCPintraDB00/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 Size: 281070 Blocks: 1 IO Block: 131072 regular file Device: 4350019h/70582297d Inode: 1043386 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN) Access: 2019-10-29 07:29:18.399569973 -0500 Modify: 2019-10-29 18:19:45.006930486 -0500 Change: 2019-10-30 03:32:03.353868416 -0500 Birth: 2019-10-29 07:29:18.399569973 -0500 Found clean in the 2019-11-07 snapshot. This is the newest snapshot with this file correct. # head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080 # stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 Size: 281070 Blocks: 35 IO Block: 281088 regular file Device: 435011fh/70582559d Inode: 1043386 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 603/ UNKNOWN) Gid: (60023/ UNKNOWN) Access: 2019-10-29 07:29:18.399569973 -0500 Modify: 2019-10-29 18:19:45.006930486 -0500 Change: 2019-10-30 03:32:03.353868416 -0500 Birth: 2019-10-29 07:29:18.399569973 -0500 On the replicated pool: # head /HCPintraDB00/ xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080 Can anyone help identify this ZFS bug? It has been suggested by Tobi Oetiker to turn on zfs debug messages with 'mdb -ke ::zfs_dbgmsg'. These are heavily loaded systems, can this cause performance problems? Any help appreciated. -Chip ------------------------------------------ illumos: illumos-discuss Permalink: https://illumos.topicbox.com/groups/discuss/Tf31d6ac3bf9b5d05-M62eeb322b4f2561729a06460 Delivery options: https://illumos.topicbox.com/groups/discuss/subscription
