So the question is whether we can somehow figure out what is triggering that behavior ... and whether it was maybe a bug present in r22 which has since been fixed ...
Have you detected any pattern to the corruption? Time, place, file metadata?

-tobi

----- On Jan 13, 2020, at 3:41 PM, Schweiss, Chip <[email protected]> wrote:

> Jason, et al.,
>
> No faulted devices or errors. This is definitely a ZFS bug. The files with
> nulls have no indication they were modified, but once it happens, the live
> filesystem and all subsequent snapshots of the files are null filled. If
> this were caused by a normal filesystem operation, the replicated copy
> would reflect the null filling as well. As I pointed out, I'm finding
> correct files on the replicated ZFS folder at the live version when the
> source end is null filled.
>
> I doubt setting the immutable attribute would help here. ZFS doesn't know
> the file has changed. Even 'zfs diff' never lists these corrupted files
> between snapshots, even though they read back differently.
>
> -Chip
>
> On Fri, Jan 10, 2020 at 5:34 PM Jason King <[email protected]> wrote:
>
>> Do subsequent snapshots show the empty files, or is it just the live
>> filesystem?
>>
>> Does fmadm faulty, fmdump, or iostat -En show any errors? Typically if
>> something's wrong with ZFS, it tends to complain pretty loudly about it:
>> scrubs will show errors, or you'll get a kernel panic or a hang.
>>
>> It almost looks like the FREESP fcntl has been run on the files; that, or
>> ftruncate(), though I think ftruncate() would reset the size to 0.
>> However, I would think either should bump the mtime on the files on the
>> dataset (though something could come in and change the atime and mtime).
>> Has anything changed on the client side? ISTR you can send a FREESP
>> request via NFS, so if these empty files are occurring frequently enough,
>> it might be worth trying to capture that, just to make sure it's not a
>> client going haywire.
>>
>> It does seem strange that the preferred I/O block size has changed from
>> 128k in the live system to 275.5k(?) in the snapshot.
>> I'm assuming the preferred I/O block size reported by stat(2) is the
>> dataset block size, but I'd need to confirm that.
>>
>> It seems like these files are archival in nature(?), i.e. at a certain
>> point they're no longer being updated and are just being stored. While it
>> wouldn't solve the problem, ZFS does support the immutable attribute
>> (/usr/bin/chmod S+i file). You could try setting that on the file, which
>> should prevent any changes until the attribute is removed. There is also
>> the 'archive' attribute, which makes the file append only (/usr/bin/chmod
>> S+a file). If it is a badly behaving application somewhere (or even an
>> NFSv4 bug), it's unlikely to support altering those attributes itself, so
>> it might make the culprit stand out (by throwing read-only errors back to
>> it).
>>
>> From: Schweiss, Chip <[email protected]>
>> Reply: illumos-discuss <[email protected]>
>> Date: January 10, 2020 at 1:09:18 PM
>> To: [email protected]
>> Cc: [email protected], [email protected], [email protected]
>> Subject: [discuss] ZFS corruption: Null filled files
>>
>>> I have 7 OmniOSce r151022 systems that are experiencing corruption on
>>> thousands of files. The files' content is replaced with all NULLs.
>>>
>>> * Files are being corrupted at rest. No change happens to any file
>>>   stats when this occurs.
>>> * Some files are still correct on the replicated copy. Replication runs
>>>   every 10 minutes with incremental zfs send/receive.
>>> * Most files have a clean version in the snapshot history.
>>> * The date of the corruption is almost always after the latest date in
>>>   the file stats.
>>> * All file work happens through NFS v4 from CentOS 6 & 7 clients.
>>> * The paths are consolidated into one path via NFS referral links.
>>> * I have not run scrub on the pools yet. They are all large and being
>>>   hit hard as we hunt down all files with this problem. I have a
>>>   resilver running on one pool.
>>>
>>> Here's an analysis of one file. I chose a log file since the content is
>>> human-readable.
>>>
>>> The system with the primary pool:
>>>
>>> Null filled:
>>>
>>> # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> # stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   Size: 281070    Blocks: 1    IO Block: 131072    regular file
>>> Device: 4350019h/70582297d    Inode: 1043386    Links: 1
>>> Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>>> Access: 2019-10-29 07:29:18.399569973 -0500
>>> Modify: 2019-10-29 18:19:45.006930486 -0500
>>> Change: 2019-10-30 03:32:03.353868416 -0500
>>>  Birth: 2019-10-29 07:29:18.399569973 -0500
>>>
>>> Found clean in the 2019-11-07 snapshot. This is the newest snapshot
>>> with this file correct.
>>> # head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>>
>>> # stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   Size: 281070    Blocks: 35    IO Block: 281088    regular file
>>> Device: 435011fh/70582559d    Inode: 1043386    Links: 1
>>> Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>>> Access: 2019-10-29 07:29:18.399569973 -0500
>>> Modify: 2019-10-29 18:19:45.006930486 -0500
>>> Change: 2019-10-30 03:32:03.353868416 -0500
>>>  Birth: 2019-10-29 07:29:18.399569973 -0500
>>>
>>> On the replicated pool:
>>>
>>> # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>>
>>> Can anyone help identify this ZFS bug?
>>>
>>> Tobi Oetiker has suggested turning on ZFS debug messages with
>>> 'mdb -ke ::zfs_dbgmsg'. These are heavily loaded systems; can this cause
>>> performance problems?
>>>
>>> Any help appreciated.
>>>
>>> -Chip

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/Tf31d6ac3bf9b5d05-M21ffe18c5fd758306d5f5ec1
