Do subsequent snapshots show the empty files, or is it just the live filesystem?

Does fmadm faulty or fmdump or iostat -En show any errors? Typically, if something's wrong with ZFS, it tends to complain pretty loudly about it: scrubs will show errors, or you'll get a kernel panic or a hang.

It almost looks like the F_FREESP fcntl has been run on the files (that, or ftruncate(), though I think ftruncate() would reset the size to 0). However, I would think either should bump the mtime on the files in the dataset (though something could come in and change the atime and mtime afterwards).
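Just to illustrate the semantics, here's a rough, untested sketch of that sort of call (the path is made up): freeing the whole range [0, st_size) deallocates the blocks but leaves st_size alone, so the file would still report Size: 281070 while reading back as all NULs. (IIRC, passing l_len = 0 instead frees through EOF by truncating, like ftruncate().)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
        struct stat st;
        struct flock fl;
        int fd = open("/tmp/victim", O_WRONLY);   /* hypothetical path */

        if (fd < 0 || fstat(fd, &st) != 0) {
                perror("open/fstat");
                return (1);
        }

        (void) memset(&fl, 0, sizeof (fl));
        fl.l_whence = SEEK_SET;    /* l_start is relative to start of file */
        fl.l_start = 0;            /* free from offset 0 ... */
        fl.l_len = st.st_size;     /* ... through the current EOF */

        /*
         * F_FREESP is illumos/Solaris-specific.  On ZFS this should
         * deallocate the blocks behind [0, st_size): the file keeps its
         * size but its contents read back as all NULs.
         */
        if (fcntl(fd, F_FREESP, &fl) < 0)
                perror("fcntl(F_FREESP)");

        (void) close(fd);
        return (0);
}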
Has anything changed on the client side? ISTR you can send a FREESP request via NFS, so if these empty files are occurring frequently enough, it might be worth trying to capture that traffic just to make sure it's not a client going haywire.

It does seem strange that the preferred I/O block size has changed from 128k in the live system to 274.5k(?) in the snapshot. I'm assuming the preferred I/O block size reported by stat(2) is the dataset block size, but I'd need to confirm that.

It seems like these files are archival in nature(?), i.e. at a certain point they're no longer being updated and are just being stored. While it wouldn't solve the problem, ZFS does support the immutable attribute (/usr/bin/chmod S+i file). You could try setting that on the files, which should prevent any changes until the attribute is removed. There is also the 'appendonly' attribute, which makes the file append-only (/usr/bin/chmod S+a file). If it is a badly behaving application somewhere (or even an NFSv4 bug), it's unlikely to support altering those attributes itself, so it might make the culprit stand out (by throwing read-only errors back at it).
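Something like this rough sketch (untested; my expectation that the failure surfaces as EPERM is an assumption) could be used to check that a file marked S+i really does reject writers:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int fd;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s file\n", argv[0]);
                return (2);
        }

        /* after 'chmod S+i file', this open should fail (EPERM, I'd expect) */
        fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
                (void) fprintf(stderr, "open(O_WRONLY): %s\n",
                    strerror(errno));
                return (1);
        }

        (void) printf("file is still writable\n");
        (void) close(fd);
        return (0);
}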
From: Schweiss, Chip <[email protected]>
Reply: illumos-discuss <[email protected]>
Date: January 10, 2020 at 1:09:18 PM
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: [discuss] ZFS corruption: Null filled files

I have 7 OmniOSce r151022 systems that are experiencing corruption on thousands of files. The files' content is replaced with all NULLs.

- Files are being corrupted at rest. No change is happening to any file stats when this occurs.
- Some files are still correct on the replicated copy. Replication is run every 10 minutes with incremental zfs send/receive.
- Most files have a clean version in the snapshot history. The date of the corruption is almost always after the latest date in the file stats.
- All file work happens through NFS v4 from CentOS 6 & 7 clients.
- The paths are consolidated into one path via NFS referral links.
- I have not run a scrub on the pools yet. They are all large and being hit hard as we hunt down all files with this problem. I have a resilver running on one pool.

Here's an analysis of one file. I chose a log file since the content is human-readable:

The system with the primary pool:

Null filled:

# head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29

# stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070      Blocks: 1          IO Block: 131072 regular file
Device: 4350019h/70582297d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

Found clean in the 2019-11-07 snapshot. This is the newest snapshot with this file correct.

# head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080

# stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
  Size: 281070      Blocks: 35         IO Block: 281088 regular file
Device: 435011fh/70582559d      Inode: 1043386     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
Access: 2019-10-29 07:29:18.399569973 -0500
Modify: 2019-10-29 18:19:45.006930486 -0500
Change: 2019-10-30 03:32:03.353868416 -0500
 Birth: 2019-10-29 07:29:18.399569973 -0500

On the replicated pool:

# head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
Can anyone help identify this ZFS bug?

Tobi Oetiker has suggested turning on ZFS debug messages with 'mdb -ke ::zfs_dbgmsg'. These are heavily loaded systems; can this cause performance problems?

Any help appreciated.

-Chip
