So the question is whether we can somehow figure out what is triggering that behavior ... and whether it was maybe a bug present in r22 which has since been fixed ...
Have you detected any pattern to the corruption? Time, place, file metadata?

-tobi

----- On Jan 13, 2020, at 3:41 PM, Schweiss, Chip <[email protected]> wrote:

> Jason, et al.,
>
> No faulted devices or errors. This is definitely a ZFS bug. The files with
> nulls have no indication they were modified, but once it happens, the live
> filesystem and all subsequent snapshots of the files are null filled. If
> this were caused by a normal filesystem operation, the replicated copy
> would reflect the null filling as well. As I pointed out, I'm finding
> correct files on the replicated ZFS folder at the live version when the
> source end is null filled.
>
> I doubt setting the immutable attribute would help here. ZFS doesn't know
> the file has changed. Even 'zfs diff' never lists these corrupted files
> between snapshots, even though they read back differently.
>
> -Chip
>
> On Fri, Jan 10, 2020 at 5:34 PM Jason King <[email protected]> wrote:
>
>> Do subsequent snapshots show the empty files, or is it just the live
>> filesystem?
>>
>> Does fmadm faulty, fmdump, or iostat -En show any errors? Typically if
>> something's wrong with ZFS, it tends to complain pretty loudly about it:
>> scrubs will show errors, or you'll get a kernel panic or a hang.
>>
>> It almost looks like the FREESP fcntl has been run on the files; that, or
>> ftruncate(), though I think ftruncate() would reset the size to 0.
>> However, I would think either should bump the mtime on the files on the
>> dataset (though something could come in and change the atime and mtime).
>> Has anything changed on the client side? ISTR you can send a FREESP
>> request via NFS, so if these empty files are occurring frequently enough,
>> it might be worth trying to capture that, just to make sure it's not a
>> client going haywire.
>>
>> It does seem strange that the preferred I/O block size has changed from
>> 128k in the live system to 275.5k(?) in the snapshot.
>> I'm assuming the preferred I/O block size reported by stat(2) is the
>> dataset block size, but I'd need to confirm that.
>>
>> It seems like these files are archival in nature(?), i.e. at a certain
>> point they're no longer being updated and are just being stored. While it
>> wouldn't solve the problem, ZFS does support the immutable attribute
>> (/usr/bin/chmod S+i file). You could try setting that on the file, which
>> should prevent any changes until the attribute is removed. There is also
>> the 'archive' attribute, which makes the file append only (/usr/bin/chmod
>> S+a file). If it is a badly behaving application somewhere (or even an
>> NFSv4 bug), it's unlikely to support altering those attributes itself, so
>> it might make the culprit stand out (by throwing read-only errors back to
>> it).
>>
>> From: Schweiss, Chip <[email protected]>
>> Reply: illumos-discuss <[email protected]>
>> Date: January 10, 2020 at 1:09:18 PM
>> To: [email protected]
>> Cc: [email protected], [email protected], [email protected]
>> Subject: [discuss] ZFS corruption: Null filled files
>>
>>> I have 7 OmniOSce r151022 systems that are experiencing corruption on
>>> thousands of files. The files' content is replaced with all NULLs.
>>>
>>> * Files are being corrupted at rest. No change happens to any file
>>>   stats when this occurs.
>>> * Some files are still correct on the replicated copy. Replication runs
>>>   every 10 minutes with incremental zfs send/receive.
>>> * Most files have a clean version in the snapshot history.
>>> * The date of the corruption is almost always after the latest date in
>>>   the file stats.
>>> * All file work happens through NFS v4 from CentOS 6 & 7 clients.
>>> * The paths are consolidated into one path via NFS referral links.
>>> * I have not run scrub on the pools yet. They are all large and being
>>>   hit hard as we hunt down all files with this problem. I have a
>>>   resilver running on one pool.
>>>
>>> Here's an analysis of one file. I chose a log file since the content is
>>> human-readable.
>>>
>>> The system with the primary pool:
>>>
>>> Null filled:
>>>
>>> # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> # stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   Size: 281070    Blocks: 1    IO Block: 131072    regular file
>>> Device: 4350019h/70582297d    Inode: 1043386    Links: 1
>>> Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>>> Access: 2019-10-29 07:29:18.399569973 -0500
>>> Modify: 2019-10-29 18:19:45.006930486 -0500
>>> Change: 2019-10-30 03:32:03.353868416 -0500
>>>  Birth: 2019-10-29 07:29:18.399569973 -0500
>>>
>>> Found clean in the 2019-11-07 snapshot. This is the newest snapshot
>>> with this file correct.
>>> # head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>>
>>> # stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>>   Size: 281070    Blocks: 35    IO Block: 281088    regular file
>>> Device: 435011fh/70582559d    Inode: 1043386    Links: 1
>>> Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>>> Access: 2019-10-29 07:29:18.399569973 -0500
>>> Modify: 2019-10-29 18:19:45.006930486 -0500
>>> Change: 2019-10-30 03:32:03.353868416 -0500
>>>  Birth: 2019-10-29 07:29:18.399569973 -0500
>>>
>>> On the replicated pool:
>>>
>>> # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow1.nrg.mir
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>>> 2019-10-29 07:29:18 INFO DefaultXsyncAliasRefresher:76 - Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>>>
>>> Can anyone help identify this ZFS bug?
>>>
>>> Tobi Oetiker has suggested turning on ZFS debug messages with
>>> 'mdb -ke ::zfs_dbgmsg'. These are heavily loaded systems; can this cause
>>> performance problems?
>>>
>>> Any help appreciated.
>>>
>>> -Chip

------------------------------------------
illumos: illumos-discuss
Permalink: https://illumos.topicbox.com/groups/discuss/Tf31d6ac3bf9b5d05-M21ffe18c5fd758306d5f5ec1
