A shot in the dark maybe, but the most recent known bug to cause
corruption was in the ZFS TRIM code. The related review requests are
12051, 11714 and 12070, work by Jerry Jelinek. Could it be that not all
of the TRIM enabling and disabling code got pulled correctly into OmniOS?

Greetings
Till

On 14.01.20 11:05, Tobias Oetiker wrote:
> So the question is if we can somehow figure out what is triggering that
> behavior ... and whether it was maybe a bug present in r22 which has since
> been fixed ...
> 
> Have you detected any pattern to the corruption? Time, place, file metadata?
> 
> -tobi
> 
> ----- On Jan 13, 2020, at 3:41 PM, Schweiss, Chip <[email protected]>
> wrote:
> 
>     Jason, et al.,
> 
>     No faulted devices or errors.   
> 
>     This is definitely a ZFS bug.  The files with nulls have no
>     indication they were modified, but once a file is corrupted, the
>     live filesystem and all subsequent snapshots of that file are null
>     filled.  If this were caused by a normal file system operation, the
>     replicated copy would reflect the null filling as well.  As I
>     pointed out, I'm finding correct files in the live version of the
>     replicated zfs folder when the source end is null filled.
> 
>     I doubt setting the immutable attribute would help here.  ZFS
>     doesn't know the file has changed.  Even 'zfs diff' never lists
>     these corrupted files between snapshots, even though they are
>     returning different content.
> 
>     -Chip
> 
> 
>     On Fri, Jan 10, 2020 at 5:34 PM Jason King
>     <[email protected]> wrote:
> 
>         Do subsequent snapshots show the empty files or is it just the
>         live filesystem?
> 
>         Does fmadm faulty or fmdump or iostat -En show any errors? 
>         Typically if something’s wrong with zfs, it tends to complain
>         pretty loudly about it — scrubs will show errors, or you’ll get
>         a kernel panic or a hang.  
>          
>         It almost looks like the FREESP fcntl has been run on the files
>         — that or ftruncate() — though I think ftruncate() would reset
>         the size to ‘0’.  However I would think either should bump the
>         mtime on the files on the dataset (though something could come
>         in and change the atime and mtime).  Has anything changed on the
>         client side?  ISTR you can send a FREESP request via NFS; if
>         these empty files are occurring frequently enough, it might be
>         worth trying to capture that traffic just to make sure it's not
>         a client going haywire.
> 
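>         For reference, a minimal sketch (not from the original report,
>         and the path is hypothetical) of what such an F_FREESP call
>         looks like in C on illumos; unlike ftruncate(), it leaves
>         st_size alone but makes the freed range read back as zeros:
>
>         /*
>          * Sketch: punch a hole over an entire file with the F_FREESP
>          * fcntl.  The path is a placeholder.
>          */
>         #include <fcntl.h>
>         #include <stdio.h>
>         #include <unistd.h>
>
>         int
>         main(void)
>         {
>                 int fd = open("/tmp/example-file", O_WRONLY);
>                 if (fd < 0) {
>                         perror("open");
>                         return (1);
>                 }
>
>                 struct flock fl = { 0 };
>                 fl.l_whence = SEEK_SET; /* offsets from start of file */
>                 fl.l_start = 0;         /* free from offset 0 ...     */
>                 fl.l_len = 0;           /* ... through end of file    */
>
>                 /*
>                  * This leaves st_size unchanged, but the freed range
>                  * reads back as zeros, matching a full-size file
>                  * whose contents are all NULs.
>                  */
>                 if (fcntl(fd, F_FREESP, &fl) == -1)
>                         perror("fcntl(F_FREESP)");
>
>                 (void) close(fd);
>                 return (0);
>         }
>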
>         It does seem strange that the preferred I/O block size has
>         changed from 128k to 274.5k (281088 bytes) between the live
>         system and the snapshot.  I'm assuming the preferred I/O block
>         size reported by stat(2) is the dataset block size, but I'd
>         need to confirm that.
> 
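>         As a quick way to check that assumption, here is a rough sketch
>         (nothing in it is specific to the affected systems) that prints
>         what stat(2) reports for a path, so the live file and its
>         snapshot copy can be compared:
>
>         #include <stdio.h>
>         #include <sys/stat.h>
>
>         int
>         main(int argc, char **argv)
>         {
>                 struct stat st;
>
>                 /* Path comes from the command line. */
>                 if (argc < 2 || stat(argv[1], &st) != 0) {
>                         perror("stat");
>                         return (1);
>                 }
>
>                 /* st_blksize is the "preferred I/O block size". */
>                 (void) printf("%s: size=%lld blocks=%lld blksize=%ld\n",
>                     argv[1], (long long)st.st_size,
>                     (long long)st.st_blocks, (long)st.st_blksize);
>                 return (0);
>         }
>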
>         It seems like these files are archival in nature (?) — i.e. at a
>         certain point they’re no longer being updated and are just being
>         stored.  While it wouldn’t solve the problem, ZFS does support
>         the immutable attribute (/usr/bin/chmod S+i file).  You could
>         try setting that on the file, which should prevent any changes
>         until the attribute is removed.  There is also the 'appendonly'
>         attribute, which makes the file append only (/usr/bin/chmod S+a
>         file).  If it is a badly behaving application somewhere (or even
>         an NFSv4 bug), it’s unlikely to support altering those
>         attributes itself — so it might make the culprit stand out (by
>         throwing read-only errors back to it).
> 
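>         If you try that, a quick sanity check (a sketch with a
>         hypothetical path, assuming chmod S+i has already been applied)
>         is that opening the file for writing should start failing,
>         e.g. with EPERM, which is the same kind of error a misbehaving
>         writer would suddenly see:
>
>         #include <errno.h>
>         #include <fcntl.h>
>         #include <stdio.h>
>         #include <string.h>
>         #include <unistd.h>
>
>         int
>         main(void)
>         {
>                 /* Hypothetical path to a file marked immutable. */
>                 int fd = open("/tank/archive/example-file", O_WRONLY);
>
>                 if (fd == -1) {
>                         (void) printf("open for write failed: %s\n",
>                             strerror(errno));
>                 } else {
>                         (void) printf("file is still writable\n");
>                         (void) close(fd);
>                 }
>                 return (0);
>         }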
> 
>         From: Schweiss, Chip <[email protected]>
>         Reply: illumos-discuss <[email protected]>
>         Date: January 10, 2020 at 1:09:18 PM
>         To: [email protected]
>         Cc: [email protected], [email protected], [email protected]
>         Subject: [discuss] ZFS corruption: Null filled files
> 
>             I have 7 OmniOSce r151022 systems that are experiencing
>             corruption on thousands of files.   The files' content is
>             replaced with all NULLs.
> 
>               * Files are being corrupted at rest.  No change is
>                 happening to any file stats when this happens.
>               * Some files are still correct on the replicated copy. 
>                  Replication is run every 10 minutes with incremental
>                 zfs send/receive.
>               * Most files have a clean version in the snapshot
>                 history.  The date of the corruption is almost always
>                 after the latest date in the file stats.
>               * All file work happens through NFS v4 from CentOS 6 & 7
>                 clients
>               * The paths are consolidated into one path via NFS
>                 referral links
>               * I have not run scrub on the pools yet.   They are all
>                 large and being hit hard as we hunt down all files with
>                 this problem.   I have a resilver running on one pool.
> 
>             Here's an analysis of one file.  I chose a log file since
>             the content is human-readable:
> 
>             The system with the primary pool:
>             Null filled:
> 
>             # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
> 
>             # stat /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>               File: /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>               Size: 281070          Blocks: 1          IO Block: 131072   regular file
>             Device: 4350019h/70582297d      Inode: 1043386     Links: 1
>             Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>             Access: 2019-10-29 07:29:18.399569973 -0500
>             Modify: 2019-10-29 18:19:45.006930486 -0500
>             Change: 2019-10-30 03:32:03.353868416 -0500
>              Birth: 2019-10-29 07:29:18.399569973 -0500
> 
>             Found clean in the 2019-11-07 snapshot.  This is the newest
>             snapshot in which this file is correct.
>             # head /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
> 
>             # stat /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>               File: /HCPintraDB00/.zfs/snapshot/daily_2019-11-07_00:00-0600/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>               Size: 281070          Blocks: 35         IO Block: 281088   regular file
>             Device: 435011fh/70582559d      Inode: 1043386     Links: 1
>             Access: (0644/-rw-r--r--)  Uid: (  603/ UNKNOWN)   Gid: (60023/ UNKNOWN)
>             Access: 2019-10-29 07:29:18.399569973 -0500
>             Modify: 2019-10-29 18:19:45.006930486 -0500
>             Change: 2019-10-30 03:32:03.353868416 -0500
>              Birth: 2019-10-29 07:29:18.399569973 -0500
> 
>             On the replicated pool:
> 
>             # head /HCPintraDB00/xnat_home_hcpi-shadow17.nrg.wustl.edu/logs/xsync.log.2019-10-29
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow3.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow1.nrg.mir
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow1.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for https://intradb-shadow3.humanconnectome.org
>             2019-10-29 07:29:18 INFO  DefaultXsyncAliasRefresher:76 -
>             Refreshing Alias for http://intradb-shadow2.nrg.mir:8080
> 
>             Can anyone help identify this ZFS bug?
> 
>             Tobi Oetiker has suggested turning on zfs debug messages
>             with 'mdb -ke ::zfs_dbgmsg'.  These are heavily loaded
>             systems; can this cause performance problems?
> 
>             Any help appreciated.
> 
>             -Chip
> 
