Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20) [[UPDATE w/more tests]]

2019-04-29 Thread Warner Losh
On Sun, Apr 28, 2019 at 4:03 PM Karl Denninger  wrote:

> On 4/20/2019 15:56, Steven Hartland wrote:
> > Thanks for the extra info; the next question would be: have you eliminated
> > the possibility that corruption exists before the disk is removed?
> >
> > Would be interesting to add a zpool scrub to confirm this isn't the
> > case before the disk removal is attempted.
> >
> > Regards
> > Steve
> >
> > On 20/04/2019 18:35, Karl Denninger wrote:
> >>
> >> On 4/20/2019 10:50, Steven Hartland wrote:
> >>> Have you eliminated geli as a possible source?
> >> No; I could conceivably do so by re-creating another backup volume
> >> set without geli-encrypting the drives, but I do not have an extra
> >> set of drives of the capacity required laying around to do that.  I
> >> would have to do it with lower-capacity disks, which I can attempt if
> >> you think it would help.  I *do* have open slots in the drive
> >> backplane to set up a second "test" unit of this sort.  For reasons
> >> below it will take at least a couple of weeks to get good data on
> >> whether the problem exists without geli, however.
> >>
> Ok, following up on this with more data
>
> First step taken was to create a *second* backup pool (I have the
> backplane slots open, fortunately) with three different disks but *no
> encryption.*
>
> I ran both side-by-side for several days, with the *unencrypted* one
> operating with one disk detached and offline (pulled physically) just as
> I do normally.  Then I swapped the two using the same paradigm.
>
> The difference was *dramatic* -- the resilver did *not* scan the entire
> disk; it only copied the changed blocks and was finished FAST.  A
> subsequent scrub came up 100% clean.
>
> Next I put THOSE disks in the vault (so as to make sure I didn't get
> hosed if something went wrong) and re-initialized the pool in question,
> leaving only the "geli" alone (in other words I zpool destroy'd the pool
> and then re-created it with all three disks connected and
> geli-attached.)  The purpose for doing this was to eliminate the
> possibility of old corruption somewhere, or some sort of problem with
> multiple, spanning years, in-place "zpool upgrade" commands.  Then I ran
> a base backup to initialize all three volumes, detached one and yanked
> it out of the backplane, as would be the usual, leaving the other two
> online and operating.
>
> I ran backups as usual for most of last week after doing this, with the
> 61.eli and 62-1.eli volumes online, and 62-2 physically out of the
> backplane.
>
> Today I swapped them again as I usually do (e.g. offline 62-1, geli
> detach, camcontrol standby and then yank it -- then insert the 62-2
> volume, geli attach and zpool online) and this is happening:
>
> [\u@NewFS /home/karl]# zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices is currently being resilvered.  The pool will
> continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Sun Apr 28 12:57:47 2019
> 2.48T scanned at 202M/s, 1.89T issued at 154M/s, 3.27T total
> 1.89T resilvered, 57.70% done, 0 days 02:37:14 to go
> config:
>
> NAME                      STATE     READ WRITE CKSUM
> backup                    DEGRADED     0     0     0
>   mirror-0                DEGRADED     0     0     0
>     gpt/backup61.eli      ONLINE       0     0     0
>     11295390187305954877  OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
>     gpt/backup62-2.eli    ONLINE       0     0     0
>
> errors: No known data errors
>
> The "3.27T" number is accurate (by "zpool list") for the space in use.
>
> There is not a snowball's chance in Hades that anywhere near 1.89T of
> that data (thus far, and it ain't done as you can see!) was modified
> between when all three disks were online and when the 62-2.eli volume
> was swapped back in for 62-1.eli.  No possible way.  Maybe some
> 100-200Gb of data has been touched across the backed-up filesystems in
> the last three-ish days but there's just flat-out no way it's more than
> that; this would imply an entropy of well over 50% of the writeable data
> on this box in less than a week!  That's NOT possible.  Further it's not
> 100%; it shows 2.48T scanned but 1.89T actually written to the other drive.
>
> So something is definitely foooged here and it does appear that geli is
> involved in it.  Whatever is foooging zfs, the resilver process thinks it
> has to recopy MOST (but not all!) of the blocks in use, it appears, from
> the 61.eli volume to the 62-2.eli volume.
>
> The question is what would lead ZFS to think it has to do that -- it
> clearly DOES NOT as a *much* smaller percentage of the total TXG set on
> 61.eli was modified while 62-2.eli was offline and 62-1.eli was online.
>
> Again I note that on 11.1 and previous this resilver was a rapid
> operation; whatever was actually changed got copied 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20) [[UPDATE w/more tests]]

2019-04-28 Thread Karl Denninger
On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for the extra info; the next question would be: have you eliminated
> the possibility that corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
>     Regards
>     Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as a possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required laying around to do that.  I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>
Ok, following up on this with more data

First step taken was to create a *second* backup pool (I have the
backplane slots open, fortunately) with three different disks but *no
encryption.*

I ran both side-by-side for several days, with the *unencrypted* one
operating with one disk detached and offline (pulled physically) just as
I do normally.  Then I swapped the two using the same paradigm.

The difference was *dramatic* -- the resilver did *not* scan the entire
disk; it only copied the changed blocks and was finished FAST.  A
subsequent scrub came up 100% clean.

Next I put THOSE disks in the vault (so as to make sure I didn't get
hosed if something went wrong) and re-initialized the pool in question,
leaving only the "geli" alone (in other words I zpool destroy'd the pool
and then re-created it with all three disks connected and
geli-attached.)  The purpose for doing this was to eliminate the
possibility of old corruption somewhere, or some sort of problem with
multiple, spanning years, in-place "zpool upgrade" commands.  Then I ran
a base backup to initialize all three volumes, detached one and yanked
it out of the backplane, as would be the usual, leaving the other two
online and operating.

I ran backups as usual for most of last week after doing this, with the
61.eli and 62-1.eli volumes online, and 62-2 physically out of the
backplane.

Today I swapped them again as I usually do (e.g. offline 62-1, geli
detach, camcontrol standby and then yank it -- then insert the 62-2
volume, geli attach and zpool online) and this is happening:

[\u@NewFS /home/karl]# zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Apr 28 12:57:47 2019
    2.48T scanned at 202M/s, 1.89T issued at 154M/s, 3.27T total
    1.89T resilvered, 57.70% done, 0 days 02:37:14 to go
config:

    NAME                      STATE     READ WRITE CKSUM
    backup                    DEGRADED     0     0     0
      mirror-0                DEGRADED     0     0     0
        gpt/backup61.eli      ONLINE       0     0     0
        11295390187305954877  OFFLINE      0     0     0  was /dev/gpt/backup62-1.eli
        gpt/backup62-2.eli    ONLINE       0     0     0

errors: No known data errors

The "3.27T" number is accurate (by "zpool list") for the space in use.

There is not a snowball's chance in Hades that anywhere near 1.89T of
that data (thus far, and it ain't done as you can see!) was modified
between when all three disks were online and when the 62-2.eli volume
was swapped back in for 62-1.eli.  No possible way.  Maybe some
100-200Gb of data has been touched across the backed-up filesystems in
the last three-ish days but there's just flat-out no way it's more than
that; this would imply an entropy of well over 50% of the writeable data
on this box in less than a week!  That's NOT possible.  Further it's not
100%; it shows 2.48T scanned but 1.89T actually written to the other drive.

So something is definitely foooged here and it does appear that geli is
involved in it.  Whatever is foooging zfs, the resilver process thinks it
has to recopy MOST (but not all!) of the blocks in use, it appears, from
the 61.eli volume to the 62-2.eli volume.

The question is what would lead ZFS to think it has to do that -- it
clearly DOES NOT as a *much* smaller percentage of the total TXG set on
61.eli was modified while 62-2.eli was offline and 62-1.eli was online.

Again I note that on 11.1 and previous this resilver was a rapid
operation; whatever was actually changed got copied but the system never
copied *nearly everything* on a resilver, including data that had not
been changed at all, on a mirrored set.

Obviously on a Raidz volume you have to go through the entire data
structure because parity has to be recomputed 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger
No; I can, but of course that's another ~8 hour (overnight) delay
between swaps.

That's not a bad idea, however.
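
A minimal sketch of that pre-removal check, using the pool name "backup" from
this thread (the wait loop and timing are illustrative only):

    zpool scrub backup
    # block until "zpool status" no longer reports a scrub in progress
    while zpool status backup | grep -q "scrub in progress"; do sleep 300; done
    zpool status backup                        # expect "with 0 errors" before proceeding
    zpool offline backup gpt/backup62-2.eli    # only then start the swap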

On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for the extra info; the next question would be: have you eliminated
> the possibility that corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
>     Regards
>     Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as a possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required laying around to do that. I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>>
>>> I've just setup an old server which has a LSI 2008 running and old
>>> FW (11.0) so was going to have a go at reproducing this.
>>>
>>> Apart from the disconnect steps below is there anything else needed
>>> e.g. read / write workload during disconnect?
>>
>> Yes.  An attempt to recreate this on my sandbox machine using smaller
>> disks (WD RE-320s) and a decent amount of read/write activity (tens
>> to ~100 gigabytes) on a root mirror of three disks with one taken
>> offline did not succeed.  It *reliably* appears, however, on my
>> backup volumes with every drive swap. The sandbox machine is
>> physically identical other than the physical disks; both are Xeons
>> with ECC RAM in them.
>>
>> The only operational difference is that the backup volume sets have a
>> *lot* of data written to them via zfs send|zfs recv over the
>> intervening period where with "ordinary" activity from I/O (which was
>> the case on my sandbox) the I/O pattern is materially different.  The
>> root pool on the sandbox where I tried to reproduce it synthetically
>> *is* using geli (in fact it boots native-encrypted.)
>>
>> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is
>> a ~6-8 hour process.
>>
>> The usual process for the backup pool looks like this:
>>
>> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>>
>> Over the space of a week, the backup script is run daily.  It first
>> imports the pool and then for each zfs filesystem it is backing up
>> (which is not all of them; I have a few volatile ones that I don't
>> care if I lose, such as object directories for builds and such, plus
>> some that are R/O data sets that are backed up separately) it does:
>>
>> If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send
>> -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>>
>> else
>>
>> zfs rename -r ...@zfs-base ...@zfs-old
>> zfs snapshot -r ...@zfs-base
>>
>> zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP
>>
>>  if ok then zfs destroy -vr ...@zfs-old otherwise print a
>> complaint and stop.
>>
>> When all are complete it then does a "zpool export backup" to detach
>> the pool in order to reduce the risk of "stupid root user" (me)
>> accidents.
>>
>> In short I send an incremental of the changes since the last backup,
>> which in many cases includes a bunch of automatic snapshots that are
>> taken on frequent basis out of the cron. Typically there are a week's
>> worth of these that accumulate between swaps of the disk to the
>> vault, and the offline'd disk remains that way for a week.  I also
>> wait for the zpool destroy on each of the targets to drain before
>> continuing, as not doing so back in the 9 and 10.x days was a good
>> way to stimulate an instant panic on re-import the next day due to
>> kernel stack page exhaustion if the previous operation destroyed
>> hundreds of gigabytes of snapshots (which does routinely happen as
>> part of the backed up data is Macrium images from PCs, so when a new
>> month comes around the PC's backup routine removes a huge amount of
>> old data from the filesystem.)
>>
>> Trying to simulate the checksum errors in a few hours' time thus far
>> has failed.  But every time I swap the disks on a weekly basis I get
>> a handful of checksum errors on the scrub. If I export and re-import
>> the backup mirror after that the counters are zeroed -- the checksum
>> error count does *not* remain across an export/import cycle although
>> the "scrub repaired" line remains.
>>
>> For example after the scrub completed this morning I exported the
>> pool (the script expects the pool exported before it begins) and ran
>> the backup.  When it was complete:
>>
>> root@NewFS:~/backup-zfs # zpool status backup
>>   pool: backup
>>  state: DEGRADED
>> status: One or more devices has been taken offline by the administrator.
>>     

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Steven Hartland
Thanks for the extra info; the next question would be: have you eliminated 
the possibility that corruption exists before the disk is removed?


Would be interesting to add a zpool scrub to confirm this isn't the case 
before the disk removal is attempted.


    Regards
    Steve

On 20/04/2019 18:35, Karl Denninger wrote:


On 4/20/2019 10:50, Steven Hartland wrote:

Have you eliminated geli as a possible source?
No; I could conceivably do so by re-creating another backup volume set 
without geli-encrypting the drives, but I do not have an extra set of 
drives of the capacity required laying around to do that. I would have 
to do it with lower-capacity disks, which I can attempt if you think 
it would help.  I *do* have open slots in the drive backplane to set 
up a second "test" unit of this sort.  For reasons below it will take 
at least a couple of weeks to get good data on whether the problem 
exists without geli, however.


I've just setup an old server which has a LSI 2008 running and old FW 
(11.0) so was going to have a go at reproducing this.


Apart from the disconnect steps below is there anything else needed 
e.g. read / write workload during disconnect?


Yes.  An attempt to recreate this on my sandbox machine using smaller 
disks (WD RE-320s) and a decent amount of read/write activity (tens to 
~100 gigabytes) on a root mirror of three disks with one taken offline 
did not succeed.  It *reliably* appears, however, on my backup volumes 
with every drive swap. The sandbox machine is physically identical 
other than the physical disks; both are Xeons with ECC RAM in them.


The only operational difference is that the backup volume sets have a 
*lot* of data written to them via zfs send|zfs recv over the 
intervening period where with "ordinary" activity from I/O (which was 
the case on my sandbox) the I/O pattern is materially different.  The 
root pool on the sandbox where I tried to reproduce it synthetically 
*is* using geli (in fact it boots native-encrypted.)


The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is 
a ~6-8 hour process.


The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first 
imports the pool and then for each zfs filesystem it is backing up 
(which is not all of them; I have a few volatile ones that I don't 
care if I lose, such as object directories for builds and such, plus 
some that are R/O data sets that are backed up separately) it does:


If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send 
-R ...@zfs-base | zfs receive -Fuvd $BACKUP


else

zfs rename -r ...@zfs-base ...@zfs-old
zfs snapshot -r ...@zfs-base

zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP

 if ok then zfs destroy -vr ...@zfs-old otherwise print a 
complaint and stop.


When all are complete it then does a "zpool export backup" to detach 
the pool in order to reduce the risk of "stupid root user" (me) accidents.


In short I send an incremental of the changes since the last backup, 
which in many cases includes a bunch of automatic snapshots that are 
taken on frequent basis out of the cron. Typically there are a week's 
worth of these that accumulate between swaps of the disk to the vault, 
and the offline'd disk remains that way for a week.  I also wait for 
the zpool destroy on each of the targets to drain before continuing, 
as not doing so back in the 9 and 10.x days was a good way to 
stimulate an instant panic on re-import the next day due to kernel 
stack page exhaustion if the previous operation destroyed hundreds of 
gigabytes of snapshots (which does routinely happen as part of the 
backed up data is Macrium images from PCs, so when a new month comes 
around the PC's backup routine removes a huge amount of old data from 
the filesystem.)


Trying to simulate the checksum errors in a few hours' time thus far 
has failed.  But every time I swap the disks on a weekly basis I get a 
handful of checksum errors on the scrub. If I export and re-import the 
backup mirror after that the counters are zeroed -- the checksum error 
count does *not* remain across an export/import cycle although the 
"scrub repaired" line remains.


For example after the scrub completed this morning I exported the pool 
(the script expects the pool exported before it begins) and ran the 
backup.  When it was complete:


root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning 
in a

    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat 
Apr 20 08:45:09 2019

config:

    NAME  STATE READ WRITE CKSUM
    backup 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger

On 4/20/2019 10:50, Steven Hartland wrote:
> Have you eliminated geli as a possible source?
No; I could conceivably do so by re-creating another backup volume set
without geli-encrypting the drives, but I do not have an extra set of
drives of the capacity required laying around to do that.  I would have
to do it with lower-capacity disks, which I can attempt if you think it
would help.  I *do* have open slots in the drive backplane to set up a
second "test" unit of this sort.  For reasons below it will take at
least a couple of weeks to get good data on whether the problem exists
without geli, however.
>
> I've just setup an old server which has a LSI 2008 running and old FW
> (11.0) so was going to have a go at reproducing this.
>
> Apart from the disconnect steps below is there anything else needed
> e.g. read / write workload during disconnect?

Yes.  An attempt to recreate this on my sandbox machine using smaller
disks (WD RE-320s) and a decent amount of read/write activity (tens to
~100 gigabytes) on a root mirror of three disks with one taken offline
did not succeed.  It *reliably* appears, however, on my backup volumes
with every drive swap.  The sandbox machine is physically identical
other than the physical disks; both are Xeons with ECC RAM in them.

The only operational difference is that the backup volume sets have a
*lot* of data written to them via zfs send|zfs recv over the intervening
period where with "ordinary" activity from I/O (which was the case on my
sandbox) the I/O pattern is materially different.  The root pool on the
sandbox where I tried to reproduce it synthetically *is* using geli (in
fact it boots native-encrypted.)

The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is a
~6-8 hour process.

The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first
imports the pool and then for each zfs filesystem it is backing up
(which is not all of them; I have a few volatile ones that I don't care
if I lose, such as object directories for builds and such, plus some
that are R/O data sets that are backed up separately) it does:

If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send -R
...@zfs-base | zfs receive -Fuvd $BACKUP

else

zfs rename -r ...@zfs-base ...@zfs-old
zfs snapshot -r ...@zfs-base

zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP

 if ok then zfs destroy -vr ...@zfs-old otherwise print a complaint
and stop.

When all are complete it then does a "zpool export backup" to detach the
pool in order to reduce the risk of "stupid root user" (me) accidents.
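
The per-filesystem logic described above boils down to roughly the following
sketch (the dataset list, the $BACKUP variable and the error handling are
placeholders, not the actual script):

    #!/bin/sh
    # Sketch only: dataset names are hypothetical.
    BACKUP=backup
    for fs in zroot/home zroot/var/db; do
        if ! zfs list -t snapshot "${fs}@zfs-base" > /dev/null 2>&1; then
            # first run: full replication stream
            zfs snapshot -r "${fs}@zfs-base"
            zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "$BACKUP"
        else
            # later runs: incremental between the rolling snapshots
            zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
            zfs snapshot -r "${fs}@zfs-base"
            if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | zfs receive -Fudv "$BACKUP"; then
                zfs destroy -vr "${fs}@zfs-old"
            else
                echo "send/receive failed for ${fs}; keeping @zfs-old" >&2
                exit 1
            fi
        fi
    done
    zpool export backup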

In short I send an incremental of the changes since the last backup,
which in many cases includes a bunch of automatic snapshots that are
taken on frequent basis out of the cron.  Typically there are a week's
worth of these that accumulate between swaps of the disk to the vault,
and the offline'd disk remains that way for a week.  I also wait for the
zpool destroy on each of the targets to drain before continuing, as not
doing so back in the 9 and 10.x days was a good way to stimulate an
instant panic on re-import the next day due to kernel stack page
exhaustion if the previous operation destroyed hundreds of gigabytes of
snapshots (which does routinely happen as part of the backed up data is
Macrium images from PCs, so when a new month comes around the PC's
backup routine removes a huge amount of old data from the filesystem.)

Trying to simulate the checksum errors in a few hours' time thus far has
failed.  But every time I swap the disks on a weekly basis I get a
handful of checksum errors on the scrub.  If I export and re-import the
backup mirror after that the counters are zeroed -- the checksum error
count does *not* remain across an export/import cycle although the
"scrub repaired" line remains.

For example after the scrub completed this morning I exported the pool
(the script expects the pool exported before it begins) and ran the
backup.  When it was complete:

root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr
20 08:45:09 2019
config:

    NAME                      STATE     READ WRITE CKSUM
    backup                    DEGRADED     0     0     0
      mirror-0                DEGRADED     0     0     0
        gpt/backup61.eli      ONLINE       0     0     0
        gpt/backup62-1.eli    ONLINE       0     0     0
        13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

It knows it 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Steven Hartland

Have you eliminated geli as a possible source?

I've just setup an old server which has a LSI 2008 running and old FW 
(11.0) so was going to have a go at reproducing this.


Apart from the disconnect steps below is there anything else needed e.g. 
read / write workload during disconnect?


mps0:  port 0xe000-0xe0ff mem 
0xfaf3c000-0xfaf3,0xfaf4-0xfaf7 irq 26 at device 0.0 on pci3

mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 
185c


    Regards
    Steve

On 20/04/2019 15:39, Karl Denninger wrote:

I can confirm that 20.00.07.00 does *not* stop this.
The previous write/scrub on this device was on 20.00.07.00.  It was
swapped back in from the vault yesterday, resilvered without incident,
but a scrub says

root@NewFS:/home/karl # zpool status backup
   pool: backup
  state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
     attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
     using 'zpool clear' or replace the device with 'zpool replace'.
    see: http://illumos.org/msg/ZFS-8000-9P
   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr
20 08:45:09 2019
config:

     NAME                      STATE     READ WRITE CKSUM
     backup                    DEGRADED     0     0     0
       mirror-0                DEGRADED     0     0     0
         gpt/backup61.eli      ONLINE       0     0     0
         gpt/backup62-1.eli    ONLINE       0     0    47
         13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.

Again, in my instance these devices are never removed "unsolicited" so
there can't be (or at least shouldn't be able to) unflushed data in the
device or kernel cache.  The procedure is and remains:

zpool offline .
geli detach .
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then of course on the other side after insertion and the kernel has
reported "finding" the device:

geli attach ...
zpool online 

Wait...
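
Spelled out with concrete names (the da4 device name is assumed; the GPT label
is the one used in this pool), the cycle is roughly:

    zpool offline backup gpt/backup62-2.eli
    geli detach gpt/backup62-2.eli
    camcontrol standby da4            # assumed device name; flushes cache and spins down
    # pull the disk; later, after re-insertion and the kernel reports the device:
    geli attach -k /path/to/keyfile gpt/backup62-2    # key/passphrase options as appropriate
    zpool online backup gpt/backup62-2.eli
    zpool status backup               # resilver starts automatically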

If this is a boogered TXG that's held in the metadata for the
"offline"'d device (maybe "off by one"?) that's potentially bad in that
if there is an unknown failure in the other mirror component the
resilver will complete but data has been irrevocably destroyed.

Granted, this is a very low probability scenario (the area where the bad
checksums are has to be where the corruption hits, and it has to happen
between the resilver and access to that data.)  Those are long odds but
nonetheless a window of "you're hosed" does appear to exist.





Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger

On 4/13/2019 06:00, Karl Denninger wrote:
> On 4/11/2019 13:57, Karl Denninger wrote:
>> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger  wrote:
>>>
>>>
 In this specific case the adapter in question is...

 mps0:  port 0xc000-0xc0ff mem
 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
 mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
 mps0: IOCCapabilities:
 1285c

 Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
 his drives via dumb on-MoBo direct SATA connections.

>>> Maybe I'm in good company.  My current setup has 8 of the disks connected
>>> to:
>>>
>>> mps0:  port 0xb000-0xb0ff mem
>>> 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 5a85c
>>>
>>> ... just with a cable that breaks out each of the 2 connectors into 4
>>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>>> cache/log) connected to ports on...
>>>
>>> - ahci0:  port
>>> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>>> 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
>>> - ahci2:  port
>>> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>>> 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
>>> - ahci3:  port
>>> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>>> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>>
>>> ... each drive connected to a single port.
>>>
>>> I can actually reproduce this at will.  Because I have 16 drives, when one
>>> fails, I need to find it.  I pull the sata cable for a drive, determine if
>>> it's the drive in question, if not, reconnect, "ONLINE" it and wait for
>>> resilver to stop... usually only a minute or two.
>>>
>>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>>> that a drive is part of the SAS controller or the SATA controllers... so
>>> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
>>> More often than not, a scrub will find a few problems.  In fact, it
>>> appears that the most recent scrub is an example:
>>>
>>> [1:7:306]dgilbert@vr:~> zpool status
>>>   pool: vr1
>>>  state: ONLINE
>>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03
>>> 2019
>>> config:
>>>
>>> NAME            STATE     READ WRITE CKSUM
>>> vr1 ONLINE   0 0 0
>>>   raidz2-0  ONLINE   0 0 0
>>> gpt/v1-d0   ONLINE   0 0 0
>>> gpt/v1-d1   ONLINE   0 0 0
>>> gpt/v1-d2   ONLINE   0 0 0
>>> gpt/v1-d3   ONLINE   0 0 0
>>> gpt/v1-d4   ONLINE   0 0 0
>>> gpt/v1-d5   ONLINE   0 0 0
>>> gpt/v1-d6   ONLINE   0 0 0
>>> gpt/v1-d7   ONLINE   0 0 0
>>>   raidz2-2  ONLINE   0 0 0
>>> gpt/v1-e0c  ONLINE   0 0 0
>>> gpt/v1-e1b  ONLINE   0 0 0
>>> gpt/v1-e2b  ONLINE   0 0 0
>>> gpt/v1-e3b  ONLINE   0 0 0
>>> gpt/v1-e4b  ONLINE   0 0 0
>>> gpt/v1-e5a  ONLINE   0 0 0
>>> gpt/v1-e6a  ONLINE   0 0 0
>>> gpt/v1-e7c  ONLINE   0 0 0
>>> logs
>>>   gpt/vr1log    ONLINE   0 0 0
>>> cache
>>>   gpt/vr1cache  ONLINE   0 0 0
>>>
>>> errors: No known data errors
>>>
>>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>>> drives that I had trial-removed (and not on the one replaced).
>>> ___
>> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
>> after a scrub, comes up with the checksum errors.  It does *not* flag
>> any errors during the resilver and the drives *not* taken offline do not
>> (ever) show checksum errors either.
>>
>> Interestingly enough you have 19.00.00.00 firmware on your card as well
>> -- which is what was on mine.
>>
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
> Verry interesting.
>
> This drive was last written/read under 19.00.00.00.  Yesterday I swapped
> it back in.  Note that right now I am running:
>
> mps0:  port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c
>
> And, after the scrub completed overnight
>
> [karl@NewFS ~]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-13 Thread Karl Denninger
On 4/11/2019 13:57, Karl Denninger wrote:
> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger  wrote:
>>
>>
>>> In this specific case the adapter in question is...
>>>
>>> mps0:  port 0xc000-0xc0ff mem
>>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 1285c
>>>
>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>> his drives via dumb on-MoBo direct SATA connections.
>>>
>> Maybe I'm in good company.  My current setup has 8 of the disks connected
>> to:
>>
>> mps0:  port 0xb000-0xb0ff mem
>> 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 5a85c
>>
>> ... just with a cable that breaks out each of the 2 connectors into 4
>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>> cache/log) connected to ports on...
>>
>> - ahci0:  port
>> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>> 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
>> - ahci2:  port
>> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>> 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
>> - ahci3:  port
>> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>
>> ... each drive connected to a single port.
>>
>> I can actually reproduce this at will.  Because I have 16 drives, when one
>> fails, I need to find it.  I pull the sata cable for a drive, determine if
>> it's the drive in question, if not, reconnect, "ONLINE" it and wait for
>> resilver to stop... usually only a minute or two.
>>
>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>> that a drive is part of the SAS controller or the SATA controllers... so
>> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
>> More often than not, a scrub will find a few problems.  In fact, it
>> appears that the most recent scrub is an example:
>>
>> [1:7:306]dgilbert@vr:~> zpool status
>>   pool: vr1
>>  state: ONLINE
>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03
>> 2019
>> config:
>>
>> NAME            STATE     READ WRITE CKSUM
>> vr1 ONLINE   0 0 0
>>   raidz2-0  ONLINE   0 0 0
>> gpt/v1-d0   ONLINE   0 0 0
>> gpt/v1-d1   ONLINE   0 0 0
>> gpt/v1-d2   ONLINE   0 0 0
>> gpt/v1-d3   ONLINE   0 0 0
>> gpt/v1-d4   ONLINE   0 0 0
>> gpt/v1-d5   ONLINE   0 0 0
>> gpt/v1-d6   ONLINE   0 0 0
>> gpt/v1-d7   ONLINE   0 0 0
>>   raidz2-2  ONLINE   0 0 0
>> gpt/v1-e0c  ONLINE   0 0 0
>> gpt/v1-e1b  ONLINE   0 0 0
>> gpt/v1-e2b  ONLINE   0 0 0
>> gpt/v1-e3b  ONLINE   0 0 0
>> gpt/v1-e4b  ONLINE   0 0 0
>> gpt/v1-e5a  ONLINE   0 0 0
>> gpt/v1-e6a  ONLINE   0 0 0
>> gpt/v1-e7c  ONLINE   0 0 0
>> logs
>>   gpt/vr1log    ONLINE   0 0 0
>> cache
>>   gpt/vr1cache  ONLINE   0 0 0
>>
>> errors: No known data errors
>>
>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>> drives that I had trial-removed (and not on the one replaced).
>> ___
> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
> after a scrub, comes up with the checksum errors.  It does *not* flag
> any errors during the resilver and the drives *not* taken offline do not
> (ever) show checksum errors either.
>
> Interestingly enough you have 19.00.00.00 firmware on your card as well
> -- which is what was on mine.
>
> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
> does it when I do the next swap of the backup set.

Verry interesting.

This drive was last written/read under 19.00.00.00.  Yesterday I swapped
it back in.  Note that right now I am running:

mps0:  port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c

And, after the scrub completed overnight

[karl@NewFS ~]$ zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-11 Thread Karl Denninger

On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger  wrote:
>
>
>> In this specific case the adapter in question is...
>>
>> mps0:  port 0xc000-0xc0ff mem
>> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>> mps0: IOCCapabilities:
>> 1285c
>>
>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>> his drives via dumb on-MoBo direct SATA connections.
>>
> Maybe I'm in good company.  My current setup has 8 of the disks connected
> to:
>
> mps0:  port 0xb000-0xb0ff mem
> 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 5a85c
>
> ... just with a cable that breaks out each of the 2 connectors into 4
> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
> cache/log) connected to ports on...
>
> - ahci0:  port
> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
> 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
> - ahci2:  port
> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
> 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
> - ahci3:  port
> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>
> ... each drive connected to a single port.
>
> I can actually reproduce this at will.  Because I have 16 drives, when one
> fails, I need to find it.  I pull the sata cable for a drive, determine if
> it's the drive in question, if not, reconnect, "ONLINE" it and wait for
> resilver to stop... usually only a minute or two.
>
> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
> that a drive is part of the SAS controller or the SATA controllers... so
> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
> More often than not, a scrub will find a few problems.  In fact, it
> appears that the most recent scrub is an example:
>
> [1:7:306]dgilbert@vr:~> zpool status
>   pool: vr1
>  state: ONLINE
>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03
> 2019
> config:
>
> NAME            STATE     READ WRITE CKSUM
> vr1 ONLINE   0 0 0
>   raidz2-0  ONLINE   0 0 0
> gpt/v1-d0   ONLINE   0 0 0
> gpt/v1-d1   ONLINE   0 0 0
> gpt/v1-d2   ONLINE   0 0 0
> gpt/v1-d3   ONLINE   0 0 0
> gpt/v1-d4   ONLINE   0 0 0
> gpt/v1-d5   ONLINE   0 0 0
> gpt/v1-d6   ONLINE   0 0 0
> gpt/v1-d7   ONLINE   0 0 0
>   raidz2-2  ONLINE   0 0 0
> gpt/v1-e0c  ONLINE   0 0 0
> gpt/v1-e1b  ONLINE   0 0 0
> gpt/v1-e2b  ONLINE   0 0 0
> gpt/v1-e3b  ONLINE   0 0 0
> gpt/v1-e4b  ONLINE   0 0 0
> gpt/v1-e5a  ONLINE   0 0 0
> gpt/v1-e6a  ONLINE   0 0 0
> gpt/v1-e7c  ONLINE   0 0 0
> logs
>   gpt/vr1log    ONLINE   0 0 0
> cache
>   gpt/vr1cache  ONLINE   0 0 0
>
> errors: No known data errors
>
> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
> drives that I had trial-removed (and not on the one replaced).
> ___

That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
after a scrub, comes up with the checksum errors.  It does *not* flag
any errors during the resilver and the drives *not* taken offline do not
(ever) show checksum errors either.

Interestingly enough you have 19.00.00.00 firmware on your card as well
-- which is what was on mine.

I have flashed my card forward to 20.00.07.00 -- we'll see if it still
does it when I do the next swap of the backup set.
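
For reference, the firmware/driver pair the running system sees can be pulled
straight from the boot messages quoted above, e.g.:

    dmesg | grep "mps0: Firmware"
    # recent mps(4) drivers may also expose it as sysctl dev.mps.0.firmware_version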

-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/




Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-11 Thread Zaphod Beeblebrox
On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger  wrote:


> In this specific case the adapter in question is...
>
> mps0:  port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c
>
> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
> his drives via dumb on-MoBo direct SATA connections.
>

Maybe I'm in good company.  My current setup has 8 of the disks connected
to:

mps0:  port 0xb000-0xb0ff mem
0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
5a85c

... just with a cable that breaks out each of the 2 connectors into 4
SATA-style connectors, and the other 8 disks (plus boot disks and SSD
cache/log) connected to ports on...

- ahci0:  port
0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
- ahci2:  port
0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
- ahci3:  port
0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0

... each drive connected to a single port.

I can actually reproduce this at will.  Because I have 16 drives, when one
fails, I need to find it.  I pull the sata cable for a drive, determine if
it's the drive in question, if not, reconnect, "ONLINE" it and wait for
resilver to stop... usually only a minute or two.

... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
that a drive is part of the SAS controller or the SATA controllers... so
I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
More often than not, a scrub will find a few problems.  In fact, it
appears that the most recent scrub is an example:

[1:7:306]dgilbert@vr:~> zpool status
  pool: vr1
 state: ONLINE
  scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03
2019
config:

NAME              STATE     READ WRITE CKSUM
vr1               ONLINE       0     0     0
  raidz2-0        ONLINE       0     0     0
    gpt/v1-d0     ONLINE       0     0     0
    gpt/v1-d1     ONLINE       0     0     0
    gpt/v1-d2     ONLINE       0     0     0
    gpt/v1-d3     ONLINE       0     0     0
    gpt/v1-d4     ONLINE       0     0     0
    gpt/v1-d5     ONLINE       0     0     0
    gpt/v1-d6     ONLINE       0     0     0
    gpt/v1-d7     ONLINE       0     0     0
  raidz2-2        ONLINE       0     0     0
    gpt/v1-e0c    ONLINE       0     0     0
    gpt/v1-e1b    ONLINE       0     0     0
    gpt/v1-e2b    ONLINE       0     0     0
    gpt/v1-e3b    ONLINE       0     0     0
    gpt/v1-e4b    ONLINE       0     0     0
    gpt/v1-e5a    ONLINE       0     0     0
    gpt/v1-e6a    ONLINE       0     0     0
    gpt/v1-e7c    ONLINE       0     0     0
logs
  gpt/vr1log      ONLINE       0     0     0
cache
  gpt/vr1cache    ONLINE       0     0     0

errors: No known data errors

... it doesn't say it now, but there were 5 CKSUM errors on one of the
drives that I had trial-removed (and not on the one replaced).
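
That trial-and-error hunt amounts to something like the following per candidate
(labels picked from the status output above purely for illustration):

    # wrong guess: reconnect the cable, then bring the device back
    zpool online vr1 gpt/v1-e3b
    zpool status vr1                  # wait out the short resilver
    # right guess: swap in the new disk and replace it
    zpool replace vr1 gpt/v1-e7c /dev/gpt/v1-e7c-new   # hypothetical new label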


Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-10 Thread Karl Denninger
On 4/10/2019 08:45, Andriy Gapon wrote:
> On 10/04/2019 04:09, Karl Denninger wrote:
>> Specifically, I *explicitly* OFFLINE the disk in question, which is a
>> controlled operation and *should* result in a cache flush out of the ZFS
>> code into the drive before it is OFFLINE'd.
>>
>> This should result in the "last written" TXG that the remaining online
>> members have, and the one in the offline member, being consistent.
>>
>> Then I "camcontrol standby" the involved drive, which forces a writeback
>> cache flush and a spindown; in other words, re-ordered or not, the
>> on-platter data *should* be consistent with what the system thinks
>> happened before I yank the physical device.
> This may not be enough for a specific [RAID] controller and a specific
> configuration.  It should be enough for a dumb HBA.  But, for example, 
> mrsas(9)
> can simply ignore the synchronize cache command (meaning neither the on-board
> cache is flushed nor the command is propagated to a disk).  So, if you use 
> some
> advanced controller it would make sense to use its own management tool to
> offline a disk before pulling it.
>
> I do not preclude a possibility of an issue in ZFS.  But it's not the only
> possibility either.

In this specific case the adapter in question is...

mps0:  port 0xc000-0xc0ff mem
0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities:
1285c

Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
his drives via dumb on-MoBo direct SATA connections.

What I don't know (yet) is if the update to firmware 20.00.07.00 in the
HBA has fixed it.  The 11.2 and 12.0 revs of FreeBSD through some
mechanism changed timing quite materially in the mps driver; prior to
11.2 I ran with a Lenovo SAS expander connected to SATA disks without
any problems at all, even across actual disk failures through the years,
but in 11.2 and 12.0 doing this resulted in spurious retries out of the
CAM layer that allegedly came from timeouts on individual units (which
looked very much like a lost command sent to the disk), but only on
mirrored volume sets -- yet there were no errors reported by the drive
itself, nor did either of my RaidZ2 pools (one spinning rust, one SSD)
experience problems of any sort.   Flashing the HBA forward to
20.00.07.00 with the expander in resulted in the  *driver* (mps) taking
disconnects and resets instead of the targets, which in turn caused
random drive fault events across all of the pools.  For obvious reasons
that got backed out *fast*.

Without the expander 19.00.00.00 has been stable over the last few
months *except* for this circumstance, where an intentionally OFFLINE'd
disk in a mirror that is brought back online after some reasonably long
period of time (days to a week) results in a successful resilver but
then a small number of checksum errors on that drive -- always on the
one that was OFFLINEd, never on the one(s) not taken OFFLINE -- appear
and are corrected when a scrub is subsequently performed.  I am now on
20.00.07.00 and so far -- no problems.  But I've yet to do the backup
disk swap on 20.00.07.00 (scheduled for late week or Monday) so I do not
know if the 20.00.07.00 roll-forward addresses the scrub issue or not. 
I have no reason to believe it is involved, but given the previous
"iffy" nature of 11.2 and 12.0 on 19.0 with the expander it very well
might be due to what appear to be timing changes in the driver architecture.

-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/




Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-10 Thread Andriy Gapon
On 10/04/2019 04:09, Karl Denninger wrote:
> Specifically, I *explicitly* OFFLINE the disk in question, which is a
> controlled operation and *should* result in a cache flush out of the ZFS
> code into the drive before it is OFFLINE'd.
> 
> This should result in the "last written" TXG that the remaining online
> members have, and the one in the offline member, being consistent.
> 
> Then I "camcontrol standby" the involved drive, which forces a writeback
> cache flush and a spindown; in other words, re-ordered or not, the
> on-platter data *should* be consistent with what the system thinks
> happened before I yank the physical device.

This may not be enough for a specific [RAID] controller and a specific
configuration.  It should be enough for a dumb HBA.  But, for example, mrsas(9)
can simply ignore the synchronize cache command (meaning neither the on-board
cache is flushed nor the command is propagated to a disk).  So, if you use some
advanced controller it would make sense to use its own management tool to
offline a disk before pulling it.
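
As a quick sanity check on a plain HBA (hypothetical device name), it is at
least possible to confirm whether the disk's own volatile write cache is
enabled, which is what makes an honored flush matter in the first place:

    camcontrol modepage da4 -m 8 | grep WCE            # SCSI caching page: WCE 1 = write cache on
    camcontrol identify da4 | grep -i "write cache"    # equivalent check for a SATA disk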

I do not preclude a possibility of an issue in ZFS.  But it's not the only
possibility either.

-- 
Andriy Gapon


Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-09 Thread Karl Denninger
On 4/9/2019 16:27, Zaphod Beeblebrox wrote:
> I have a "Ghetto" home RAID array.  It's built on compromises and makes use
> of RAID-Z2 to survive.  It consists of two plexes of 8x 4T units of
> "spinning rust".  It's been upgraded and upgraded.  It started as 8x 2T,
> then 8x 2T + 8x 4T then the current 16x 4T.  The first 8 disks are
> connected to motherboard SATA.  IIRC, there are 10.  Two ports are used for
> a mirror that it boots from.  There's also an SSD in there somehow, so it
> might be 12 ports on the motherboard.
>
> The other 8 disks started life in eSATA port multiplier boxes.  That was
> doubleplusungood, so I got a RAID card based on LSI pulled from a fujitsu
> server in Japan.  That's been upgraded a couple of times... not always a
> good experience.  One problem is that cheap or refurbished drives don't
> always "like" SAS controllers and FreeBSD.  YMMV.
>
> Anyways, this is all to introduce the fact that I've seen this behaviour
> multiple times. You have a drive that leaves the array for some amount of
> time, and after resilvering, a scrub will find a small amount of bad data.
> 32 k or 40k or somesuch.  In my cranial schema of things, I've chalked it
> up to out-of-order writing of the drives ... or other such behavior s.t.
> ZFS doesn't know exactly what has been written.  I've often wondered if the
> fix would be to add an amount of fuzz to the transaction range that is
> resilvered.
>
>
> On Tue, Apr 9, 2019 at 4:32 PM Karl Denninger  wrote:
>
>> On 4/9/2019 15:04, Andriy Gapon wrote:
>>> On 09/04/2019 22:01, Karl Denninger wrote:
 the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
 IN USE AREA was examined, compared, and blocks not on the "new member"
 or changed copied over.
>>> I think that that's not entirely correct.
>>> ZFS maintains something called DTL, a dirty-time log, for a missing /
>> offlined /
>>> removed device.  When the device re-appears and gets resilvered, ZFS
>> walks only
>>> those blocks that were born within the TXG range(s) when the device was
>> missing.
>>> In any case, I do not have an explanation for what you are seeing.
>> That implies something much more-serious could be wrong such as given
>> enough time -- a week, say -- that the DTL marker is incorrect and some
>> TXGs that were in fact changed since the OFFLINE are not walked through
>> and synchronized.  That would explain why it gets caught by a scrub --
>> the resilver is in fact not actually copying all the blocks that got
>> changed and so when you scrub the blocks are not identical.  Assuming
>> the detached disk is consistent that's not catastrophically bad IF
>> CAUGHT; where you'd get screwed HARD is in the situation where (for
>> example) you had a 2-unit mirror, detached one, re-attached it, resilver
>> says all is well, there is no scrub performed and then the
>> *non-detached* disk fails before there is a scrub.  In that case you
>> will have permanently destroyed or corrupted data since the other disk
>> is allegedly consistent but there are blocks *missing* that were never
>> copied over.
>>
>> Again this just showed up on 12.x; it definitely was *not* at issue in
>> 11.1 at all.  I never ran 11.2 in production for a material amount of
>> time (I went from 11.1 to 12.0 STABLE after the IPv6 fixes were posted
>> to 12.x) so I don't know if it is in play on 11.2 or not.
>>
>> I'll see if it shows up again with 20.00.07.00 card firmware.
>>
>> Of note I cannot reproduce this on my test box with EITHER 19.00.00.00
>> or 20.00.07.00 firmware when I set up a 3-unit mirror, offline one, make
>> a crap-ton of changes, offline the second and reattach the third (in
>> effect mirroring the "take one to the vault" thing) with a couple of
>> hours elapsed time and a synthetic (e.g. "dd if=/dev/random of=outfile
>> bs=1m" sort of thing) "make me some new data that has to be resilvered"
>> workload.  I don't know if that's because I need more entropy in the
>> filesystem than I can reasonably generate this way (e.g. more
>> fragmentation of files, etc) or whether it's a time-based issue (e.g.
>> something's wrong with the DTL/TXG thing as you note above in terms of
>> how it functions and it only happens if the time elapsed causes
>> something to be subject to a rollover or similar problem.)
>>
>> I spent quite a lot of time trying to reproduce the issue on my
>> "sandbox" machine and was unable -- and of note it is never a large
>> quantity of data that is impacted, it's usually only a couple of dozen
>> checksums that show as bad and fixed.  Of note it's also never just one;
>> if there was a single random hit on a data block due to ordinary bitrot
>> sort of issues I'd expect only one checksum to be bad.  But generating a
>> realistic synthetic workload over the amount of time involved on a
>> sandbox is not trivial at all; the system on which this is now happening
>> handles a lot of email and routine processing of various sorts including
>> a fair bit of 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-09 Thread Zaphod Beeblebrox
I have a "Ghetto" home RAID array.  It's built on compromises and makes use
of RAID-Z2 to survive.  It consists of two plexes of 8x 4T units of
"spinning rust".  It's been upgraded and upgraded.  It started as 8x 2T,
then 8x 2T + 8x 4T then the current 16x 4T.  The first 8 disks are
connected to motherboard SATA.  IIRC, there are 10.  Two ports are used for
a mirror that it boots from.  There's also an SSD in there somehow, so it
might be 12 ports on the motherboard.

The other 8 disks started life in eSATA port multiplier boxes.  That was
doubleplusungood, so I got a RAID card based on LSI pulled from a fujitsu
server in Japan.  That's been upgraded a couple of times... not always a
good experience.  One problem is that cheap or refurbished drives don't
always "like" SAS controllers and FreeBSD.  YMMV.

Anyways, this is all to introduce the fact that I've seen this behaviour
multiple times. You have a drive that leaves the array for some amount of
time, and after resilvering, a scrub will find a small amount of bad data.
32 k or 40k or somesuch.  In my cranial schema of things, I've chalked it
up to out-of-order writing of the drives ... or other such behavior s.t.
ZFS doesn't know exactly what has been written.  I've often wondered if the
fix would be to add an amount of fuzz to the transaction range that is
resilvered.


On Tue, Apr 9, 2019 at 4:32 PM Karl Denninger  wrote:

> On 4/9/2019 15:04, Andriy Gapon wrote:
> > On 09/04/2019 22:01, Karl Denninger wrote:
> >> the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
> >> IN USE AREA was examined, compared, and blocks not on the "new member"
> >> or changed copied over.
> > I think that that's not entirely correct.
> > ZFS maintains something called DTL, a dirty-time log, for a missing /
> offlined /
> > removed device.  When the device re-appears and gets resilvered, ZFS
> walks only
> > those blocks that were born within the TXG range(s) when the device was
> missing.
> >
> > In any case, I do not have an explanation for what you are seeing.
>
> That implies something much more-serious could be wrong such as given
> enough time -- a week, say -- that the DTL marker is incorrect and some
> TXGs that were in fact changed since the OFFLINE are not walked through
> and synchronized.  That would explain why it gets caught by a scrub --
> the resilver is in fact not actually copying all the blocks that got
> changed and so when you scrub the blocks are not identical.  Assuming
> the detached disk is consistent that's not catastrophically bad IF
> CAUGHT; where you'd get screwed HARD is in the situation where (for
> example) you had a 2-unit mirror, detached one, re-attached it, resilver
> says all is well, there is no scrub performed and then the
> *non-detached* disk fails before there is a scrub.  In that case you
> will have permanently destroyed or corrupted data since the other disk
> is allegedly consistent but there are blocks *missing* that were never
> copied over.
>
> Again this just showed up on 12.x; it definitely was *not* at issue in
> 11.1 at all.  I never ran 11.2 in production for a material amount of
> time (I went from 11.1 to 12.0 STABLE after the IPv6 fixes were posted
> to 12.x) so I don't know if it is in play on 11.2 or not.
>
> I'll see if it shows up again with 20.00.07.00 card firmware.
>
> Of note I cannot reproduce this on my test box with EITHER 19.00.00.00
> or 20.00.07.00 firmware when I set up a 3-unit mirror, offline one, make
> a crap-ton of changes, offline the second and reattach the third (in
> effect mirroring the "take one to the vault" thing) with a couple of
> hours elapsed time and a synthetic (e.g. "dd if=/dev/random of=outfile
> bs=1m" sort of thing) "make me some new data that has to be resilvered"
> workload.  I don't know if that's because I need more entropy in the
> filesystem than I can reasonably generate this way (e.g. more
> fragmentation of files, etc) or whether it's a time-based issue (e.g.
> something's wrong with the DTL/TXG thing as you note above in terms of
> how it functions and it only happens if the time elapsed causes
> something to be subject to a rollover or similar problem.)
>
> I spent quite a lot of time trying to reproduce the issue on my
> "sandbox" machine and was unable -- and of note it is never a large
> quantity of data that is impacted, it's usually only a couple of dozen
> checksums that show as bad and fixed.  Of note it's also never just one;
> if there was a single random hit on a data block due to ordinary bitrot
> sort of issues I'd expect only one checksum to be bad.  But generating a
> realistic synthetic workload over the amount of time involved on a
> sandbox is not trivial at all; the system on which this is now happening
> handles a lot of email and routine processing of various sorts including
> a fair bit of database activity associated with network monitoring and
> statistical analysis.
>
> I'm assuming that using "offline" as a means of doing this hasn't become
> "invalid" or somehow not considered "ok" -- it has certainly worked
> perfectly well for a very long time!

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-09 Thread Karl Denninger
On 4/9/2019 15:04, Andriy Gapon wrote:
> On 09/04/2019 22:01, Karl Denninger wrote:
>> the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
>> IN USE AREA was examined, compared, and blocks not on the "new member"
>> or changed copied over.
> I think that that's not entirely correct.
> ZFS maintains something called DTL, a dirty-time log, for a missing /
> offlined / removed device.  When the device re-appears and gets
> resilvered, ZFS walks only those blocks that were born within the TXG
> range(s) when the device was missing.
>
> In any case, I do not have an explanation for what you are seeing.

That implies something much more serious could be wrong: that given
enough time -- a week, say -- the DTL marker is incorrect and some
TXGs that were in fact changed since the OFFLINE are not walked through
and synchronized.  That would explain why it gets caught by a scrub --
the resilver is in fact not actually copying all the blocks that got
changed and so when you scrub the blocks are not identical.  Assuming
the detached disk is consistent that's not catastrophically bad IF
CAUGHT; where you'd get screwed HARD is in the situation where (for
example) you had a 2-unit mirror, detached one, re-attached it, resilver
says all is well, there is no scrub performed and then the
*non-detached* disk fails before there is a scrub.  In that case you
will have permanently destroyed or corrupted data since the other disk
is allegedly consistent but there are blocks *missing* that were never
copied over.
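
One crude way to sanity-check that would be to note the pool's current TXG
just before the OFFLINE and again right after the ONLINE -- zdb -u dumps the
active uberblock, which includes its txg -- and compare with what the
resilver actually walks.  A sketch only (pool name assumed; this is a
diagnostic idea, not an established procedure):

    # Before offlining the outgoing mirror member:
    zdb -u backup | grep txg     # call this T1
    # ...offline, swap drives at the vault, online, let it resilver...
    # After the replacement member is back online:
    zdb -u backup | grep txg     # call this T2
    # Everything born in (T1, T2] should be walked by the resilver; if a
    # scrub afterwards still finds checksum errors, something inside that
    # window was evidently skipped.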

Again this just showed up on 12.x; it definitely was *not* at issue in
11.1 at all.  I never ran 11.2 in production for a material amount of
time (I went from 11.1 to 12.0 STABLE after the IPv6 fixes were posted
to 12.x) so I don't know if it is in play on 11.2 or not.

I'll see if it shows up again with 20.00.07.00 card firmware.

Of note I cannot reproduce this on my test box with EITHER 19.00.00.00
or 20.00.07.00 firmware when I set up a 3-unit mirror, offline one, make
a crap-ton of changes, offline the second and reattach the third (in
effect mirroring the "take one to the vault" thing) with a couple of
hours elapsed time and a synthetic (e.g. "dd if=/dev/random of=outfile
bs=1m" sort of thing) "make me some new data that has to be resilvered"
workload.  I don't know if that's because I need more entropy in the
filesystem than I can reasonably generate this way (e.g. more
fragmentation of files, etc) or whether it's a time-based issue (e.g.
something's wrong with the DTL/TXG thing as you note above in terms of
how it functions and it only happens if the time elapsed causes
something to be subject to a rollover or similar problem.) 
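
For anyone who wants to poke at this on their own sandbox, the general shape
of the test, here reduced to small file-backed md(4) devices purely as a
sketch (device names, sizes and the pool name are arbitrary; the real tests
were against actual drives on the HBA):

    # Three small file-backed vdevs stand in for the mirror members.
    truncate -s 2g /tmp/m0 /tmp/m1 /tmp/m2
    d0=$(mdconfig -a -t vnode -f /tmp/m0)
    d1=$(mdconfig -a -t vnode -f /tmp/m1)
    d2=$(mdconfig -a -t vnode -f /tmp/m2)
    zpool create testmir mirror $d0 $d1 $d2
    # "Take one to the vault."
    zpool offline testmir $d2
    # Generate churn that will have to be resilvered.
    dd if=/dev/random of=/testmir/junk1 bs=1m count=256
    dd if=/dev/random of=/testmir/junk2 bs=1m count=256
    # Bring it back; the online kicks off the resilver.
    zpool online testmir $d2
    # Once zpool status shows the resilver done, scrub and check CKSUM.
    zpool scrub testmir
    zpool status -v testmir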

I spent quite a lot of time trying to reproduce the issue on my
"sandbox" machine and was unable -- and of note it is never a large
quantity of data that is impacted, it's usually only a couple of dozen
checksums that show as bad and fixed.  Of note it's also never just one;
if there was a single random hit on a data block due to ordinary bitrot
sort of issues I'd expect only one checksum to be bad.  But generating a
realistic synthetic workload over the amount of time involved on a
sandbox is not trivial at all; the system on which this is now happening
handles a lot of email and routine processing of various sorts including
a fair bit of database activity associated with network monitoring and
statistical analysis.

I'm assuming that using "offline" as a means of doing this hasn't become
"invalid" or somehow not considered "ok" -- it has certainly worked
perfectly well for a very long time!

-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/




Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-09 Thread Andriy Gapon
On 09/04/2019 22:01, Karl Denninger wrote:
> the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
> IN USE AREA was examined, compared, and blocks not on the "new member"
> or changed copied over.

I think that that's not entirely correct.
ZFS maintains something called DTL, a dirty-time log, for a missing / offlined /
removed device.  When the device re-appears and gets resilvered, ZFS walks only
those blocks that were born within the TXG range(s) when the device was missing.

In any case, I do not have an explanation for what you are seeing.

-- 
Andriy Gapon


Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-09 Thread Karl Denninger
I've run into something often -- and repeatably -- enough since updating
to 12-STABLE that I suspect there may be a code problem lurking in the
ZFS stack or in the driver and firmware compatibility with various HBAs
based on the LSI/Avago devices.

The scenario is this -- I have pools that are RaidZ2 and hold my
"normal" working set; one comprises SSD volumes and the other spinning
rust.  These are all healthy and scrubs never show
problems.  I've had physical failures with them over the years (although
none since moving to 12-STABLE as of yet) and have never had trouble
with resilvers or other misbehavior.

I also have a "backup" pool that is a 3-member mirror, to which the
volatile (that is, the zfs filesystems not set read-only) has zfs send's
done to.  Call them backup-i, backup-e1 and backup-e2.
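
The weekly refresh of that pool is ordinary incremental replication, roughly
along these lines (the dataset and snapshot names here are placeholders, not
the actual script):

    # Snapshot the volatile filesystems and send the week's delta.
    zfs snapshot -r pool/volatile@backup-20190409
    zfs send -R -I pool/volatile@backup-20190402 pool/volatile@backup-20190409 | \
        zfs receive -d -F backup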

All disks in these pools are geli-encrypted, with geli running on top of
a freebsd-zfs partition inside a GPT partition table and using -s 4096
(4k) geli "sectors".

Two of the backup mirror members are always in the machine; backup-i
(the base internal drive) is never removed.  The third is in a bank
vault.  Every week the vault drive is exchanged with the other, so that
the "first" member is never removed from the host, but the other two
(-e1 and -e2) alternate.  If the building burns I have a full copy of
all the volatile data in the vault.  (I also have mirrored copies, 2
each, of all the datasets that are operationally read-only in the vault
too; those get updated quarterly if there are changes to the
operationally read-only portion of the data store.)  The drive in the
vault is swapped weekly, so a problem should be detected almost
immediately before it can bugger me.

Before removing the disk intended to go to the vault I "offline" it and
spin it down (camcontrol standby), which issues a STANDBY IMMEDIATE to
the drive, ensuring that its cache is flushed and the spindle spun down,
and then pull it.  I go exchange them at the bank, insert the other one,
and "zpool online" it, which automatically resilvers it.

The disk resilvers and all is well -- no errors.

Or is it all ok?

If I run a scrub on the pool as soon as the resilver completes the disk
I just inserted will /invariably/ have a few checksum errors on it that
the scrub fixes.  It's not a large number, anywhere from a couple dozen
to a hundred or so, but it's not zero -- and it damn well should be as
the resilver JUST COMPLETED with no errors which means the ENTIRE DISK'S
IN USE AREA was examined, compared, and blocks not on the "new member"
or changed copied over.  The "-i" disk (the one that is never pulled)
NEVER is the one with the checksum errors on it -- it's ALWAYS the one I
just inserted and which was resilvered to.

If I zpool clear the errors and scrub again all is fine -- no errors. 
If I scrub again before pulling the disk the next time to do the swap
all is fine as well.  I swap the two, resilver, and I'll get a few more
errors on the next scrub, ALWAYS on the disk I just put in.
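
In command terms the per-swap cycle is roughly:

    zpool scrub backup          # kicked off as soon as the resilver finishes
    zpool status -v backup      # a few dozen CKSUM errors show on the new disk
    zpool clear backup          # reset the counters
    zpool scrub backup          # the second scrub comes back clean
    zpool status -v backup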

Smartctl shows NO errors on the disk.  No ECC, no reallocated sectors,
no interface errors, no resets, nothing.  Smartd is running and never
posts any real-time complaints, other than the expected one a minute or
two after I yank the drive to take it to the bank.  There are no
CAM-related errors printing on the console either.  So ZFS says there's
a *silent* data error (bad checksum; never a read or write error) in a
handful of blocks but the disk says there have been no errors, the
driver does not report any errors, there have been no power failures as
the disk was in a bank vault and thus it COULDN'T have had a write-back
cache corruption event or similar occur.
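
The checks behind that statement are the usual ones, something along these
lines (the device name is a placeholder; SATA disks behind a SAS HBA may also
want -d sat):

    # SMART health, the attributes of interest, and the drive's error log:
    smartctl -H /dev/da5
    smartctl -A /dev/da5 | egrep -i 'realloc|pending|uncorrect|crc'
    smartctl -l error /dev/da5
    # And nothing CAM-related in the kernel message buffer:
    dmesg | egrep -i '\(da5:|cam status'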

I never had trouble with this under 11.1 or before and have been using
this paradigm for something on the order of five years running on this
specific machine without incident.  Now I'm seeing it repeatedly and
*reliably* under 12.0-STABLE.  I swapped the first disk that did it,
thinking it was physically defective -- the replacement did it on the
next swap.  In fact I've yet to record a swap-out on 12-STABLE that
*hasn't* done this and yet it NEVER happened under 11.1.  At the same
time I can run scrubs until the cows come home on the multiple Raidz2
packs on the same controller and never get any checksum errors on any of
them.

The firmware in the card was 19.00.00.00 -- again, this firmware *has
been stable for years.* 

I have just rolled the firmware on the card forward to 20.00.07.00,
which is the "latest" available.  I had previously not moved to 20.x
because earlier versions had known issues (some severe and potentially
fatal to data integrity) and 19 had been working without problem -- I
thus had no reason to move to 20.00.07.00.

But there apparently are some fairly significant timing differences
between the driver code in 11.1 and 11.2/12.0, as I discovered when the
SAS expander I used to have in these boxes started returning timeout
errors that were false.  Again -- this same