Re: CURRENT: ZFS freezes system beyond reboot

2021-12-15 Thread Andriy Gapon

On 15/12/2021 19:55, FreeBSD User wrote:

It is spooky, if not to say "buggy", if ZFS is capable of freezing the whole
box even if the essential operating system stuff is isolated on a dedicated
UFS filesystem.


I do not think that this is the case.
Commands that do not access anything on ZFS or anything related to ZFS
should be unaffected.


--
Andriy Gapon



L2ARC: inexplicable disappearance of cache device (was: CURRENT: ZFS freezes system beyond reboot)

2021-12-15 Thread Graham Perrin

On 15/12/2021 17:55, FreeBSD User wrote:

… SSD, partitioned into two halves, one for ZIL, the other for
L2ARC. When showing "zpool status", the RAIDZ's HDDs (Hitachi/Seagate
4 TB NAS HDD) were "online", ZIL was "online" and
L2ARC device/vdev showed - nothing. …



The 'nothing' aspect reminds me of:

L2ARC: inexplicable disappearance, without removal, of cache device





Re: CURRENT: ZFS freezes system beyond reboot

2021-12-15 Thread Mark Millard via freebsd-current
From: FreeBSD User  wrote on
Date: Wed, 15 Dec 2021 18:55:09 +0100 :

> . . .
> 
> It is spooky, if not to say "buggy", if ZFS is capable of freezing the whole
> box even if the essential operating system stuff is isolated on a dedicated
> UFS filesystem.

I would guess that, with ZFS in use, everything related to,
for example, the ARC is "essential operating system stuff", given
its tie to wired memory usage and the like, which greatly changes
the wired memory usage pattern/sizing compared to a system where
ZFS is not involved (UFS only).

(I only use ZFS in a simpler context, however.)

===
Mark Millard
marklmi at yahoo.com




Re: CURRENT: ZFS freezes system beyond reboot

2021-12-15 Thread FreeBSD User
On Mon, 13 Dec 2021 09:30:50 +0200
Andriy Gapon  wrote:

> On 12/12/2021 18:45, Alan Somers wrote:
> > You need to look at what's causing those errors.  What kind of disks
> > are you using, with what HBA?  It's not surprising that any access to
> > ZFS hangs; that's what it's designed to do when a pool is suspended.  
> 
> However, a pool does not have to be suspended on errors.
> The failmode property provides a couple of alternatives:
>   wait      Blocks all I/O access until the device connectivity is
>             recovered and the errors are cleared.  This is the
>             default behavior.
> 
>   continue  Returns EIO to any new write I/O requests but allows
>             reads to any of the remaining healthy devices.  Any
>             write requests that have yet to be committed to disk
>             would be blocked.
> 
>   panic     Prints out a message to the console and generates a
>             system crash dump.
> 
> But none of them does any magic.
> The errors will still be there.
> 

Hello.

The error's cause was not obvious. I used an SSD, partitioned into two halves,
one for ZIL, the other for L2ARC. When showing "zpool status", the RAIDZ's HDDs
(Hitachi/Seagate 4 TB NAS HDD) were "online", ZIL was "online" and the L2ARC
device/vdev showed - nothing.

I had to power off/power on the box. For several hours nothing moved on; the
box was frozen, and any invocation of any ZFS volume related tool/command hung
the terminal/console.

Several datasets showed errors at <0x0>, nothing serious.

After deleting the ZIL and L2ARC extra SSD from the RAIDZ pool, everything went
back to normal again.

It is spooky, if not to say "buggy", if ZFS is capable of freezing the whole
box even if the essential operating system stuff is isolated on a dedicated UFS
filesystem.

Kind regards,

O. Hartmann
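
The removal of the log (ZIL) and cache (L2ARC) devices described above might
look roughly like the following sketch. "zpool status" and "zpool remove" are
standard commands; the partition names ada4p1 and ada4p2 are hypothetical
placeholders, since the thread does not name the actual devices. The "run"
helper and the DRY_RUN variable are conventions of this sketch only: with the
default DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only. POOL00 is the pool name from the thread; the device names
# below are hypothetical and must be replaced with the real ones.
POOL="${POOL:-POOL00}"
LOG_DEV="${LOG_DEV:-ada4p1}"       # hypothetical log (ZIL) partition
CACHE_DEV="${CACHE_DEV:-ada4p2}"   # hypothetical cache (L2ARC) partition
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands, anything else = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# List the pool layout to find the devices under "logs" and "cache".
run zpool status "$POOL"

# Remove the separate log device and the cache device from the pool.
run zpool remove "$POOL" "$LOG_DEV" "$CACHE_DEV"
```

Setting DRY_RUN=0 (as root) would execute the commands for real.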



Re: CURRENT: ZFS freezes system beyond reboot

2021-12-12 Thread Andriy Gapon

On 12/12/2021 18:45, Alan Somers wrote:

You need to look at what's causing those errors.  What kind of disks
are you using, with what HBA?  It's not surprising that any access to
ZFS hangs; that's what it's designed to do when a pool is suspended.


However, a pool does not have to be suspended on errors.
The failmode property provides a couple of alternatives:
  wait      Blocks all I/O access until the device connectivity is
            recovered and the errors are cleared.  This is the
            default behavior.

  continue  Returns EIO to any new write I/O requests but allows
            reads to any of the remaining healthy devices.  Any
            write requests that have yet to be committed to disk
            would be blocked.

  panic     Prints out a message to the console and generates a
            system crash dump.

But none of them does any magic.
The errors will still be there.

--
Andriy Gapon
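
As a minimal sketch of how the failmode alternatives listed above could be
inspected and changed: "zpool get" and "zpool set" are the standard property
commands, and POOL00 is the pool name from this thread. The "run" helper and
the DRY_RUN variable are conventions of this sketch only, not part of ZFS; with
the default DRY_RUN=1 the commands are printed rather than executed. As the
message above points out, changing the mode does not make the underlying
errors go away.

```shell
#!/bin/sh
# Sketch only. POOL00 is the pool name from the thread; adjust as needed.
POOL="${POOL:-POOL00}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the current failmode setting (wait, continue, or panic).
run zpool get failmode "$POOL"

# Switch to "continue": new writes get EIO instead of blocking.
run zpool set failmode=continue "$POOL"
```

Setting DRY_RUN=0 (as root) would apply the change for real.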



Re: CURRENT: ZFS freezes system beyond reboot

2021-12-12 Thread Alan Somers
On Sun, Dec 12, 2021 at 2:22 AM FreeBSD User  wrote:
>
> Running CURRENT (FreeBSD 14.0-CURRENT #52 main-n251260-156fbc64857: Thu
> Dec  2 14:45:55 CET 2021 amd64), all of a sudden the ZFS RAIDZ pool
> suffered from an error:
>
> Solaris: WARNING: Pool 'POOL00' has encountered an uncorrectable I/O
> failure and has been suspended.
>
> The system does not respond anymore on that pool, transactions to and
> from that pool are frozen, the system is 99.9% idle.
> The most "not so funny" part is: the box doesn't even recognize a
> "shutdown -r now" or a brute force "reboot". I can still log in via ssh,
> but any action regarding the ZFS pool freezes the console/terminal.
>
> ZFS very often renders the system unresponsive forever. How can this
> be mitigated? The system in question is on a remote site and the problem
> does not seem to be bound to CURRENT only; we noticed similar problems on
> 13-STABLE as well.
>
> What can I do to "unfreeze" the ZFS? The main OS is, luckily, on a
> UFS/FFS filesystem and so not affected by that problem.
>
> By the way, here are some more details, as far as I can pick those up:
>
> zpool clear POOL00
> cannot clear errors for POOL00: I/O error
>
> Whatever took out the ZFS pool (I cannot see any hardware errors, the
> pool is part of services and especially a poudriere build system and
> under heavy load all the time, the box has 16 GB RAM), it also renders
> the rest of the system unusable in a way which is beyond a "reboot".
>
> Kind regards,
> oh

You need to look at what's causing those errors.  What kind of disks
are you using, with what HBA?  It's not surprising that any access to
ZFS hangs; that's what it's designed to do when a pool is suspended.



CURRENT: ZFS freezes system beyond reboot

2021-12-12 Thread FreeBSD User
Running CURRENT (FreeBSD 14.0-CURRENT #52 main-n251260-156fbc64857: Thu
Dec  2 14:45:55 CET 2021 amd64), all of a sudden the ZFS RAIDZ pool
suffered from an error:

Solaris: WARNING: Pool 'POOL00' has encountered an uncorrectable I/O
failure and has been suspended.

The system does not respond anymore on that pool, transactions to and
from that pool are frozen, the system is 99.9% idle.
The most "not so funny" part is: the box doesn't even recognize a
"shutdown -r now" or a brute force "reboot". I can still log in via ssh,
but any action regarding the ZFS pool freezes the console/terminal.

ZFS very often renders the system unresponsive forever. How can this
be mitigated? The system in question is on a remote site and the problem
does not seem to be bound to CURRENT only; we noticed similar problems on
13-STABLE as well.

What can I do to "unfreeze" the ZFS? The main OS is, luckily, on a
UFS/FFS filesystem and so not affected by that problem.

By the way, here are some more details, as far as I can pick those up:

zpool clear POOL00
cannot clear errors for POOL00: I/O error

Whatever took out the ZFS pool (I cannot see any hardware errors, the
pool is part of services and especially a poudriere build system and
under heavy load all the time, the box has 16 GB RAM), it also renders
the rest of the system unusable in a way which is beyond a "reboot".

Kind regards,
oh