Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-08 Thread Borja Marcos

On Mar 5, 2013, at 11:09 PM, Jeremy Chadwick wrote:

 - Disks are GPT and are *partitioned, and ZFS refers to the partitions
  not the raw disk -- this matters (honest, it really does; the ZFS
  code handles things differently with raw disks)
 
 Not on FreeBSD as far I can see.
 
 My statement comes from here (first line in particular):
 
 http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html
 
 If this is wrong/false, then this furthers my point about kernel folks
 who are in-the-know needing to chime in and help stop the
 misinformation.  The rest of us are just end-users, often misinformed.

As far as I know, this is lore that surfaces periodically on the lists. It was
true in Solaris (at least in the past). But unless I'm terribly wrong, this
doesn't happen in FreeBSD. ZFS sees disks, and they can be a whole raw device
or a partition/slice, even a gnop device. No difference.

That's why I mentioned in freebsd-fs that we badly need an official doctrine, 
carefully curated, and written in holy letters ;)





Borja.



Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Karl Denninger

On 3/7/2013 1:21 AM, Peter Jeremy wrote:
 On 2013-Mar-04 16:48:18 -0600, Karl Denninger k...@denninger.net wrote:
 The subject machine in question has 12GB of RAM and dual Xeon
 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
 local cache and the BBU for it.  The ZFS spindles are all exported as
 JBOD drives.  I set up four disks under GPT, each with a single freebsd-zfs
 partition added and labeled; the providers are then geli-encrypted and added
 to the pool.
 What sort of disks?  SAS or SATA?
SATA.  They're clean; they report no errors, no retries, no corrected
data (ECC) etc.  They also have been running for a couple of years under
UFS+SU without problems.  This isn't new hardware; it's an in-service
system.
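
(For concreteness, the per-disk layering comes down to roughly the following;
the device names, labels and pool layout below are only illustrative:)

  gpart create -s gpt da1
  gpart add -t freebsd-zfs -l disk1 da1
  geli init /dev/gpt/disk1      # prompts for a passphrase
  geli attach /dev/gpt/disk1    # creates /dev/gpt/disk1.eli
  # ...repeat for the other three disks, then build the pool from the
  # .eli providers:
  zpool create tank raidz gpt/disk1.eli gpt/disk2.eli gpt/disk3.eli gpt/disk4.eli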

 also known good.  I began to get EXTENDED stalls with zero I/O going on,
 some lasting for 30 seconds or so.  The system was not frozen but
 anything that touched I/O would lock until it cleared.  Dedup is off,
 incidentally.
 When the system has stalled:
 - Do you see very low free memory?
Yes.  Effectively zero.
 - What happens to all the different CPU utilisation figures?  Do they
   all go to zero?  Do you get high system or interrupt CPU (including
   going to 1 core's worth)?
No, they start to fall.  This is a bad piece of data to trust though
because I am geli-encrypting the spindles, so falling CPU doesn't mean
the CPU is actually idle (since with no I/O there is nothing going
through geli.)  I'm working on instrumenting things sufficiently to try
to peel that off -- I suspect the kernel is spinning on something, but
the trick is finding out what it is.
 - What happens to interrupt load?  Do you see any disk controller
   interrupts?
None.

 Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
 and see if you get any errors when stalls happen.
If I have to.  That's easy to do on the test box -- on the production
one, not so much.
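
(For reference, that boils down to adding the debug options to the test box's
kernel config and rebuilding; the KERNCONF name here is just an example:)

  options WITNESS
  options WITNESS_SKIPSPIN    # skip the expensive checks on spin mutexes
  # then:
  #   cd /usr/src && make buildkernel KERNCONF=TESTDEBUG
  #   make installkernel KERNCONF=TESTDEBUG && shutdown -r now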
 On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick j...@koitsu.org wrote:
 On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
 Completely unrelated to the main thread:

 on 05/03/2013 07:32 Jeremy Chadwick said the following:
 That said, I still do not recommend ZFS for a root filesystem
 Why?
 Too long a history of problems with it and weird edge cases (keep
 reading); the last thing an administrator wants to deal with is a system
 where the root filesystem won't mount/can't be used.  It makes
 recovery or problem-solving (i.e. the server is not physically accessible
 given geographic distances) very difficult.
 I've had lots of problems with a gmirrored UFS root as well.  The
 biggest issue is that gmirror has no audit functionality so you
 can't verify that both sides of a mirror really do have the same data.
I have root on a 2-drive RAID mirror (done in the controller) and that
has been fine.  The controller does scrubs on a regular basis
internally.  The problem is that if it gets a clean read that is
different (e.g. no ECC indications, etc) it doesn't know which is the
correct copy.  The good news is that hasn't happened yet :-)

The risk of this happening as my data store continues to expand is one
of the reasons I want to move toward ZFS, but not necessarily for the
boot drives.  For the data store, however

 My point/opinion: UFS for a root filesystem is guaranteed to work
 without any fiddling about and, barring drive failures or controller
 issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.
 AFAIK, you can't boot from anything other than a single disk (ie no
 graid).
Where I am right now is this:

1. I *CANNOT* reproduce the spins on the test machine with Postgres
stopped in any way.  Even with multiple ZFS send/recv copies going on
and the load average north of 20 (due to all the geli threads), the
system doesn't stall or produce any notable pauses in throughput.  Nor
does the system RAM allocation get driven hard enough to force paging. 

This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine), allocating ~1.5G of shared, wired memory,  running
the same synthetic workload in (1) above I am getting SMALL versions of
the misbehavior.  However, while system RAM allocation gets driven
pretty hard and reaches down toward 100MB in some instances it doesn't
get driven hard enough to allocate swap.  The burstiness is very
evident in the iostat figures with spates getting into the single digit
MB/sec range from time to time but it's not enough to drive the system
to a full-on stall.

There's pretty-clearly a bad interaction here between Postgres wiring
memory and the ARC, when the latter is left alone and allowed to do what
it wants.   I'm continuing to work on replicating this on the test
machine... just not completely there yet.


-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC



Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Steven Hartland


- Original Message - 
From: Karl Denninger k...@denninger.net

Where I am right now is this:

1. I *CANNOT* reproduce the spins on the test machine with Postgres
stopped in any way.  Even with multiple ZFS send/recv copies going on
and the load average north of 20 (due to all the geli threads), the
system doesn't stall or produce any notable pauses in throughput.  Nor
does the system RAM allocation get driven hard enough to force paging. 


This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine), allocating ~1.5G of shared, wired memory,  running
the same synthetic workload in (1) above I am getting SMALL versions of
the misbehavior.  However, while system RAM allocation gets driven
pretty hard and reaches down toward 100MB in some instances it doesn't
get driven hard enough to allocate swap.  The burstiness is very
evident in the iostat figures with spates getting into the single digit
MB/sec range from time to time but it's not enough to drive the system
to a full-on stall.

There's pretty-clearly a bad interaction here between Postgres wiring
memory and the ARC, when the latter is left alone and allowed to do what
it wants.   I'm continuing to work on replicating this on the test
machine... just not completely there yet.


Another possibility to consider is how Postgres uses the FS. For example,
does it request sync IO in ways not present in the system without it,
which could be causing the FS, and possibly the underlying disk system,
to behave differently?

One other option to test, just to rule it out: what happens if you
use the 4BSD scheduler instead of ULE?
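
(That's a kernel config change and rebuild -- roughly, with a custom config
named TEST as an example:)

  # in the kernel configuration file, swap the scheduler option:
  #options  SCHED_ULE         # the default
  options   SCHED_4BSD        # traditional 4BSD scheduler
  # then: make buildkernel KERNCONF=TEST && make installkernel KERNCONF=TEST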

   Regards
   Steve





Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Karl Denninger

On 3/7/2013 12:57 PM, Steven Hartland wrote:

 - Original Message - From: Karl Denninger k...@denninger.net
 Where I am right now is this:

 1. I *CANNOT* reproduce the spins on the test machine with Postgres
 stopped in any way.  Even with multiple ZFS send/recv copies going on
 and the load average north of 20 (due to all the geli threads), the
 system doesn't stall or produce any notable pauses in throughput.  Nor
 does the system RAM allocation get driven hard enough to force paging.
 This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
 both stable and solid.

 2. WITH Postgres running as a connected hot spare (identical to the
 production machine), allocating ~1.5G of shared, wired memory,  running
 the same synthetic workload in (1) above I am getting SMALL versions of
 the misbehavior.  However, while system RAM allocation gets driven
 pretty hard and reaches down toward 100MB in some instances it doesn't
 get driven hard enough to allocate swap.  The burstiness is very
 evident in the iostat figures with spates getting into the single digit
 MB/sec range from time to time but it's not enough to drive the system
 to a full-on stall.

 There's pretty-clearly a bad interaction here between Postgres wiring
 memory and the ARC, when the latter is left alone and allowed to do what
 it wants.   I'm continuing to work on replicating this on the test
 machine... just not completely there yet.

 Another possibility to consider is how Postgres uses the FS. For example,
 does it request sync IO in ways not present in the system without it,
 which could be causing the FS, and possibly the underlying disk system,
 to behave differently?

That's possible but not terribly-likely in this particular instance.  
The reason is that I ran into this with the Postgres data store on a UFS
volume BEFORE I converted it.  Now it's on the ZFS pool (with
recordsize=8k as recommended for that filesystem) but when I first ran
into this it was on a separate UFS filesystem (which is where it had
resided for 2+ years without incident), so unless the Postgres
filesystem use on a UFS volume would give ZFS fits it's unlikely to be
involved.
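
(For reference, recordsize is a per-dataset property; the dataset name here is
just an example:)

  zfs set recordsize=8k tank/pgdata    # only affects newly written blocks
  zfs get recordsize tank/pgdata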

 One other option to test, just to rule it out: what happens if you
 use the 4BSD scheduler instead of ULE?

Regards
Steve


I will test that but first I have to get the test machine to reliably
stall so I know I'm not chasing my tail.


-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Steven Hartland


- Original Message - 
From: Karl Denninger k...@denninger.net

To: freebsd-stable@freebsd.org
Sent: Thursday, March 07, 2013 7:07 PM
Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?



On 3/7/2013 12:57 PM, Steven Hartland wrote:


- Original Message - From: Karl Denninger k...@denninger.net

Where I am right now is this:

1. I *CANNOT* reproduce the spins on the test machine with Postgres
stopped in any way.  Even with multiple ZFS send/recv copies going on
and the load average north of 20 (due to all the geli threads), the
system doesn't stall or produce any notable pauses in throughput.  Nor
does the system RAM allocation get driven hard enough to force paging.
This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine), allocating ~1.5G of shared, wired memory,  running
the same synthetic workload in (1) above I am getting SMALL versions of
the misbehavior.  However, while system RAM allocation gets driven
pretty hard and reaches down toward 100MB in some instances it doesn't
get driven hard enough to allocate swap.  The burstiness is very
evident in the iostat figures with spates getting into the single digit
MB/sec range from time to time but it's not enough to drive the system
to a full-on stall.


There's pretty-clearly a bad interaction here between Postgres wiring
memory and the ARC, when the latter is left alone and allowed to do what
it wants.   I'm continuing to work on replicating this on the test
machine... just not completely there yet.


Another possibility to consider is how Postgres uses the FS. For example,
does it request sync IO in ways not present in the system without it,
which could be causing the FS, and possibly the underlying disk system,
to behave differently?


That's possible but not terribly-likely in this particular instance.  
The reason is that I ran into this with the Postgres data store on a UFS

volume BEFORE I converted it.  Now it's on the ZFS pool (with
recordsize=8k as recommended for that filesystem) but when I first ran
into this it was on a separate UFS filesystem (which is where it had
resided for 2+ years without incident), so unless the Postgres
filesystem use on a UFS volume would give ZFS fits it's unlikely to be
involved.


I hate to say it, but that sounds very similar to something we experienced
with a machine here which was running high numbers of rrd updates. Again
we had the issue on UFS and saw the same thing when we moved to ZFS.

I'll leave that there so as not to derail the investigation with what could
be totally irrelevant info, but it may prove an interesting data point
later.

There are obvious common low level points between UFS and ZFS which
may be the cause. One area which springs to mind is device bio ordering
and barriers which could well be impacted by sync IO requests independent
of the FS in use.


One other option to test, just to rule it out: what happens if you
use the 4BSD scheduler instead of ULE?


I will test that but first I have to get the test machine to reliably
stall so I know I'm not chasing my tail.


Very sensible.

Assuming you can reproduce it, one thing that might be interesting to
try is to eliminate all sync IO. I'm not sure if there are options in
Postgres to do this via configuration or if it would require editing
the code but this could reduce the problem space.

If disabling sync IO eliminated the problem it would go a long way
towards proving it isn't the IO volume or pattern per se but instead
something related to the sync nature of said IO.

   Regards
   Steve




Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Karl Denninger

On 3/7/2013 1:27 PM, Steven Hartland wrote:

 - Original Message - From: Karl Denninger k...@denninger.net
 To: freebsd-stable@freebsd.org
 Sent: Thursday, March 07, 2013 7:07 PM
 Subject: Re: ZFS stalls -- and maybe we should be talking about
 defaults?



 On 3/7/2013 12:57 PM, Steven Hartland wrote:

 - Original Message - From: Karl Denninger k...@denninger.net
 Where I am right now is this:

 1. I *CANNOT* reproduce the spins on the test machine with Postgres
 stopped in any way.  Even with multiple ZFS send/recv copies going on
 and the load average north of 20 (due to all the geli threads), the
 system doesn't stall or produce any notable pauses in throughput.  Nor
 does the system RAM allocation get driven hard enough to force paging.
 This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
 both stable and solid.

 2. WITH Postgres running as a connected hot spare (identical to the
 production machine), allocating ~1.5G of shared, wired memory,  running
 the same synthetic workload in (1) above I am getting SMALL versions of
 the misbehavior.  However, while system RAM allocation gets driven
 pretty hard and reaches down toward 100MB in some instances it doesn't
 get driven hard enough to allocate swap.  The burstiness is very
 evident in the iostat figures with spates getting into the single digit
 MB/sec range from time to time but it's not enough to drive the system
 to a full-on stall.

 There's pretty-clearly a bad interaction here between Postgres wiring
 memory and the ARC, when the latter is left alone and allowed to do
 what
 it wants.   I'm continuing to work on replicating this on the test
 machine... just not completely there yet.

 Another possibility to consider is how Postgres uses the FS. For example,
 does it request sync IO in ways not present in the system without it,
 which could be causing the FS, and possibly the underlying disk system,
 to behave differently?

 That's possible but not terribly-likely in this particular instance. 
 The reason is that I ran into this with the Postgres data store on a UFS
 volume BEFORE I converted it.  Now it's on the ZFS pool (with
 recordsize=8k as recommended for that filesystem) but when I first ran
 into this it was on a separate UFS filesystem (which is where it had
 resided for 2+ years without incident), so unless the Postgres
 filesystem use on a UFS volume would give ZFS fits it's unlikely to be
 involved.

 I hate to say it, but that sounds very familiar to something we
 experienced
 with a machine here which was running high numbers of rrd updates. Again
 we had the issue on UFS and saw the same thing when we moved the ZFS.

 I'll leave that there as to not derail the investigation with what could
 be totally irrelavent info, but it may prove an interesting data point
 later.

 There are obvious common low level points between UFS and ZFS which
 may be the cause. One area which springs to mind is device bio ordering
 and barriers which could well be impacted by sync IO requests independent
 of the FS in use.

 One other option to test, just to rule it out: what happens if you
 use the 4BSD scheduler instead of ULE?

 I will test that but first I have to get the test machine to reliably
 stall so I know I'm not chasing my tail.

 Very sensible.

 Assuming you can reproduce it, one thing that might be interesting to
 try is to eliminate all sync IO. I'm not sure if there are options in
 Postgres to do this via configuration or if it would require editing
 the code but this could reduce the problem space.

 If disabling sync IO eliminated the problem it would go a long way
 towards proving it isn't the IO volume or pattern per se but instead
 something related to the sync nature of said IO.

That can be turned off in the Postgres configuration.  For obvious
reasons it's a very bad idea but it is able to be disabled without
actually changing the code itself.

I don't know if it shuts off ALL sync requests, but the documentation
says it does.
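
(For reference, the knobs involved live in postgresql.conf -- a minimal sketch,
and strictly for testing since it risks data loss on a crash:)

  fsync = off                # stop issuing fsync() against the data files
  synchronous_commit = off   # commits no longer wait for the WAL flush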

It's interesting that you ran into this with RRD going; the machine in
question does pull RRD data for Cacti, but it's such a small piece of
the total load profile that I considered it immaterial.

It might not be.

-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-07 Thread Steven Hartland
- Original Message - 
From: Karl Denninger k...@denninger.net



I will test that but first I have to get the test machine to reliably
stall so I know I'm not chasing my tail.


Very sensible.

Assuming you can reproduce it, one thing that might be interesting to
try is to eliminate all sync IO. I'm not sure if there are options in
Postgres to do this via configuration or if it would require editing
the code but this could reduce the problem space.

If disabling sync IO eliminated the problem it would go a long way
towards proving it isn't the IO volume or pattern per se but instead
something related to the sync nature of said IO.


That can be turned off in the Postgres configuration.  For obvious
reasons it's a very bad idea but it is able to be disabled without
actually changing the code itself.

I don't know if it shuts off ALL sync requests, but the documentation
says it does.

It's interesting that you ran into this with RRD going; the machine in
question does pull RRD data for Cacti, but it's such a small piece of
the total load profile that I considered it immaterial.

It might not be.


We never did get to the bottom of it but did come up with a fix.

Instead of using straight RRD interaction we switched all our code to
use rrdcached and put the files on an SSD-based pool; we've never had an
issue since.
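
(Roughly, assuming the rrdtool port's rrdcached and an SSD-backed path -- the
flags and paths here are just an example:)

  rrdcached -l unix:/var/run/rrdcached.sock \
            -j /ssd/rrd/journal -b /ssd/rrd -B -w 300 -z 60
  # point rrdtool/collectors at the daemon:
  export RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock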

   Regards
   Steve




Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-06 Thread Daniel Kalchev


On 06.03.13 02:42, Steven Hartland wrote:


- Original Message - From: Daniel Kalchev

On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:


I say that knowing lots of people use ZFS-on-root, which is great -- I
just wonder how many of them have tested all the crazy scenarios and
then tried to boot from things.


I have verified that ZFS-on-root works reliably in all of the following
scenarios: single disk, one mirror vdev, many mirror vdevs, raidz.
Haven't found the time to test many raidz vdevs, I admit. :)


One thing to watch out for is the available BIOS boot disks. If you try
to do a large RAIDZ with lots of disks as your root pool you're likely to
run into problems, not because of any ZFS issue but simply because the
disks the BIOS sees, and hence tries to boot from, may not be what you expect.


A prudent system administrator should understand this issue and verify
that whatever (boot) architecture they come up with is supported by
their particular hardware and firmware. This is no different for ZFS
than for any other case.


The second-stage boot-from-ZFS loader in FreeBSD could in fact end up with
its own drive detection code one day, which would eliminate its dependence
on the BIOS entirely. For relatively small systems, where the administrator
might be careless enough not to consider all scenarios, today's BIOSes
already provide support for enough devices (e.g. most motherboards provide
4-6 SATA ports, etc.).


Using separate boot pools of just a few devices is what I do for large
storage boxes too, mostly because I want to be able to fiddle with the data
disks without worrying that it might impact the OS. Just make sure the BIOS
does see these in the drive list it creates. That is, don't put the boot
disks at the last positions in your chassis :) -- use the on-board SATA
ports that are scanned first. Sadly, almost every vendor places such drives
inside the chassis, which makes it very inconvenient if one of them dies.


Daniel


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-06 Thread John-Mark Gurney
Karl Denninger wrote this message on Tue, Mar 05, 2013 at 06:56 -0600:
 When it happens on my system anything that is CPU-bound continues to
 execute.  I can switch consoles and network I/O also works.  If I have
 an iostat running at the time all I/O counters go to and remain at zero
 while the stall is occurring, but the process that is producing the
 iostat continues to run and emit characters whether it is a ssh session
 or on the physical console.  
 
 The CPUs are running and processing, but all threads block if they
 attempt access to the disk I/O subsystem, irrespective of the portion of
 the disk I/O subsystem they attempt to access (e.g. UFS, swap or ZFS)  I
 therefore cannot start any new process that requires image activation.

Since it seems like there is a thread that is spinning... has anyone
thought to modify kgdb to mlockall its memory and run it against the
current system (kgdb /boot/kernel/kernel /dev/mem), and then, when the
thread goes busy, use kgdb to see where it's spinning?

Just a thought...
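
(Even without the mlockall change, the inspection itself would look roughly
like this:)

  kgdb /boot/kernel/kernel /dev/mem
  (kgdb) info threads      # find the thread burning CPU
  (kgdb) thread <N>
  (kgdb) bt                # the backtrace shows where it is spinning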

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-06 Thread Peter Jeremy
On 2013-Mar-04 16:48:18 -0600, Karl Denninger k...@denninger.net wrote:
The subject machine in question has 12GB of RAM and dual Xeon
5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
local cache and the BBU for it.  The ZFS spindles are all exported as
JBOD drives.  I set up four disks under GPT, each with a single freebsd-zfs
partition added and labeled; the providers are then geli-encrypted and added
to the pool.

What sort of disks?  SAS or SATA?

also known good.  I began to get EXTENDED stalls with zero I/O going on,
some lasting for 30 seconds or so.  The system was not frozen but
anything that touched I/O would lock until it cleared.  Dedup is off,
incidentally.

When the system has stalled:
- Do you see very low free memory?
- What happens to all the different CPU utilisation figures?  Do they
  all go to zero?  Do you get high system or interrupt CPU (including
  going to 1 core's worth)?
- What happens to interrupt load?  Do you see any disk controller
  interrupts?

Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
and see if you get any errors when stalls happen.

On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick j...@koitsu.org wrote:
On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
 Completely unrelated to the main thread:
 
 on 05/03/2013 07:32 Jeremy Chadwick said the following:
  That said, I still do not recommend ZFS for a root filesystem
 Why?
Too long a history of problems with it and weird edge cases (keep
reading); the last thing an administrator wants to deal with is a system
where the root filesystem won't mount/can't be used.  It makes
recovery or problem-solving (i.e. the server is not physically accessible
given geographic distances) very difficult.

I've had lots of problems with a gmirrored UFS root as well.  The
biggest issue is that gmirror has no audit functionality so you
can't verify that both sides of a mirror really do have the same data.

My point/opinion: UFS for a root filesystem is guaranteed to work
without any fiddling about and, barring drive failures or controller
issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.

AFAIK, you can't boot from anything other than a single disk (ie no
graid).

-- 
Peter Jeremy




Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Steven Hartland


- Original Message - 
From: Jeremy Chadwick j...@koitsu.org

To: Ben Morrow b...@morrow.me.uk
Cc: freebsd-stable@freebsd.org
Sent: Tuesday, March 05, 2013 5:32 AM
Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?



On Tue, Mar 05, 2013 at 05:05:47AM +, Ben Morrow wrote:

Quoth Karl Denninger k...@denninger.net:
 
 Note that the machine is not booting from ZFS -- it is booting from and

 has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
 like a single da0 drive to the OS) and that drive stalls as well when
 it freezes.  It's definitely a kernel thing when it happens as the OS
 would otherwise not have locked (just I/O to the user partitions) -- but
 it does. 


Is it still the case that mixing UFS and ZFS can cause problems, or were
they all fixed? I remember a while ago (before the arc usage monitoring
code was added) there were a number of reports of serious problems
running an rsync from UFS to ZFS.


This problem still exists on stable/9.  The behaviour manifests itself
as fairly bad performance (I cannot remember if stalling or if just
throughput rates were awful).  I can only speculate as to what the root
cause is, but my guess is that it has something to do with the two
caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.


In our case we have no UFS, so this isn't the cause of the stalls.
Spec here is
* 64GB RAM
* LSI 2008
* 8.3-RELEASE
* Pure ZFS
* Trigger MySQL doing a DB import, nothing else running.
* 4K disk alignment

   Regards
   Steve




Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 09:12:47AM -, Steven Hartland wrote:
 
 - Original Message - From: Jeremy Chadwick
 j...@koitsu.org
 To: Ben Morrow b...@morrow.me.uk
 Cc: freebsd-stable@freebsd.org
 Sent: Tuesday, March 05, 2013 5:32 AM
 Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?
 
 
 On Tue, Mar 05, 2013 at 05:05:47AM +, Ben Morrow wrote:
 Quoth Karl Denninger k...@denninger.net:
   Note that the machine is not booting from ZFS -- it is
 booting from and
  has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
  like a single da0 drive to the OS) and that drive stalls as well when
  it freezes.  It's definitely a kernel thing when it happens as the OS
  would otherwise not have locked (just I/O to the user partitions) -- but
  it does.
 
 Is it still the case that mixing UFS and ZFS can cause problems, or were
 they all fixed? I remember a while ago (before the arc usage monitoring
 code was added) there were a number of reports of serious problems
 running an rsync from UFS to ZFS.
 
 This problem still exists on stable/9.  The behaviour manifests itself
 as fairly bad performance (I cannot remember if stalling or if just
 throughput rates were awful).  I can only speculate as to what the root
 cause is, but my guess is that it has something to do with the two
 caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.
 
 In our case we have no UFS, so this isn't the cause of the stalls.
 Spec here is
 * 64GB RAM
 * LSI 2008
 * 8.3-RELEASE
 * Pure ZFS
 * Trigger MySQL doing a DB import, nothing else running.
 * 4K disk alignment

1. Is compression enabled?  Has it ever been enabled (on any fs) in the
past (barring pool being destroyed + recreated)?

2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
(barring pool being destroyed + recreated)?

I can speculate day and night about what could cause this kind of issue,
honestly.  The possibilities are quite literally infinite, and all of
them require folks deeply familiar with both FreeBSD's ZFS as well as
very key/major parts of the kernel (ranging from VM to interrupt
handlers to I/O subsystem).  (This next comment isn't for you, Steve,
you already know this :-) )  The way different pieces of the kernel
interact with one another is fairly complex; the kernel is not simple.

Things I think that might prove useful:

* Describing the stall symptoms; what all does it impact?  Can you
  switch VTYs on console when it's happening?  Network I/O (e.g. SSH'd
  into the same box and just holding down a letter) showing stalls
  then catching up?  Things of this nature.
* How long the stall is in duration (ex. if there's some way to
  roughly calculate this using date in a shell script)
* Contents of /etc/sysctl.conf and /boot/loader.conf (re: tweaking
  of the system)
* sysctl -a | grep zfs before and after a stall -- do not bother
  with those ARC summaries scripts please, at least not for this
* vmstat -z before and after a stall
* vmstat -m before and after a stall
* vmstat -s before and after a stall
* vmstat -i before, after, AND during a stall
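
(To make the before/after capture repeatable, a rough script along these lines
works; the output path is just an example:)

  #!/bin/sh
  # dump the counters listed above into a timestamped directory
  DIR=/var/tmp/stall-$(date +%Y%m%d-%H%M%S)
  mkdir -p "$DIR"
  sysctl -a | grep zfs > "$DIR/sysctl-zfs.txt"
  vmstat -z > "$DIR/vmstat-z.txt"
  vmstat -m > "$DIR/vmstat-m.txt"
  vmstat -s > "$DIR/vmstat-s.txt"
  vmstat -i > "$DIR/vmstat-i.txt"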

Basically, every person who experiences this problem needs to treat
every situation uniquely -- no me too -- and try to find reliable 100%
test cases for it.  That's the only way bugs of this nature (i.e.
of a complex nature) get fixed.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Andriy Gapon

Completely unrelated to the main thread:

on 05/03/2013 07:32 Jeremy Chadwick said the following:
 That said, I still do not recommend ZFS for a root filesystem

Why?

 (this biting people still happens even today)

What exactly?

 - Disks are GPT and are *partitioned, and ZFS refers to the partitions
   not the raw disk -- this matters (honest, it really does; the ZFS
   code handles things differently with raw disks)

Not on FreeBSD as far I can see.


P.S. I completely agree with your suggestions on simplifying the setup and
gathering objective information for the purpose of debugging the issue.
I also completely agree that me too-ing is not very useful (and often 
completely
incorrect) for the complex problems like this one.
Thank you.
-- 
Andriy Gapon


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Karl Denninger
On 3/5/2013 3:27 AM, Jeremy Chadwick wrote:
 On Tue, Mar 05, 2013 at 09:12:47AM -, Steven Hartland wrote:
 - Original Message - From: Jeremy Chadwick
 j...@koitsu.org
 To: Ben Morrow b...@morrow.me.uk
 Cc: freebsd-stable@freebsd.org
 Sent: Tuesday, March 05, 2013 5:32 AM
 Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?


 On Tue, Mar 05, 2013 at 05:05:47AM +, Ben Morrow wrote:
 Quoth Karl Denninger k...@denninger.net:
 Note that the machine is not booting from ZFS -- it is
 booting from and
 has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
 like a single da0 drive to the OS) and that drive stalls as well when
 it freezes.  It's definitely a kernel thing when it happens as the OS
 would otherwise not have locked (just I/O to the user partitions) -- but
 it does.
 Is it still the case that mixing UFS and ZFS can cause problems, or were
 they all fixed? I remember a while ago (before the arc usage monitoring
 code was added) there were a number of reports of serious problems
 running an rsync from UFS to ZFS.
 This problem still exists on stable/9.  The behaviour manifests itself
 as fairly bad performance (I cannot remember if stalling or if just
 throughput rates were awful).  I can only speculate as to what the root
 cause is, but my guess is that it has something to do with the two
 caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.
 In our case we have no UFS, so this isn't the cause of the stalls.
 Spec here is
 * 64GB RAM
 * LSI 2008
 * 8.3-RELEASE
 * Pure ZFS
 * Trigger MySQL doing a DB import, nothing else running.
 * 4K disk alignment
 1. Is compression enabled?  Has it ever been enabled (on any fs) in the
 past (barring pool being destroyed + recreated)?

 2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
 (barring pool being destroyed + recreated)?

 I can speculate day and night about what could cause this kind of issue,
 honestly.  The possibilities are quite literally infinite, and all of
 them require folks deeply familiar with both FreeBSD's ZFS as well as
 very key/major parts of the kernel (ranging from VM to interrupt
 handlers to I/O subsystem).  (This next comment isn't for you, Steve,
 you already know this :-) )  The way different pieces of the kernel
 interact with one another is fairly complex; the kernel is not simple.

 Things I think that might prove useful:

 * Describing the stall symptoms; what all does it impact?  Can you
   switch VTYs on console when its happening?  Network I/O (e.g. SSH'd
   into the same box and just holding down a letter) showing stalls
   then catching up?  Things of this nature.
When it happens on my system anything that is CPU-bound continues to
execute.  I can switch consoles and network I/O also works.  If I have
an iostat running at the time all I/O counters go to and remain at zero
while the stall is occurring, but the process that is producing the
iostat continues to run and emit characters whether it is a ssh session
or on the physical console.  

The CPUs are running and processing, but all threads block if they
attempt access to the disk I/O subsystem, irrespective of the portion of
the disk I/O subsystem they attempt to access (e.g. UFS, swap or ZFS).  I
therefore cannot start any new process that requires image activation.

 * How long the stall is in duration (ex. if there's some way to
   roughly calculate this using date in a shell script)
They're variable.  Some last fractions of a second and are not really
all that noticeable unless you happen to be paying CLOSE attention. 
Some last a few (5 or so) seconds.  The really bad ones last long enough
that the kernel throws the message swap_pager: indefinite wait buffer.

The machine in the general sense never pages.  It contains 12GB of RAM
but historically (prior to ZFS being put into service) always showed 0
for a pstat -s, although it does have a 20g raw swap partition (to
/dev/da0s1b, not to a zpool) allocated.

During the stalls I cannot run a pstat (I tried; it stalls) but when it
unlocks I find that there is swap allocated, albeit not a ridiculous
amount.  ~20,000 pages or so have made it to the swap partition. This is
not behavior that I had seen before on this machine prior to the stall
problem, and with the two tuning tweaks discussed here I'm now up to 48
hours without any allocation to swap (or any stalls.)

 * Contents of /etc/sysctl.conf and /boot/loader.conf (re: tweaking
   of the system)
/boot/loader.conf:

kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256
geom_eli_load=YES
sound_load=YES
#
# Limit to physical CPU count for threads
#
kern.geom.eli.threads=8
#
# ZFS Prefetch does help, although you'd think it would not due to the
# adapter doing it already.  Wrong guess; it's good for 2x the performance.
# We limit the ARC to 2GB of RAM and the TXG write limit to 1GB.
#
#vfs.zfs.prefetch_disable=1
vfs.zfs.arc_max=20
vfs.zfs.write_limit_override=102400
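
(A quick way to confirm the cap took effect after reboot, and to see what the
ARC is actually using at any given moment:)

  sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.size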

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Gary Palmer
On Tue, Mar 05, 2013 at 12:40:38AM -0500, Garrett Wollman wrote:
 In article 8c68812328e3483ba9786ef155911...@multiplay.co.uk,
 kill...@multiplay.co.uk writes:
 
 Now interesting you should say that I've seen a stall recently on ZFS
 only box running on 6 x SSD RAIDZ2.
 
 The stall was caused by fairly large mysql import, with nothing else
 running.
 
 Then it happened I thought the machine had wedged, but minutes (not
 seconds) later, everything sprung into action again.
 
 I have certainly seen what you might describe as stalls, caused, so
 far as I can tell, by kernel memory starvation.  I've seen it take as
 much as a half an hour to recover from these (which is too long for my
 users).  Right now I have the ARC limited to 64 GB (on a 96 GB file
 server) and that has made it more stable, but it's still not behaving
 quite as I would like, and I'm looking to put more memory into the
 system (to be used for non-ARC functions).  Looking at my munin
 graphs, I find that backups in particular put very heavy pressure on,
 doubling the UMA allocations over steady-state, and this takes about
 four or five hours to climb back down.  See
 http://people.freebsd.org/~wollman/vmstat_z-day.png for an example.
 
 Some of the stalls are undoubtedly caused by internal fragmentation
 rather than actual data in use.  (Solaris used to have this issue, and
 some hooks were added to allow some amount of garbage collection with
 the cooperation of the filesystem.)

Just as a note: there was a page I read in the past few months
that pointed out that having a huge ARC may not always be in the best
interests of the system.  Some operation on the filesystem (I forget
what, apologies) caused the system to churn through the ARC and discard
most of it, while regular I/O was blocked.

Unfortunately I cannot remember where I found that page now and I don't
appear to have bookmarked it.

From what has been said in this thread I'm not convinced that people
are hitting this issue, however I would like to raise it for
consideration.

Regards,

Gary


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Freddie Cash
On Tue, Mar 5, 2013 at 7:22 AM, Gary Palmer gpal...@freebsd.org wrote:

 Just as a note that there was a page I read in the past few months
 that pointed out that having a huge ARC may not always be in the best
 interests of the system.  Some operation on the filesystem (I forget
 what, apologies) caused the system to churn through the ARC and discard
 most of it, while regular I/O was blocked


Huh.  What timing.  I've been fighting with our largest ZFS box (128 GB of
RAM, 16 CPU cores, 2x SSD for SLOG, 2x SSD for L2ARC, 45x 2 TB HD for pool
in 6-drive raidz2 vdevs) for the past week trying to figure out why ZFS
send/recv just hangs after a while.  Everything is stuck in D in ps ax
output, and top shows the l2arc_feed_ thread using 100% of one CPU.  Even
removing the L2ARC devices from the pool doesn't help; it just delays the
hang.

ARC was configured for 120 GB, with arc_meta_limit set to 90 GB.  Yes,
dedup and compression are enabled (it's a backups storage box, and we get
over 5x combined dedup/compress ratio).  After several hours of running,
the ARC and wired would get up to 100+ GB, and the box would spend most of
its time spinning, with almost 0 I/O to the pool (only a few KB/s of
reads in zpool iostat 1 or gstat).

ZFS send/recv would eventually complete, but what used to take 15-20
minutes would take 6-8 hours to complete.

I've reduced the ARC to only 32 GB, with arc_meta set to 28 GB, and things
are running much smoother now (50-200 MB/s writes for 3-5 seconds every
10s), and send/recv is back down to 10-15 minutes.
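
(For reference, those caps boil down to a couple of loader.conf lines; the
values are in bytes:)

  # 32 GB ARC cap
  vfs.zfs.arc_max=34359738368
  # 28 GB metadata cap
  vfs.zfs.arc_meta_limit=30064771072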

Who would have thought too much RAM would be an issue?

Will play with this over the next couple of days with different ARC max
settings to see where the problems start.  All of our ZFS boxes until this
one had under 64 GB of RAM.  (And we had issues with dedupe enabled on
boxes with too little RAM, as in under 32 GB.)

-- 
Freddie Cash
fjwc...@gmail.com


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
 Completely unrelated to the main thread:
 
 on 05/03/2013 07:32 Jeremy Chadwick said the following:
  That said, I still do not recommend ZFS for a root filesystem
 
 Why?

Too long a history of problems with it and weird edge cases (keep
reading); the last thing an administrator wants to deal with is a system
where the root filesystem won't mount/can't be used.  It makes
recovery or problem-solving (i.e. the server is not physically accessible
given geographic distances) very difficult.

Are there still issues booting from raidzX or stripes or root pools with
multiple vdevs?  What about with cache or log devices?

My point/opinion: UFS for a root filesystem is guaranteed to work
without any fiddling about and, barring drive failures or controller
issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.

I say that knowing lots of people use ZFS-on-root, which is great -- I
just wonder how many of them have tested all the crazy scenarios and
then tried to boot from things.

  (this biting people still happens even today)
 
 What exactly?

http://lists.freebsd.org/pipermail/freebsd-questions/2013-February/249363.html
http://lists.freebsd.org/pipermail/freebsd-questions/2013-February/249387.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072398.html

The last one got solved:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072406.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072408.html

I know factually you're aware of the zpool.cache ordeal (which may or
may not be the cause of the issue shown in the 2nd URL above), but my
point is that still at this moment in time -- barring someone using a
stable/9 ISO for installation -- there still seem to be issues.

Things of this nature on the mailing lists that go unanswered or never
reach closure are numerous, and that just adds to my concern.

  - Disks are GPT and are *partitioned, and ZFS refers to the partitions
not the raw disk -- this matters (honest, it really does; the ZFS
code handles things differently with raw disks)
 
 Not on FreeBSD as far I can see.

My statement comes from here (first line in particular):

http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html

If this is wrong/false, then this furthers my point about kernel folks
who are in-the-know needing to chime in and help stop the
misinformation.  The rest of us are just end-users, often misinformed.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Freddie Cash
On Tue, Mar 5, 2013 at 2:09 PM, Jeremy Chadwick j...@koitsu.org wrote:

 On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:

   - Disks are GPT and are *partitioned, and ZFS refers to the partitions
 not the raw disk -- this matters (honest, it really does; the ZFS
 code handles things differently with raw disks)
 
  Not on FreeBSD as far I can see.

 My statement comes from here (first line in particular):


 http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html

 If this is wrong/false, then this furthers my point about kernel folks
 who are in-the-know needing to chime in and help stop the
 misinformation.  The rest of us are just end-users, often misinformed.


This has been false from the very first import of ZFS into FreeBSD
7-STABLE.  Pawel even mentioned, somewhere around that time frame, that GEOM
allows ZFS to use the disk cache on partitions.  Considering he did the
initial import of ZFS into FreeBSD, I don't think you can find a more
canonical answer.  :)

This is one of the biggest differences between the Solaris-based ZFS and
the FreeBSD-based ZFS.

It's too bad this mis-information has basically become a meme.  :(

-- 
Freddie Cash
fjwc...@gmail.com


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 02:18:30PM -0800, Freddie Cash wrote:
 On Tue, Mar 5, 2013 at 2:09 PM, Jeremy Chadwick j...@koitsu.org wrote:
 
  On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
 
- Disks are GPT and are *partitioned, and ZFS refers to the partitions
  not the raw disk -- this matters (honest, it really does; the ZFS
  code handles things differently with raw disks)
  
   Not on FreeBSD as far I can see.
 
  My statement comes from here (first line in particular):
 
 
  http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html
 
  If this is wrong/false, then this furthers my point about kernel folks
  who are in-the-know needing to chime in and help stop the
  misinformation.  The rest of us are just end-users, often misinformed.
 
 This has been false from the very first import of ZFS into FreeBSD
 7-STABLE.  Pawel even mentions that GEOM allows the use of the cache on
 partitions with ZFS somewhere around that time frame.  Considering he did
 the initial import of ZFS into FreeBSD, I don't think you can find a more
 canonical answer.  :)
 
 This is one of the biggest differences between the Solaris-based ZFS and
 the FreeBSD-based ZFS.

This is good (excellent) information to know -- thank you for clearing
that up.

 It's too bad this mis-information has basically become a meme.  :(

Such is the case with FreeBSD's ZFS in general, solely because the
number of people who can answer the deep technical questions is small.

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Daniel Kalchev

On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:

 I say that knowing lots of people use ZFS-on-root, which is great -- I
 just wonder how many of them have tested all the crazy scenarios and
 then tried to boot from things.

I have verified that ZFS-on-root works reliably in all of the following 
scenarios: single disk, one mirror vdev, many mirror vdevs, raidz. Haven't 
found the time to test many raidz vdevs, I admit. :)

Combined with boot environments (which can be managed in many different ways),
ZFS on root is nothing short of a miracle.

ZFS on FreeBSD has some issues, mostly with huge installations and 
defaults/tuning, but not really with ZFS-on-root.

Of course, if, for example, you follow stable, you should be prepared with
alternative boot media that supports the current zpool/zfs versions. But this
is a small cost to pay.

Daniel


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Daniel Kalchev

On Mar 5, 2013, at 8:17 PM, Freddie Cash fjwc...@gmail.com wrote:

 
 ZFS send/recv would eventually complete, but what used to take 15-20
 minutes would take 6-8 hours to complete.
 
 I've reduced the ARC to only 32 GB, with arc_meta set to 28 GB, and things
 are running much smoother now (50-200 MB/s writes for 3-5 seconds every
 10s), and send/recv is back down to 10-15 minutes.
 
 Who would have thought too much RAM would be an issue?
 
 Will play with this over the next couple of days with different ARC max
 settings to see where the problems start.  All of our ZFS boxes until this
 one had under 64 GB of RAM.  (And we had issues with dedupe enabled on
 boxes with too little RAM, as in under 32 GB.)

I have an archive box running a very similar setup to yours, but with 72GB of
RAM. I have set both arc_max and arc_meta_limit to 64GB, with no issues. I am
still doing a very complex snapshot reordering between two pools. One of the
pools has dedup enabled (which prompted me to add RAM), with a dedup ratio of
over 10x, and there are still no issues or any stalling. The other pool has
both dedup and compression for some filesystems.

My only issue is that replacing a drive in either pool takes a few days
(6-drive vdevs of 3TB drives).

Perhaps the memory indexing/search algorithms are inefficient?

Daniel


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Steven Hartland


- Original Message - 
From: Daniel Kalchev

On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:


I say that knowing lots of people use ZFS-on-root, which is great -- I
just wonder how many of them have tested all the crazy scenarios and
then tried to boot from things.


I have verified that ZFS-on-root works reliably in all of the following
scenarios: single disk, one mirror vdev, many mirror vdevs, raidz.
Haven't found the time to test many raidz vdevs, I admit. :)


One thing to watch out for is the available BIOS boot disks. If you try
to do a large RAIDZ with lots of disks as your root pool you're likely to
run into problems, not because of any ZFS issue but simply because the
disks the BIOS sees, and hence tries to boot from, may not be what you expect.

It won't necessarily hit you when you first install, either; add more
disks at a later date to a multi-controller LSI 2008 machine and you
can end up unable to specify the correct set of boot disks in the BIOS.
Yes, learned that one the hard way :(

For larger storage boxes we've taken to using two SSDs, partitioned
and used for both boot and ZIL; as neither requires a massive amount of
space they are a nice fit together.
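
(Roughly, per SSD -- the partition sizes and labels here are only examples:)

  gpart create -s gpt ada0
  gpart add -t freebsd-boot -s 512k ada0
  gpart add -t freebsd-zfs -s 30g -l boot0 ada0   # slice for the boot pool
  gpart add -t freebsd-zfs -s 8g -l zil0 ada0     # slice for the ZIL
  gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
  # repeat on the second SSD (boot1/zil1), then mirror the log:
  zpool add tank log mirror gpt/zil0 gpt/zil1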


Combined with boot environments (that can be served many different
ways), ZFS on root is short of a miracle.

ZFS on FreeBSD has some issues, mostly with huge installations and
defaults/tuning, but not really with ZFS-on-root.

Of course, if for example, you follow stable, you should be prepared
with alternative boot media that supports the current zpool/zfs versions.
But this is small cost to pay.


For anyone looking to do a ZFS-only install I would definitely recommend
they look at http://mfsbsd.vx.sk/ -- with this little gem plus a custom
script for our env it takes a few minutes from boot to installed machine.

It's also our go-to rescue disk. Forget messing around with the standard
ISOs and their rescue option, which never worked for me when I needed it;
this is a fully working OS with all the tools you'll want when things go
wrong, and if something is missing it's easy to compile and build your
own version.

   Regards
   Steve




Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Ben Morrow
Quoth Steven Hartland kill...@multiplay.co.uk:
 - Original Message - 
 From: Daniel Kalchev
  On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:
  
  I say that knowing lots of people use ZFS-on-root, which is great -- I
  just wonder how many of them have tested all the crazy scenarios and
  then tried to boot from things.
  
  I have verified that ZFS-on-root works reliably in all of the following
  scenarios: single disk, one mirror vdev, many mirror vdevs, raidz.
  Haven't found the time to test many raidz vdevs, I admit. :)
 
 One thing to watch out for is the available BIOS boot disks. If you try
 to do a large RAIDZ with lots of disk as you root pool your likely to
 run into problems not because of any ZFS issue but simply because the
 disks the BIOS sees and hence tries to boot may not be what you expect.

IIRC the Sun documentation recommends keeping the root pool separate
from the data pools in any case.

Ben

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 06:56:02AM -0600, Karl Denninger wrote:
 { I've snipped lots of text.  For those who are reading this follow-up }
 { and wish to read the snipped portions, please see this URL: }
 { http://lists.freebsd.org/pipermail/freebsd-stable/2013-March/072696.html }

  1. Is compression enabled?  Has it ever been enabled (on any fs) in the
  past (barring pool being destroyed + recreated)?
 
  2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
  (barring pool being destroyed + recreated)?

No answers to questions #1 and #2?  (Edit: see below, I believe it's
implied neither are used)

  * Describing the stall symptoms; what all does it impact?  Can you
switch VTYs on console when its happening?  Network I/O (e.g. SSH'd
into the same box and just holding down a letter) showing stalls
then catching up?  Things of this nature.
 When it happens on my system anything that is CPU-bound continues to
 execute.  I can switch consoles and network I/O also works.

Okay, it sounds like compression and dedup aren't in use/have never been
used.  The stalling problem with compression and dedup (e.g. if you use
either of these features, and it worsens if you use both) results in a
full/hard system stall where *everything* is impacted, and has been
explained in the past (2nd URL has the explanation):

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html 
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html 
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html 

 If I have an iostat running at the time all I/O counters go to and
 remain at zero while the stall is occurring, but the process that is
 producing the iostat continues to run and emit characters whether it
 is a ssh session or on the physical console.  

What kind of an iostat?  iostat(8) or zpool iostat?

(Edit: last paragraph of this response says zpool iostat, which is not
the same thing as iostat)

Why not gstat(8), e.g. gstat -I500ms, as well?  This provides the I/O
statistics at a deeper layer, not the ZFS layer.
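
For example (intervals arbitrary), running both in separate terminals makes it
easy to compare the GEOM layer with the ZFS layer while a stall is in progress:

gstat -I 500ms        # per-provider I/O at the GEOM layer, 0.5 s samples
zpool iostat -v 1     # per-vdev I/O as ZFS reports it, 1 s samples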

Do the numbers actually change **while the system is stalling**?

The answer matters greatly, because it would help indicate if some
kernel API requests for I/O statistics are also blocking, or if only
*actual I/O (e.g. read() and write() requests)* are blocking.

 The CPUs are running and processing, but all threads block if they
 attempt access to the disk I/O subsystem, irrespective of the portion
 of the disk I/O subsystem they attempt to access (e.g. UFS, swap or
 ZFS)  I therefore cannot start any new process that requires image
 activation.

And now you'll need to provide a full diagram of your disk and
controller device tree, along with all partitions, slices, and
filesystem types.  It's best to draw this in ASCII in a tree-like
diagram.  It will take you 15-20 minutes to do.
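
Something along these lines, loosely based on details mentioned elsewhere in
the thread (names and topology here are illustrative only; the real layout is
the poster's to confirm):

arcmsr0 (ARECA 1680ix, 2GB BBU cache)
  da0          controller RAID1 volume -> da0s1a UFS+SU /, da0s1b swap
  da1 .. da4   JBOD passthrough -> GPT -> freebsd-zfs partition -> label
               -> geli (.eli) -> zpool vdev members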

What's even more concerning:

This thread is about ZFS, yet you're saying applications block when they
attempt to do I/O to a filesystem ***other than ZFS***.  There must be
some kind of commonality here, i.e. a single controller is driving both
the ZFS and UFS disks, or something along those lines.  If there isn't,
then there is something within the kernel I/O subsystem that is doing
this.  Like I said: very deep, very knowledgeable kernel folks are the
only ones who can fix this.

  * How long the stall is in duration (ex. if there's some way to
roughly calculate this using date in a shell script)
 They're variable.  Some last fractions of a second and are not really
 all that noticeable unless you happen to be paying CLOSE attention. 
 Some last a few (5 or so) seconds.  The really bad ones last long enough
 that the kernel throws the message swap_pager: indefinite wait buffer.

The message swap_pager: indefinite wait buffer indicates that some
part of the VM is trying to offload pages of memory to swap via standard
I/O write requests, and those writes have not come back within kern.hz*20
seconds.  That's a very, very long time.
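
(On the earlier question of roughly timing these with date(1) in a shell
script -- a minimal sketch; the probe path and threshold are arbitrary:)

#!/bin/sh
# Touch a file on the affected pool once a second and log any attempt
# that takes suspiciously long to return.
while :; do
    t0=$(date +%s)
    touch /tank/.stallprobe       # any I/O reportedly blocks during a stall
    t1=$(date +%s)
    [ $((t1 - t0)) -ge 2 ] && echo "$(date): I/O took $((t1 - t0)) seconds"
    sleep 1
done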

 The machine in the general sense never pages.  It contains 12GB of RAM
 but historically (prior to ZFS being put into service) always showed 0
 for a pstat -s, although it does have a 20g raw swap partition (to
 /dev/da0s1b, not to a zpool) allocated.

The swap_pager message implies otherwise.  It may be that the programs
you're using poll at intervals of, say, 1 second, and swap-out + swap-in
occurs very quickly so you never see it.  (Edit: next quoted paragraph
shows that there ARE pages of memory hitting swap, so "never pages" is
false).

I do not know the VM subsystem well enough to know what the criteria are
for offloading pages of memory to swap -- but it's obviously happening.
It may be due to memory pressure, or it may be due to pages which have
not been touched in a long while -- again, I do not know.  This is
where vmstat -s would be useful.  Possibly Alan Cox knows.

 During the stalls I cannot run a pstat (I tried; it stalls) 

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-05 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 09:08:09PM -0800, Jeremy Chadwick wrote:
   * How long the stall is in duration (ex. if there's some way to
 roughly calculate this using date in a shell script)
  They're variable.  Some last fractions of a second and are not really
  all that noticeable unless you happen to be paying CLOSE attention. 
  Some last a few (5 or so) seconds.  The really bad ones last long enough
  that the kernel throws the message swap_pager: indefinite wait buffer.
 
 The message swap_pager: indefinite wait buffer indicates that some
 part of the VM is trying to offload pages of memory to swap via standard
 I/O write requests, and those writes have not come back within kern.hz*20
 seconds.  That's a very, very long time.

Two clarification points:

1. The timeout value is passed to msleep(9) and is literally kern.hz*20.
Per sys/vm/swap_pager.c:

   1216 if (msleep(mreq, VM_OBJECT_MTX(object), PSWP, "swread", hz*20)) {
   1217 printf(
   1218 "swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n",
   1219 bp->b_bufobj, (intmax_t)bp->b_blkno, bp->b_bcount);

How that's interpreted is documented in msleep(9):

 The parameter timo specifies a timeout for the sleep.  If timo is
 not 0, then the thread will sleep for at most timo / hz seconds.
 If the timeout expires, then the sleep function will return
 EWOULDBLOCK.

2. The message appears to be for swap I/O *reads*, not writes; at least
that's what the swread STATE string (you know, what you see in
top(1)) implies.
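
(To put concrete numbers on point 1: with the default kern.hz of 1000, hz*20
is 20000 ticks, and per the msleep(9) excerpt above that works out to
20000 / 1000 = 20 seconds of waiting before the message is printed.)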

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Karl Denninger
Well now this is interesting.

I have converted a significant number of filesystems to ZFS over the
last week or so and have noted a few things.  A couple of them aren't so
good.

The subject machine in question has 12GB of RAM and dual Xeon
5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
local cache and the BBU for it.  The ZFS spindles are all exported as
JBOD drives.  I set up four disks under GPT, added a single freebsd-zfs
partition to each, labeled them, and the providers are then
geli-encrypted and added to the pool.  When the same disks were running
on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
adapter, exported as a single unit, GPT labeled as a single pack and
then gpart-sliced and newfs'd under UFS+SU.
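
A sketch of that provisioning for one such disk (device names, labels, geli
options and the pool topology are assumptions, not taken from the message):

gpart create -s gpt da1
gpart add -t freebsd-zfs -l zdisk1 da1
geli init -s 4096 /dev/gpt/zdisk1
geli attach /dev/gpt/zdisk1
# ... same for da2-da4, then build the pool from the .eli providers, e.g.:
zpool create tank raidz gpt/zdisk1.eli gpt/zdisk2.eli gpt/zdisk3.eli gpt/zdisk4.eli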

Since I previously ran UFS filesystems on this config, I know what
performance level I achieved with that, and the entire system had been
running flawlessly set up that way for the last couple of years.
Presently the machine is running 9.1-Stable, r244942M.

Immediately after the conversion I set up a second pool to play with
backup strategies to a single drive and ran into a problem.  The disk I
used for that testing is one that previously was in the rotation and is
also known good.  I began to get EXTENDED stalls with zero I/O going on,
some lasting for 30 seconds or so.  The system was not frozen but
anything that touched I/O would lock until it cleared.  Dedup is off,
incidentally.

My first thought was that I had a bad drive, cable or other physical
problem.  However, searching for that proved fruitless -- there was
nothing being logged anywhere -- not in the SMART data, not by the
adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
the +5V and +12V rails didn't disclose anything interesting with the
power in the chassis; it's stable.  Further, swapping the only disk that
had changed (the new backup volume) with a different one didn't change
behavior either.

The last straw was when I was able to reproduce the stalls WITHIN the
original pool against the same four disks that had been running
flawlessly for two years under UFS, and still couldn't find any evidence
of a hardware problem (not even ECC-corrected data returns.)  All the
disks involved are completely clean -- zero sector reassignments, the
drive-specific log is clean, etc.

Attempting to cut back the ARECA adapter's aggressiveness (buffering,
etc) on the theory that I was tickling something in its cache management
algorithm that was pissing it off proved fruitless as well, even when I
shut off ALL caching and NCQ options.  I also set
vfs.zfs.prefetch_disable=1 to no effect.  H...

Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set
vfs.zfs.write_limit_override to 102400 (1GB) and rebooted.

The problem instantly disappeared and I cannot provoke its return even
with multiple full-bore snapshot and rsync filesystem copies running
while a scrub is being done.
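(In /boot/loader.conf form; the byte values below are assumptions matching the
2 GB and 1 GB figures given in parentheses above, since the literal numbers in
the archived message look truncated:)

vfs.zfs.arc_max="2147483648"                # 2 GB ARC cap
vfs.zfs.write_limit_override="1073741824"   # 1 GB per-txg write limit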
I'm ping-ponging between being I/O-limited and processor- (geli-) limited in
normal operation, and slamming the I/O channel during a scrub.  It appears that
performance is roughly equivalent to, maybe a bit less than, what it was with
UFS+SU -- but it's fairly close.

The operating theory I have at the moment is that the ARC cache was in
some way getting into a near-deadlock situation with other memory
demands on the system (there IS a Postgres server running on this
hardware although it's a replication server and not taking queries --
nonetheless it does grab a chunk of RAM) leading to the stalls. 
Limiting its grab of RAM appears to have resolved the contention
issue.  I was unable to catch it actually running out of free memory
although it was consistently into the low five-digit free page count and
the kernel never garfed on the console about resource exhaustion --
other than a bitch about swap stalling (the infamous more than 20
seconds message.)  Page space in use near the time in question (I could
not get a display while locked as it went to I/O and froze) was not
zero, but pretty close to it (a few thousand blocks.)  That the system
was driven into light paging does appear to be significant and
indicative of some sort of memory contention issue as under operation
with UFS filesystems this machine has never been observed to allocate
page space.

Anyone seen anything like this before and if so is this a case of
bad defaults or some bad behavior between various kernel memory
allocation contention sources?

This isn't exactly a resource-constrained machine running x64 code with
12GB of RAM and two quad-core processors in it!

-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to 

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Steven Hartland

What does zfs-stats -a show when you're having the stall issue?

You can also use zpool iostat to show individual disk I/O stats,
which may help identify a single failing disk, e.g.
zpool iostat -v 1

Also have you investigated which of the two sysctls you changed
fixed it or does it require both?

   Regards
   Steve

- Original Message - 
From: Karl Denninger k...@denninger.net

To: freebsd-stable@freebsd.org
Sent: Monday, March 04, 2013 10:48 PM
Subject: ZFS stalls -- and maybe we should be talking about defaults?


Well now this is interesting.

I have converted a significant number of filesystems to ZFS over the
last week or so and have noted a few things.  A couple of them aren't so
good.

The subject machine in question has 12GB of RAM and dual Xeon
5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
local cache and the BBU for it.  The ZFS spindles are all exported as
JBOD drives.  I set up four disks under GPT, have a single freebsd-zfs
partition added to them, are labeled and the providers are then
geli-encrypted and added to the pool.  When the same disks were running
on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
adapter, exported as a single unit, GPT labeled as a single pack and
then gpart-sliced and newfs'd under UFS+SU.

Since I previously ran UFS filesystems on this config I know what the
performance level I achieved with that, and the entire system had been
running flawlessly set up that way for the last couple of years.
Presently the machine is running 9.1-Stable, r244942M

Immediately after the conversion I set up a second pool to play with
backup strategies to a single drive and ran into a problem.  The disk I
used for that testing is one that previously was in the rotation and is
also known good.  I began to get EXTENDED stalls with zero I/O going on,
some lasting for 30 seconds or so.  The system was not frozen but
anything that touched I/O would lock until it cleared.  Dedup is off,
incidentally.

My first thought was that I had a bad drive, cable or other physical
problem.  However, searching for that proved fruitless -- there was
nothing being logged anywhere -- not in the SMART data, not by the
adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
the +5V and +12V rails didn't disclose anything interesting with the
power in the chassis; it's stable.  Further, swapping the only disk that
had changed (the new backup volume) with a different one didn't change
behavior either.

The last straw was when I was able to reproduce the stalls WITHIN the
original pool against the same four disks that had been running
flawlessly for two years under UFS, and still couldn't find any evidence
of a hardware problem (not even ECC-corrected data returns.)  All the
disks involved are completely clean -- zero sector reassignments, the
drive-specific log is clean, etc.

Attempting to cut back the ARECA adapter's aggressiveness (buffering,
etc) on the theory that I was tickling something in its cache management
algorithm that was pissing it off proved fruitless as well, even when I
shut off ALL caching and NCQ options.  I also set
vfs.zfs.prefetch_disable=1 to no effect.  H...

Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set
vfs.zfs.write_limit_override to 102400 (1GB) and rebooted.

The problem instantly disappeared and I cannot provoke its return even
with multiple full-bore snapshot and rsync filesystem copies running
while a scrub is being done.
I'm pinging between being I/O and processor (geli) limited now in normal
operation and slamming the I/O channel during a scrub.  It appears that
performance is roughly equivalent, maybe a bit less, than it was with
UFS+SU -- but it's fairly close.

The operating theory I have at the moment is that the ARC cache was in
some way getting into a near-deadlock situation with other memory
demands on the system (there IS a Postgres server running on this
hardware although it's a replication server and not taking queries --
nonetheless it does grab a chunk of RAM) leading to the stalls.
Limiting its grab of RAM appears to have to resolved the contention
issue.  I was unable to catch it actually running out of free memory
although it was consistently into the low five-digit free page count and
the kernel never garfed on the console about resource exhaustion --
other than a bitch about swap stalling (the infamous more than 20
seconds message.)  Page space in use near the time in question (I could
not get a display while locked as it went to I/O and froze) was not
zero, but pretty close to it (a few thousand blocks.)  That the system
was driven into light paging does appear to be significant and
indicative of some sort of memory contention issue as under operation
with UFS filesystems this machine has never been observed to allocate
page space.

Anyone seen anything like this before and if so

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Dennis Glatting
I get stalls with 256GB of RAM with arc_max=64G (my limit is usually 25%)
on a 64-core system with 20 new 3TB Seagate disks under LSI2008 chips,
without much load. Interestingly, pbzip2 consistently created a problem
on a volume whereas gzip does not.

Here, stalls happen across several systems; however, I have had fewer
problems under 8.3 than 9.1. If I go to hardware RAID5 (LSI2008 -- same
chips: IR vs IT) I don't have a problem.




On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
 Well now this is interesting.
 
 I have converted a significant number of filesystems to ZFS over the
 last week or so and have noted a few things.  A couple of them aren't so
 good.
 
 The subject machine in question has 12GB of RAM and dual Xeon
 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
 local cache and the BBU for it.  The ZFS spindles are all exported as
 JBOD drives.  I set up four disks under GPT, have a single freebsd-zfs
 partition added to them, are labeled and the providers are then
 geli-encrypted and added to the pool.  When the same disks were running
 on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
 adapter, exported as a single unit, GPT labeled as a single pack and
 then gpart-sliced and newfs'd under UFS+SU.
 
 Since I previously ran UFS filesystems on this config I know what the
 performance level I achieved with that, and the entire system had been
 running flawlessly set up that way for the last couple of years. 
 Presently the machine is running 9.1-Stable, r244942M
 
 Immediately after the conversion I set up a second pool to play with
 backup strategies to a single drive and ran into a problem.  The disk I
 used for that testing is one that previously was in the rotation and is
 also known good.  I began to get EXTENDED stalls with zero I/O going on,
 some lasting for 30 seconds or so.  The system was not frozen but
 anything that touched I/O would lock until it cleared.  Dedup is off,
 incidentally.
 
 My first thought was that I had a bad drive, cable or other physical
 problem.  However, searching for that proved fruitless -- there was
 nothing being logged anywhere -- not in the SMART data, not by the
 adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
 the +5V and +12V rails didn't disclose anything interesting with the
 power in the chassis; it's stable.  Further, swapping the only disk that
 had changed (the new backup volume) with a different one didn't change
 behavior either.
 
 The last straw was when I was able to reproduce the stalls WITHIN the
 original pool against the same four disks that had been running
 flawlessly for two years under UFS, and still couldn't find any evidence
 of a hardware problem (not even ECC-corrected data returns.)  All the
 disks involved are completely clean -- zero sector reassignments, the
 drive-specific log is clean, etc.
 
 Attempting to cut back the ARECA adapter's aggressiveness (buffering,
 etc) on the theory that I was tickling something in its cache management
 algorithm that was pissing it off proved fruitless as well, even when I
 shut off ALL caching and NCQ options.  I also set
 vfs.zfs.prefetch_disable=1 to no effect.  H...
 
 Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
 lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set
 vfs.zfs.write_limit_override to 102400 (1GB) and rebooted.

 The problem instantly disappeared and I cannot provoke its return even
 with multiple full-bore snapshot and rsync filesystem copies running
 while a scrub is being done.
 I'm pinging between being I/O and processor (geli) limited now in normal
 operation and slamming the I/O channel during a scrub.  It appears that
 performance is roughly equivalent, maybe a bit less, than it was with
 UFS+SU -- but it's fairly close.
 
 The operating theory I have at the moment is that the ARC cache was in
 some way getting into a near-deadlock situation with other memory
 demands on the system (there IS a Postgres server running on this
 hardware although it's a replication server and not taking queries --
 nonetheless it does grab a chunk of RAM) leading to the stalls. 
 Limiting its grab of RAM appears to have to resolved the contention
 issue.  I was unable to catch it actually running out of free memory
 although it was consistently into the low five-digit free page count and
 the kernel never garfed on the console about resource exhaustion --
 other than a bitch about swap stalling (the infamous more than 20
 seconds message.)  Page space in use near the time in question (I could
 not get a display while locked as it went to I/O and froze) was not
 zero, but pretty close to it (a few thousand blocks.)  That the system
 was driven into light paging does appear to be significant and
 indicative of some sort of memory contention issue as under operation
 with UFS filesystems this machine has never been observed to allocate
 page space.
 
 

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Karl Denninger
On 3/4/2013 6:33 PM, Steven Hartland wrote:
 What does zfs-stats -a show when your having the stall issue?

 You can also use zfs iostats to show individual disk iostats
 which may help identify a single failing disk e.g.
 zpool iostat -v 1

 Also have you investigated which of the two sysctls you changed
 fixed it or does it require both?

Regards
Steve

 - Original Message - From: Karl Denninger k...@denninger.net
 To: freebsd-stable@freebsd.org
 Sent: Monday, March 04, 2013 10:48 PM
 Subject: ZFS stalls -- and maybe we should be talking about defaults?


 Well now this is interesting.

 I have converted a significant number of filesystems to ZFS over the
 last week or so and have noted a few things.  A couple of them aren't so
 good.

 The subject machine in question has 12GB of RAM and dual Xeon
 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
 local cache and the BBU for it.  The ZFS spindles are all exported as
 JBOD drives.  I set up four disks under GPT, have a single freebsd-zfs
 partition added to them, are labeled and the providers are then
 geli-encrypted and added to the pool.  When the same disks were running
 on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
 adapter, exported as a single unit, GPT labeled as a single pack and
 then gpart-sliced and newfs'd under UFS+SU.

 Since I previously ran UFS filesystems on this config I know what the
 performance level I achieved with that, and the entire system had been
 running flawlessly set up that way for the last couple of years.
 Presently the machine is running 9.1-Stable, r244942M

 Immediately after the conversion I set up a second pool to play with
 backup strategies to a single drive and ran into a problem.  The disk I
 used for that testing is one that previously was in the rotation and is
 also known good.  I began to get EXTENDED stalls with zero I/O going on,
 some lasting for 30 seconds or so.  The system was not frozen but
 anything that touched I/O would lock until it cleared.  Dedup is off,
 incidentally.

 My first thought was that I had a bad drive, cable or other physical
 problem.  However, searching for that proved fruitless -- there was
 nothing being logged anywhere -- not in the SMART data, not by the
 adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
 the +5V and +12V rails didn't disclose anything interesting with the
 power in the chassis; it's stable.  Further, swapping the only disk that
 had changed (the new backup volume) with a different one didn't change
 behavior either.

 The last straw was when I was able to reproduce the stalls WITHIN the
 original pool against the same four disks that had been running
 flawlessly for two years under UFS, and still couldn't find any evidence
 of a hardware problem (not even ECC-corrected data returns.)  All the
 disks involved are completely clean -- zero sector reassignments, the
 drive-specific log is clean, etc.

 Attempting to cut back the ARECA adapter's aggressiveness (buffering,
 etc) on the theory that I was tickling something in its cache management
 algorithm that was pissing it off proved fruitless as well, even when I
 shut off ALL caching and NCQ options.  I also set
 vfs.zfs.prefetch_disable=1 to no effect.  H...

 Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
 lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set
 vfs.zfs.write_limit_override to 102400 (1GB) and rebooted.

 The problem instantly disappeared and I cannot provoke its return even
 with multiple full-bore snapshot and rsync filesystem copies running
 while a scrub is being done.
 I'm pinging between being I/O and processor (geli) limited now in normal
 operation and slamming the I/O channel during a scrub.  It appears that
 performance is roughly equivalent, maybe a bit less, than it was with
 UFS+SU -- but it's fairly close.

 The operating theory I have at the moment is that the ARC cache was in
 some way getting into a near-deadlock situation with other memory
 demands on the system (there IS a Postgres server running on this
 hardware although it's a replication server and not taking queries --
 nonetheless it does grab a chunk of RAM) leading to the stalls.
 Limiting its grab of RAM appears to have to resolved the contention
 issue.  I was unable to catch it actually running out of free memory
 although it was consistently into the low five-digit free page count and
 the kernel never garfed on the console about resource exhaustion --
 other than a bitch about swap stalling (the infamous more than 20
 seconds message.)  Page space in use near the time in question (I could
 not get a display while locked as it went to I/O and froze) was not
 zero, but pretty close to it (a few thousand blocks.)  That the system
 was driven into light paging does appear to be significant and
 indicative of some sort of memory contention issue as under operation
 with UFS

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Karl Denninger
Stick this in /boot/loader.conf and see if your lockups go away:

vfs.zfs.write_limit_override=102400

I've got a sentinel running that watches for zero-bandwidth zpool
iostat 5 samples; it has been running for close to 12 hours now, and with the
two tunables I changed the stalls don't appear to be happening any more.
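
A minimal sketch of such a sentinel (pool name, interval and logging are
assumptions):

#!/bin/sh
# Every 5 s take a 1-second zpool iostat sample; the second data line from
# 'zpool iostat tank 1 2' is the interval sample (the first is the
# since-boot average).  Flag samples where read and write bandwidth are 0.
while :; do
    bw=$(zpool iostat tank 1 2 | tail -1 | awk '{ print $6, $7 }')
    [ "$bw" = "0 0" ] && echo "$(date): zero read/write bandwidth on tank"
    sleep 5
done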

This system always has small-ball write I/Os going to it, as it's a
postgresql hot-standby mirror backing a VERY active system and is
receiving streaming logdata from the primary at a colocation site, so
the odds of it ever experiencing an actual zero for I/O (unless there's
a connectivity problem) are pretty remote.

If it turns out that the write_limit_override tunable is the one
responsible for stopping the hangs I can drop the ARC limit tunable
although I'm not sure I want to; I don't see much if any performance
penalty from leaving it where it is and if the larger cache isn't
helping anything then why use it?  I'm inclined to stick an SSD in the
cabinet as a cache drive instead of dedicating RAM to this -- even
though it's not AS fast as RAM it's still MASSIVELY quicker than getting
data off a rotating plate of rust.

Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?

On 3/4/2013 8:07 PM, Dennis Glatting wrote:
 I get stalls with 256GB of RAM with arc_max=64G (my limit is usually 25%
 ) on a 64 core system with 20 new 3TB Seagate disks under LSI2008 chips
 without much load. Interestingly pbzip2 consistently created a problem
 on a volume whereas gzip does not.

 Here, stalls happen across several systems however I have had less
 problems under 8.3 than 9.1. If I go to hardware RAID5 (LSI2008 -- same
 chips: IR vs IT) I don't have a problem.




 On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
 Well now this is interesting.

 I have converted a significant number of filesystems to ZFS over the
 last week or so and have noted a few things.  A couple of them aren't so
 good.

 The subject machine in question has 12GB of RAM and dual Xeon
 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
 local cache and the BBU for it.  The ZFS spindles are all exported as
 JBOD drives.  I set up four disks under GPT, have a single freebsd-zfs
 partition added to them, are labeled and the providers are then
 geli-encrypted and added to the pool.  When the same disks were running
 on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
 adapter, exported as a single unit, GPT labeled as a single pack and
 then gpart-sliced and newfs'd under UFS+SU.

 Since I previously ran UFS filesystems on this config I know what the
 performance level I achieved with that, and the entire system had been
 running flawlessly set up that way for the last couple of years. 
 Presently the machine is running 9.1-Stable, r244942M

 Immediately after the conversion I set up a second pool to play with
 backup strategies to a single drive and ran into a problem.  The disk I
 used for that testing is one that previously was in the rotation and is
 also known good.  I began to get EXTENDED stalls with zero I/O going on,
 some lasting for 30 seconds or so.  The system was not frozen but
 anything that touched I/O would lock until it cleared.  Dedup is off,
 incidentally.

 My first thought was that I had a bad drive, cable or other physical
 problem.  However, searching for that proved fruitless -- there was
 nothing being logged anywhere -- not in the SMART data, not by the
 adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
 the +5V and +12V rails didn't disclose anything interesting with the
 power in the chassis; it's stable.  Further, swapping the only disk that
 had changed (the new backup volume) with a different one didn't change
 behavior either.

 The last straw was when I was able to reproduce the stalls WITHIN the
 original pool against the same four disks that had been running
 flawlessly for two years under UFS, and still couldn't find any evidence
 of a hardware problem (not even ECC-corrected data returns.)  All the
 disks involved are completely clean -- zero sector reassignments, the
 drive-specific log is clean, etc.

 Attempting to cut back the ARECA adapter's aggressiveness (buffering,
 etc) on the theory that I was tickling something in its cache management
 algorithm that was pissing it off proved fruitless as well, even when I
 shut off ALL caching and NCQ options.  I also set
 vfs.zfs.prefetch_disable=1 to no effect.  H...

 Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
 lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set
 vfs.zfs.write_limit_override to 102400 (1GB) and rebooted.

 The problem instantly disappeared and I cannot provoke its return even
 with multiple full-bore snapshot and rsync filesystem copies running
 while a scrub is being done.
 I'm pinging between being I/O and processor (geli) limited now in normal
 operation and slamming the I/O channel 

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Steven Hartland
- Original Message - 
From: Karl Denninger k...@denninger.net



Stick this in /boot/loader.conf and see if your lockups goes away:

vfs.zfs.write_limit_override=102400

...


If it turns out that the write_limit_override tunable is the one
responsible for stopping the hangs I can drop the ARC limit tunable
although I'm not sure I want to; I don't see much if any performance
penalty from leaving it where it is and if the larger cache isn't
helping anything then why use it?  I'm inclined to stick an SSD in the
cabinet as a cache drive instead of dedicating RAM to this -- even
though it's not AS fast as RAM it's still MASSIVELY quicker than getting
data off a rotating plate of rust.


Interesting you should say that -- I've seen a stall recently on a ZFS-only
box running on a 6 x SSD RAIDZ2.

The stall was caused by a fairly large mysql import, with nothing else
running.

When it happened I thought the machine had wedged, but minutes (not
seconds) later, everything sprang into action again.


Am I correct that a ZFS filesystem does NOT use the VM buffer cache
at all?


Correct
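
(One way to watch the two caches side by side -- the ARC on one hand, the
traditional buffer cache that UFS uses on the other; sysctl names as on
stable/9, though they may vary slightly by version:)

sysctl kstat.zfs.misc.arcstats.size vfs.bufspace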

   Regards
   Steve



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Karl Denninger

On 3/4/2013 9:25 PM, Steven Hartland wrote:
 - Original Message - From: Karl Denninger k...@denninger.net

 Stick this in /boot/loader.conf and see if your lockups goes away:

 vfs.zfs.write_limit_override=102400
 ...

 If it turns out that the write_limit_override tunable is the one
 responsible for stopping the hangs I can drop the ARC limit tunable
 although I'm not sure I want to; I don't see much if any performance
 penalty from leaving it where it is and if the larger cache isn't
 helping anything then why use it?  I'm inclined to stick an SSD in the
 cabinet as a cache drive instead of dedicating RAM to this -- even
 though it's not AS fast as RAM it's still MASSIVELY quicker than getting
 data off a rotating plate of rust.

 Now interesting you should say that I've seen a stall recently on ZFS
 only box running on 6 x SSD RAIDZ2.

 The stall was caused by fairly large mysql import, with nothing else
 running.

 Then it happened I thought the machine had wedged, but minutes (not
 seconds) later, everything sprung into action again.

That's exactly what I can reproduce here; the stalls are anywhere from a
few seconds to well north of a half-minute.  It looks like the machine
is hung -- but it is not.

The machine in question normally runs with zero swap allocated, but it
always has 1.5GB of shared memory allocated to Postgres (shared_buffers
= 1500MB in its config file).

I wonder if the ARC cache management code is misbehaving when shared
segments are in use?

-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Dennis Glatting
On Mon, 2013-03-04 at 20:58 -0600, Karl Denninger wrote:
 Stick this in /boot/loader.conf and see if your lockups goes away:
 
 vfs.zfs.write_limit_override=102400
 

K.


 I've got a sentinal running that watches for zero-bandwidth zpool
 iostat 5s that has been running for close to 12 hours now and with the
 two tunables I changed it doesn't appear to be happening any more.
 

I've also done this, as well as top and systat -vmstat. Disk I/O stops
but the system lives on through top, systat, and the network. However, if I
try to log in, the login won't complete.

All of my systems are hardware RAID1 for the OS (LSI and Areca) and
typically a separate disk for swap. All other disks are ZFS.

 This system always has small-ball write I/Os going to it as it's a
 postgresql hot standby mirror backing a VERY active system and is
 receiving streaming logdata from the primary at a colocation site, so
 the odds of it ever experiencing an actual zero for I/O (unless there's
 a connectivity problem) is pretty remote.
 

I am doing multi TB sorts and GB database loads.


 If it turns out that the write_limit_override tunable is the one
 responsible for stopping the hangs I can drop the ARC limit tunable
 although I'm not sure I want to; I don't see much if any performance
 penalty from leaving it where it is and if the larger cache isn't
 helping anything then why use it?  I'm inclined to stick an SSD in the
 cabinet as a cache drive instead of dedicating RAM to this -- even
 though it's not AS fast as RAM it's still MASSIVELY quicker than getting
 data off a rotating plate of rust.
 

I forgot to mention that my three 8.3 systems occasionally
offline a disk (one or two a week, total). I simply online the disk and
after the resilver all is well. There are ~40 disks across those three
systems. Of my 9.1 systems, three are busy but with a smaller number of
disks (about eight across two volumes, RAIDz2 and mirror).

I also have a ZFS-on-Linux (CentOS) system for play (about 12 disks). It
did not exhibit problems when it was in use but it did teach me a lesson
on the evils of dedup. :)


 Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?
 

Dunno.


 On 3/4/2013 8:07 PM, Dennis Glatting wrote:
  I get stalls with 256GB of RAM with arc_max=64G (my limit is usually 25%
  ) on a 64 core system with 20 new 3TB Seagate disks under LSI2008 chips
  without much load. Interestingly pbzip2 consistently created a problem
  on a volume whereas gzip does not.
 
  Here, stalls happen across several systems however I have had less
  problems under 8.3 than 9.1. If I go to hardware RAID5 (LSI2008 -- same
  chips: IR vs IT) I don't have a problem.
 
 
 
 
  On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
  Well now this is interesting.
 
  I have converted a significant number of filesystems to ZFS over the
  last week or so and have noted a few things.  A couple of them aren't so
  good.
 
  The subject machine in question has 12GB of RAM and dual Xeon
  5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
  local cache and the BBU for it.  The ZFS spindles are all exported as
  JBOD drives.  I set up four disks under GPT, have a single freebsd-zfs
  partition added to them, are labeled and the providers are then
  geli-encrypted and added to the pool.  When the same disks were running
  on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
  adapter, exported as a single unit, GPT labeled as a single pack and
  then gpart-sliced and newfs'd under UFS+SU.
 
  Since I previously ran UFS filesystems on this config I know what the
  performance level I achieved with that, and the entire system had been
  running flawlessly set up that way for the last couple of years. 
  Presently the machine is running 9.1-Stable, r244942M
 
  Immediately after the conversion I set up a second pool to play with
  backup strategies to a single drive and ran into a problem.  The disk I
  used for that testing is one that previously was in the rotation and is
  also known good.  I began to get EXTENDED stalls with zero I/O going on,
  some lasting for 30 seconds or so.  The system was not frozen but
  anything that touched I/O would lock until it cleared.  Dedup is off,
  incidentally.
 
  My first thought was that I had a bad drive, cable or other physical
  problem.  However, searching for that proved fruitless -- there was
  nothing being logged anywhere -- not in the SMART data, not by the
  adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
  the +5V and +12V rails didn't disclose anything interesting with the
  power in the chassis; it's stable.  Further, swapping the only disk that
  had changed (the new backup volume) with a different one didn't change
  behavior either.
 
  The last straw was when I was able to reproduce the stalls WITHIN the
  original pool against the same four disks that had been running
  flawlessly for two years under UFS, and still 

Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Dennis Glatting
On Tue, 2013-03-05 at 03:25 +, Steven Hartland wrote:
 - Original Message - 
 From: Karl Denninger k...@denninger.net
 
  Stick this in /boot/loader.conf and see if your lockups goes away:
 
  vfs.zfs.write_limit_override=102400
 ...
 
  If it turns out that the write_limit_override tunable is the one
  responsible for stopping the hangs I can drop the ARC limit tunable
  although I'm not sure I want to; I don't see much if any performance
  penalty from leaving it where it is and if the larger cache isn't
  helping anything then why use it?  I'm inclined to stick an SSD in the
  cabinet as a cache drive instead of dedicating RAM to this -- even
  though it's not AS fast as RAM it's still MASSIVELY quicker than getting
  data off a rotating plate of rust.
 
 Now interesting you should say that I've seen a stall recently on ZFS
 only box running on 6 x SSD RAIDZ2.
 
 The stall was caused by fairly large mysql import, with nothing else
 running.
 
 Then it happened I thought the machine had wedged, but minutes (not
 seconds) later, everything sprung into action again.
 

I've seen this too.


  Am I correct that a ZFS filesystem does NOT use the VM buffer cache
  at all?
 
 Correct
 
 Regards
 Steve
 
 
 
 ___
 freebsd-stable@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-stable
 To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org

-- 
Dennis Glatting d...@pki2.com

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Steven Hartland
- Original Message - 
From: Karl Denninger k...@denninger.net

Then it happened I thought the machine had wedged, but minutes (not
seconds) later, everything sprung into action again.


That's exactly what I can reproduce here; the stalls are anywhere from a
few seconds to well north of a half-minute.  It looks like the machine
is hung -- but it is not.


Out of interest, when this happens for you, is syncer using lots of CPU?

If it's anything like my stalls, you'll need top loaded prior to the fact.

   Regards
   Steve



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Karl Denninger

On 3/4/2013 10:01 PM, Steven Hartland wrote:
 - Original Message - From: Karl Denninger k...@denninger.net
 Then it happened I thought the machine had wedged, but minutes (not
 seconds) later, everything sprung into action again.

 That's exactly what I can reproduce here; the stalls are anywhere from a
 few seconds to well north of a half-minute.  It looks like the machine
 is hung -- but it is not.

 Out of interest when this happens for you is syncer using lots of CPU?

 If its anything like my stalls you'll need top loaded prior to the fact.

Regards
Steve
Don't know.  But the CPU is getting hammered when it happens because I
am geli-encrypting all my drives and as a consequence it is not at all
uncommon for the load average to be north of 10 when the system is under
heavy I/O load.  System response is fine right up until it stalls.

I'm going to put some effort into trying to isolate exactly what is
going on here in the coming days since I happen to have a spare box in
an identical configuration that I can afford to lock up without
impacting anyone doing real work :-)

-- 
-- Karl Denninger
/The Market Ticker ®/ http://market-ticker.org
Cuda Systems LLC
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Ben Morrow
Quoth Karl Denninger k...@denninger.net:
 
 Note that the machine is not booting from ZFS -- it is booting from and
 has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
 like a single da0 drive to the OS) and that drive stalls as well when
 it freezes.  It's definitely a kernel thing when it happens as the OS
 would otherwise not have locked (just I/O to the user partitions) -- but
 it does. 

Is it still the case that mixing UFS and ZFS can cause problems, or were
they all fixed? I remember a while ago (before the ARC usage monitoring
code was added) there were a number of reports of serious problems
running an rsync from UFS to ZFS.

If you can, it might be worth trying your scratch machine booting from
ZFS. Probably the best way is to leave your swap partition where it is
(IMHO it's not worth trying to swap onto a zvol) and convert the UFS
partition into a separate zpool to boot from. You will also need to
replace the boot blocks; assuming you're using GPT you can do this with
gpart bootcode -p /boot/gptzfsboot -i <gpt boot partition>.
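
For example, if the freebsd-boot partition is index 1 on da0 (index and device
assumed here), that would be:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

Adding -b /boot/pmbr also refreshes the protective MBR boot code at the same
time.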

Ben

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Jeremy Chadwick
On Tue, Mar 05, 2013 at 05:05:47AM +, Ben Morrow wrote:
 Quoth Karl Denninger k...@denninger.net:
  
  Note that the machine is not booting from ZFS -- it is booting from and
  has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
  like a single da0 drive to the OS) and that drive stalls as well when
  it freezes.  It's definitely a kernel thing when it happens as the OS
  would otherwise not have locked (just I/O to the user partitions) -- but
  it does. 
 
 Is it still the case that mixing UFS and ZFS can cause problems, or were
 they all fixed? I remember a while ago (before the arc usage monitoring
 code was added) there were a number of reports of serious probles
 running an rsync from UFS to ZFS.

This problem still exists on stable/9.  The behaviour manifests itself
as fairly bad performance (I cannot remember if stalling or if just
throughput rates were awful).  I can only speculate as to what the root
cause is, but my guess is that it has something to do with the two
caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.

The advice I've given people in the past is: if you do a LOT of I/O
between UFS and ZFS on the same box, it's time to move to 100% ZFS.
That said, I still do not recommend ZFS for a root filesystem (it still
bites people even today), and swap-on-ZFS is a huge no-no.

I will note that I myself use pure UFS+SU (not SUJ) for my main OS
installation (that means /, swap, /var, /tmp, and /usr) on a dedicated
SSD, while everything else is ZFS raidz1 (no dedup, no compression;
won't ever enable these until that thread priority problem is fixed on
FreeBSD).

However, when I was migrating from gmirror+UFS+SU to ZFS, I witnessed
what I described in my 1st and 2nd paragraphs.  What userland utilities
were used (rsync vs. cp) made no difference; the problem is in the
kernel.

Footnote about this thread:

This thread contains all sorts of random pieces of information about
systems, with very little actual detail in them (barring the symptoms,
which are always useful to know!).

For example, just because your machine has 8 cores and 12GB of RAM
doesn't mean jack squat if some software in the kernel is designed
oddly.  Reworded: throwing more hardware at a problem solves nothing.

The most useful thing (for me) that I found was deep within the thread,
a few words along the lines of "De-dup isn't used".  What about
compression, and if it's *ever* been enabled on the filesystem (even
if not presently enabled)?  It matters.  All this matters.

I see lots of end-users talking about these problems, but (barring
Steven) literally no kernel people who are in the know about ZFS
mentioning how said users can get them (devs) info that can help track
this down.  Those devs live on freebsd-fs@ and freebsd-hackers@, and not
too many read freebsd-stable@.

Step back for a moment and look at this anti-KISS configuration:

- Hardware RAID controller involved (Areca 1680ix)
- Hardware RAID controller has its own battery-backed cache (2GB)
- Therefore arcmsr(4) is involved -- revision of driver/OS build
  matters here, ditto with firmware version
- 4 disks are involved, models unknown
- Disks are GPT and are *partitioned, and ZFS refers to the partitions
  not the raw disk -- this matters (honest, it really does; the ZFS
  code handles things differently with raw disks)
- Providers are GELI-encrypted

Now ask yourself if any dev is really going to tackle this one given the
above mess.

My advice would be to get rid of the hardware RAID (go with Intel ICHxx
or ESBx on-board with AHCI), use raw disks for ZFS (if 4096-byte sector
disks use the gnop(8) method, which is a one-time thing), and get rid of
GELI.  If you can reproduce the problem there 100% of the time, awesome,
it's a clean/clear setup for someone to help investigate.
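
A sketch of the gnop(8) step mentioned (pool layout and device names assumed);
the .nop provider only needs to exist while the pool is created, so the vdev is
built 4K-aligned (ashift=12), after which it can be removed:

gnop create -S 4096 /dev/ada1
zpool create tank raidz /dev/ada1.nop /dev/ada2 /dev/ada3 /dev/ada4
zpool export tank
gnop destroy /dev/ada1.nop
zpool import tank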

-- 
| Jeremy Chadwick   j...@koitsu.org |
| UNIX Systems Administratorhttp://jdc.koitsu.org/ |
| Mountain View, CA, US|
| Making life hard for others since 1977. PGP 4BD6C0CB |
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: ZFS stalls -- and maybe we should be talking about defaults?

2013-03-04 Thread Garrett Wollman
In article 8c68812328e3483ba9786ef155911...@multiplay.co.uk,
kill...@multiplay.co.uk writes:

Now interesting you should say that I've seen a stall recently on ZFS
only box running on 6 x SSD RAIDZ2.

The stall was caused by fairly large mysql import, with nothing else
running.

Then it happened I thought the machine had wedged, but minutes (not
seconds) later, everything sprung into action again.

I have certainly seen what you might describe as stalls, caused, so
far as I can tell, by kernel memory starvation.  I've seen it take as
much as half an hour to recover from these (which is too long for my
users).  Right now I have the ARC limited to 64 GB (on a 96 GB file
server) and that has made it more stable, but it's still not behaving
quite as I would like, and I'm looking to put more memory into the
system (to be used for non-ARC functions).  Looking at my munin
graphs, I find that backups in particular put very heavy pressure on,
doubling the UMA allocations over steady-state, and this takes about
four or five hours to climb back down.  See
http://people.freebsd.org/~wollman/vmstat_z-day.png for an example.
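
(A quick command-line complement to the munin graph; the zone names to grep for
are a guess at the usual ZFS-related UMA zones:)

vmstat -z | egrep 'ITEM|arc|zio|dnode|dmu'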

Some of the stalls are undoubtedly caused by internal fragmentation
rather than actual data in use.  (Solaris used to have this issue, and
some hooks were added to allow some amount of garbage collection with
the cooperation of the filesystem.)

-GAWollman

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org