Re: ZFS stalls -- and maybe we should be talking about defaults?
On Mar 5, 2013, at 11:09 PM, Jeremy Chadwick wrote:

>>> - Disks are GPT and are *partitioned*, and ZFS refers to the partitions,
>>> not the raw disk -- this matters (honest, it really does; the ZFS code
>>> handles things differently with raw disks)
>>
>> Not on FreeBSD, as far as I can see.
>
> My statement comes from here (first line in particular):
>
> http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html
>
> If this is wrong/false, then this furthers my point about kernel folks who
> are in-the-know needing to chime in and help stop the misinformation. The
> rest of us are just end-users, often misinformed.

As far as I know, this is lore that surfaces periodically on the lists. It was true in Solaris (at least in the past). But unless I'm terribly wrong, this doesn't happen in FreeBSD. ZFS sees disks, and they can be a whole raw device or a partition/slice, even a gnop device. No difference.

That's why I mentioned in freebsd-fs that we badly need an official doctrine, carefully curated, and written in holy letters ;)

Borja.
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/7/2013 1:21 AM, Peter Jeremy wrote:
> On 2013-Mar-04 16:48:18 -0600, Karl Denninger k...@denninger.net wrote:
>> The subject machine in question has 12GB of RAM and dual Xeon 5500-series
>> processors. It also has an ARECA 1680ix in it with 2GB of local cache and
>> the BBU for it. The ZFS spindles are all exported as JBOD drives. I set
>> up four disks under GPT, each with a single freebsd-zfs partition added
>> to it; they are labeled, and the providers are then geli-encrypted and
>> added to the pool.
>
> What sort of disks? SAS or SATA?

SATA. They're clean; they report no errors, no retries, no corrected data (ECC), etc. They also have been running for a couple of years under UFS+SU without problems. This isn't new hardware; it's an in-service system.

>> ...also known good. I began to get EXTENDED stalls with zero I/O going
>> on, some lasting for 30 seconds or so. The system was not frozen, but
>> anything that touched I/O would lock until it cleared. Dedup is off,
>> incidentally.
>
> When the system has stalled:
> - Do you see very low free memory?

Yes. Effectively zero.

> - What happens to all the different CPU utilisation figures? Do they all
>   go to zero? Do you get high system or interrupt CPU (including going to
>   1 core's worth)?

No, they start to fall. This is a bad piece of data to trust, though, because I am geli-encrypting the spindles, so falling CPU doesn't mean the CPU is actually idle (since with no I/O there is nothing going through geli). I'm working on instrumenting things sufficiently to try to peel that off -- I suspect the kernel is spinning on something, but the trick is finding out what it is.

> - What happens to interrupt load? Do you see any disk controller
>   interrupts?

None.

> Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
> and see if you get any errors when stalls happen?

If I have to. That's easy to do on the test box -- on the production one, not so much.

> On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick j...@koitsu.org wrote:
>> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>>> Completely unrelated to the main thread:
>>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>>> That said, I still do not recommend ZFS for a root filesystem
>>> Why?
>> Too long a history of problems with it and weird edge cases (keep
>> reading); the last thing an administrator wants to deal with is a system
>> where the root filesystem won't mount/can't be used. It makes recovery
>> or problem-solving (i.e. the server is not physically accessible given
>> geographic distances) very difficult.
> I've had lots of problems with a gmirrored UFS root as well. The biggest
> issue is that gmirror has no audit functionality, so you can't verify
> that both sides of a mirror really do have the same data.

I have root on a 2-drive RAID mirror (done in the controller) and that has been fine. The controller does scrubs on a regular basis internally. The problem is that if it gets a clean read that is different (e.g. no ECC indications, etc.) it doesn't know which is the correct copy. The good news is that hasn't happened yet :-)

The risk of this happening as my data store continues to expand is one of the reasons I want to move toward ZFS, but not necessarily for the boot drives. For the data store, however...

>> My point/opinion: UFS for a root filesystem is guaranteed to work without
>> any fiddling about and, barring drive failures or controller issues, is
>> (again, my opinion) a lot more risk-free than ZFS-on-root.
> AFAIK, you can't boot from anything other than a single disk (i.e. no
> graid).

Where I am right now is this:
1. I *CANNOT* reproduce the spins on the test machine with Postgres stopped, in any way. Even with multiple ZFS send/recv copies going on and the load average north of 20 (due to all the geli threads), the system doesn't stall or produce any notable pauses in throughput. Nor does the system RAM allocation get driven hard enough to force paging. This is with NO tuning hacks in /boot/loader.conf. I/O performance is both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the production machine), allocating ~1.5G of shared, wired memory, and running the same synthetic workload as in (1) above, I am getting SMALL versions of the misbehavior. However, while system RAM allocation gets driven pretty hard and reaches down toward 100MB in some instances, it doesn't get driven hard enough to allocate swap. The burstiness is very evident in the iostat figures, with spates getting into the single-digit MB/sec range from time to time, but it's not enough to drive the system to a full-on stall.

There's pretty clearly a bad interaction here between Postgres wiring memory and the ARC when the latter is left alone and allowed to do what it wants. I'm continuing to work on replicating this on the test machine... just not completely there yet.

--
Karl Denninger
/The Market Ticker ®/
http://market-ticker.org
Cuda Systems LLC
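For reference, the WITNESS kernel Peter asked about is built from a custom kernel config; a minimal sketch, assuming an amd64 box with sources in /usr/src (the config name is illustrative):

    # /usr/src/sys/amd64/conf/WITNESSDBG
    include GENERIC
    ident   WITNESSDBG
    options WITNESS           # lock-order verification
    options WITNESS_SKIPSPIN  # skip spin mutexes (cuts the overhead down)

    # build, install, reboot:
    # cd /usr/src && make buildkernel KERNCONF=WITNESSDBG && \
    #     make installkernel KERNCONF=WITNESSDBG && shutdown -r now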
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Karl Denninger k...@denninger.net

> Where I am right now is this:
>
> 1. I *CANNOT* reproduce the spins on the test machine with Postgres
> stopped, in any way. [snip]
>
> 2. WITH Postgres running as a connected hot spare (identical to the
> production machine), allocating ~1.5G of shared, wired memory, and
> running the same synthetic workload as in (1) above, I am getting SMALL
> versions of the misbehavior. [snip]
>
> There's pretty clearly a bad interaction here between Postgres wiring
> memory and the ARC when the latter is left alone and allowed to do what
> it wants. I'm continuing to work on replicating this on the test
> machine... just not completely there yet.

Another possibility to consider is how Postgres uses the FS. For example, does it request sync I/O in ways not present in the system without it, which is causing the FS -- and possibly the underlying disk system -- to behave differently?

One other option to test, just to rule it out: what happens if you use the BSD scheduler instead of ULE?

Regards
Steve
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/7/2013 12:57 PM, Steven Hartland wrote:
> Another possibility to consider is how Postgres uses the FS. For example,
> does it request sync I/O in ways not present in the system without it,
> which is causing the FS -- and possibly the underlying disk system -- to
> behave differently?

That's possible, but not terribly likely in this particular instance. The reason is that I ran into this with the Postgres data store on a UFS volume BEFORE I converted it. Now it's on the ZFS pool (with recordsize=8k, as recommended for that filesystem), but when I first ran into this it was on a separate UFS filesystem (which is where it had resided for 2+ years without incident), so unless Postgres's filesystem use on a UFS volume would give ZFS fits, it's unlikely to be involved.

> One other option to test, just to rule it out: what happens if you use
> the BSD scheduler instead of ULE?

I will test that, but first I have to get the test machine to reliably stall so I know I'm not chasing my tail.

--
Karl Denninger
/The Market Ticker ®/
http://market-ticker.org
Cuda Systems LLC
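The recordsize=8k property Karl mentions only applies to blocks written after it is set, so it has to be in place before the database files are copied in; a minimal sketch (pool and dataset names are illustrative):

    # match the ZFS record size to the 8kB Postgres page size
    zfs create -o recordsize=8k tank/pgsql
    zfs get recordsize tank/pgsql    # verify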
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Karl Denninger k...@denninger.net
To: freebsd-stable@freebsd.org
Sent: Thursday, March 07, 2013 7:07 PM
Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?

> That's possible, but not terribly likely in this particular instance.
> The reason is that I ran into this with the Postgres data store on a UFS
> volume BEFORE I converted it. [snip] ...so unless Postgres's filesystem
> use on a UFS volume would give ZFS fits, it's unlikely to be involved.

I hate to say it, but that sounds very familiar to something we experienced with a machine here which was running high numbers of rrd updates. Again, we had the issue on UFS and saw the same thing when we moved to ZFS. I'll leave that there so as not to derail the investigation with what could be totally irrelevant info, but it may prove an interesting data point later.

There are obvious common low-level points between UFS and ZFS which may be the cause. One area which springs to mind is device bio ordering and barriers, which could well be impacted by sync I/O requests independent of the FS in use.

>> One other option to test, just to rule it out: what happens if you use
>> the BSD scheduler instead of ULE?
>
> I will test that, but first I have to get the test machine to reliably
> stall so I know I'm not chasing my tail.

Very sensible. Assuming you can reproduce it, one thing that might be interesting to try is to eliminate all sync I/O. I'm not sure if there are options in Postgres to do this via configuration or if it would require editing the code, but this could reduce the problem space.
If disabling sync I/O eliminated the problem, it would go a long way to proving it isn't the I/O volume or pattern per se, but instead related to the sync nature of said I/O.

Regards
Steve
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/7/2013 1:27 PM, Steven Hartland wrote:
> [snip]
>
> Very sensible. Assuming you can reproduce it, one thing that might be
> interesting to try is to eliminate all sync I/O. I'm not sure if there
> are options in Postgres to do this via configuration or if it would
> require editing the code, but this could reduce the problem space.
> If disabling sync I/O eliminated the problem, it would go a long way to
> proving it isn't the I/O volume or pattern per se, but instead related
> to the sync nature of said I/O.

That can be turned off in the Postgres configuration. For obvious reasons it's a very bad idea, but it can be disabled without actually changing the code itself. I don't know if it shuts off ALL sync requests, but the documentation says it does.

It's interesting that you ran into this with RRD going; the machine in question does pull RRD data for Cacti, but it's such a small piece of the total load profile that I considered it immaterial. It might not be.

--
Karl Denninger
/The Market Ticker ®/
http://market-ticker.org
Cuda Systems LLC
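The knobs Karl is referring to live in postgresql.conf; assuming a Postgres of that era, the relevant settings are roughly the following (both weaken crash safety and are for testing only):

    # postgresql.conf
    synchronous_commit = off  # don't wait for the WAL flush at commit
    fsync = off               # don't issue fsync()/fdatasync() at all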
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Karl Denninger k...@denninger.net

> [snip]
>
> It's interesting that you ran into this with RRD going; the machine in
> question does pull RRD data for Cacti, but it's such a small piece of
> the total load profile that I considered it immaterial. It might not be.

We never did get to the bottom of it, but we did come up with a fix. Instead of using straight RRD interaction we switched all our code to use rrdcached and put the files on an SSD-based pool; we've never had an issue since.

Regards
Steve
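For anyone curious, rrdcached works by batching the many small synchronous RRD updates and flushing them in bulk; a minimal sketch of the sort of invocation involved (paths and timings are illustrative, not what Multiplay actually ran):

    # queue updates for up to 5 minutes before flushing to the SSD pool
    rrdcached -l unix:/var/run/rrdcached.sock \
        -w 300 -z 120 \
        -j /ssd/rrd-journal \
        -b /ssd/rrd -B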
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 06.03.13 02:42, Steven Hartland wrote:
> - Original Message - From: Daniel Kalchev
>> On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:
>>> I say that knowing lots of people use ZFS-on-root, which is great -- I
>>> just wonder how many of them have tested all the crazy scenarios and
>>> then tried to boot from things.
>>
>> I have verified that ZFS-on-root works reliably in all of the following
>> scenarios: single disk, one mirror vdev, many mirror vdevs, raidz.
>> Haven't found the time to test many raidz vdevs, I admit. :)
>
> One thing to watch out for is the available BIOS boot disks. If you try
> to do a large RAIDZ with lots of disks as your root pool, you're likely
> to run into problems -- not because of any ZFS issue, but simply because
> the disks the BIOS sees, and hence tries to boot from, may not be what
> you expect.

A prudent system administrator should understand this issue and verify that whatever (boot) architecture they come up with is supported by their particular hardware and firmware. This is no different for ZFS than for any other case. The 2nd-stage ZFS boot loader in FreeBSD could in fact end up with its own drive-detection code one day, which would eliminate its dependence on the BIOS entirely.

For relatively small systems, where the administrator might be careless enough to not consider all scenarios, today's BIOSes already provide support for enough devices (e.g. most motherboards provide 4-6 SATA ports, etc.).

Using separate boot pools of just a few devices is what I do for large storage boxes too -- mostly because I want to be able to fiddle with the data disks without worrying that it might impact the OS. Just make sure the BIOS does see these in the drive list it creates. That is, don't put the boot disks in the last positions in your chassis :) -- use the on-board SATA ports that are scanned first. Sadly, almost every vendor provides for such drives to be placed inside the chassis, which makes it very inconvenient if one of the drives dies.

Daniel
Re: ZFS stalls -- and maybe we should be talking about defaults?
Karl Denninger wrote this message on Tue, Mar 05, 2013 at 06:56 -0600:
> When it happens on my system anything that is CPU-bound continues to
> execute. I can switch consoles and network I/O also works. If I have an
> iostat running at the time, all I/O counters go to and remain at zero
> while the stall is occurring, but the process that is producing the
> iostat continues to run and emit characters, whether it is a ssh session
> or on the physical console.
>
> The CPUs are running and processing, but all threads block if they
> attempt access to the disk I/O subsystem, irrespective of the portion of
> the disk I/O subsystem they attempt to access (e.g. UFS, swap or ZFS).
> I therefore cannot start any new process that requires image activation.

Since it seems like there is a thread that is spinning...

Has anyone thought to modify kgdb to mlockall its memory and run it against the current system (kgdb /boot/kernel/kernel /dev/mem), and then, when the thread goes busy, use kgdb to see where it's spinning?

Just a thought...

--
John-Mark Gurney    Voice: +1 415 225 5579
  "All that I will do, has been done, All that I have, has not."
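The basic invocation John-Mark describes works today (the mlockall change is his proposed modification, so kgdb itself can't be stalled by paging); a minimal sketch of such a session (the thread number is illustrative):

    kgdb /boot/kernel/kernel /dev/mem
    (kgdb) info threads     # list kernel threads; find the busy one
    (kgdb) thread 142       # switch to the suspect thread
    (kgdb) bt               # backtrace: where is it spinning?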
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 2013-Mar-04 16:48:18 -0600, Karl Denninger k...@denninger.net wrote:
> The subject machine in question has 12GB of RAM and dual Xeon 5500-series
> processors. It also has an ARECA 1680ix in it with 2GB of local cache and
> the BBU for it. The ZFS spindles are all exported as JBOD drives. I set
> up four disks under GPT, each with a single freebsd-zfs partition added
> to it; they are labeled, and the providers are then geli-encrypted and
> added to the pool.

What sort of disks? SAS or SATA?

> ...also known good. I began to get EXTENDED stalls with zero I/O going
> on, some lasting for 30 seconds or so. The system was not frozen, but
> anything that touched I/O would lock until it cleared. Dedup is off,
> incidentally.

When the system has stalled:
- Do you see very low free memory?
- What happens to all the different CPU utilisation figures? Do they all go to zero? Do you get high system or interrupt CPU (including going to 1 core's worth)?
- What happens to interrupt load? Do you see any disk controller interrupts?

Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN) and see if you get any errors when stalls happen?

On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick j...@koitsu.org wrote:
> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>> Completely unrelated to the main thread:
>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>> That said, I still do not recommend ZFS for a root filesystem
>> Why?
> Too long a history of problems with it and weird edge cases (keep
> reading); the last thing an administrator wants to deal with is a system
> where the root filesystem won't mount/can't be used. It makes recovery
> or problem-solving (i.e. the server is not physically accessible given
> geographic distances) very difficult.

I've had lots of problems with a gmirrored UFS root as well. The biggest issue is that gmirror has no audit functionality, so you can't verify that both sides of a mirror really do have the same data.

> My point/opinion: UFS for a root filesystem is guaranteed to work without
> any fiddling about and, barring drive failures or controller issues, is
> (again, my opinion) a lot more risk-free than ZFS-on-root.

AFAIK, you can't boot from anything other than a single disk (i.e. no graid).

--
Peter Jeremy
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Jeremy Chadwick j...@koitsu.org
To: Ben Morrow b...@morrow.me.uk
Cc: freebsd-stable@freebsd.org
Sent: Tuesday, March 05, 2013 5:32 AM
Subject: Re: ZFS stalls -- and maybe we should be talking about defaults?

> On Tue, Mar 05, 2013 at 05:05:47AM +0000, Ben Morrow wrote:
>> Quoth Karl Denninger k...@denninger.net:
>>> Note that the machine is not booting from ZFS -- it is booting from,
>>> and has its swap on, a UFS 2-drive mirror (handled by the disk adapter;
>>> looks like a single da0 drive to the OS), and that drive stalls as well
>>> when it freezes. It's definitely a kernel thing when it happens, as the
>>> OS would otherwise not have locked (just I/O to the user partitions) --
>>> but it does.
>>
>> Is it still the case that mixing UFS and ZFS can cause problems, or were
>> they all fixed? I remember a while ago (before the ARC usage monitoring
>> code was added) there were a number of reports of serious problems
>> running an rsync from UFS to ZFS.
>
> This problem still exists on stable/9. The behaviour manifests itself as
> fairly bad performance (I cannot remember if stalling or if just
> throughput rates were awful). I can only speculate as to what the root
> cause is, but my guess is that it has something to do with the two
> caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.

In our case we have no UFS, so this isn't the cause of the stalls. The spec here is:

* 64GB RAM
* LSI 2008
* 8.3-RELEASE
* Pure ZFS
* Trigger: MySQL doing a DB import, nothing else running.
* 4K disk alignment

Regards
Steve
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 09:12:47AM -0000, Steven Hartland wrote:
> [snip]
>
> In our case we have no UFS, so this isn't the cause of the stalls. The
> spec here is:
>
> * 64GB RAM
> * LSI 2008
> * 8.3-RELEASE
> * Pure ZFS
> * Trigger: MySQL doing a DB import, nothing else running.
> * 4K disk alignment

1. Is compression enabled? Has it ever been enabled (on any fs) in the past (barring the pool being destroyed + recreated)?

2. Is dedup enabled? Has it ever been enabled (on any fs) in the past (barring the pool being destroyed + recreated)?

I can speculate day and night about what could cause this kind of issue, honestly. The possibilities are quite literally infinite, and all of them require folks deeply familiar with both FreeBSD's ZFS as well as very key/major parts of the kernel (ranging from the VM to interrupt handlers to the I/O subsystem). (This next comment isn't for you, Steve, you already know this :-) ) The way different pieces of the kernel interact with one another is fairly complex; the kernel is not simple.

Things I think might prove useful:

* Describing the stall symptoms; what all does it impact? Can you switch VTYs on the console when it's happening? Network I/O (e.g. SSH'd into the same box and just holding down a letter) showing stalls then catching up? Things of this nature.
* How long the stall is in duration (e.g. if there's some way to roughly calculate this using date in a shell script).
* Contents of /etc/sysctl.conf and /boot/loader.conf (re: tweaking of the system).
* sysctl -a | grep zfs before and after a stall -- do not bother with those ARC summary scripts please, at least not for this.
* vmstat -z before and after a stall
* vmstat -m before and after a stall
* vmstat -s before and after a stall
* vmstat -i before, after, AND during a stall

Basically, every person who experiences this problem needs to treat every situation uniquely -- no "me too" -- and try to find reliable 100% test cases for it. That's the only way bugs of this nature (i.e. of a complex nature) get fixed.

--
Jeremy Chadwick j...@koitsu.org
UNIX Systems Administrator    http://jdc.koitsu.org/
Mountain View, CA, US
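Jeremy's before/after checklist is easy to script; a minimal sketch, assuming the stock /bin/sh (run it once before and once after a stall, then diff the directories):

    #!/bin/sh
    # snapshot the counters Jeremy asks for into a timestamped directory
    dir=/var/tmp/stall-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dir"
    sysctl -a | grep zfs > "$dir/sysctl-zfs.txt"
    vmstat -z > "$dir/vmstat-z.txt"
    vmstat -m > "$dir/vmstat-m.txt"
    vmstat -s > "$dir/vmstat-s.txt"
    vmstat -i > "$dir/vmstat-i.txt"
    echo "saved to $dir"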
Re: ZFS stalls -- and maybe we should be talking about defaults?
Completely unrelated to the main thread:

on 05/03/2013 07:32 Jeremy Chadwick said the following:
> That said, I still do not recommend ZFS for a root filesystem

Why?

> (this biting people still happens even today)

What exactly?

> - Disks are GPT and are *partitioned*, and ZFS refers to the partitions,
> not the raw disk -- this matters (honest, it really does; the ZFS code
> handles things differently with raw disks)

Not on FreeBSD, as far as I can see.

P.S. I completely agree with your suggestions on simplifying the setup and gathering objective information for the purpose of debugging the issue. I also completely agree that "me too"-ing is not very useful (and often completely incorrect) for complex problems like this one. Thank you.

--
Andriy Gapon
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/5/2013 3:27 AM, Jeremy Chadwick wrote:
> [snip]
>
> Things I think might prove useful:
>
> * Describing the stall symptoms; what all does it impact? Can you switch
>   VTYs on the console when it's happening? Network I/O (e.g. SSH'd into
>   the same box and just holding down a letter) showing stalls then
>   catching up? Things of this nature.

When it happens on my system, anything that is CPU-bound continues to execute. I can switch consoles, and network I/O also works. If I have an iostat running at the time, all I/O counters go to and remain at zero while the stall is occurring, but the process that is producing the iostat continues to run and emit characters, whether it is a ssh session or on the physical console.

The CPUs are running and processing, but all threads block if they attempt access to the disk I/O subsystem, irrespective of the portion of the disk I/O subsystem they attempt to access (e.g. UFS, swap or ZFS). I therefore cannot start any new process that requires image activation.

> * How long the stall is in duration (e.g. if there's some way to roughly
>   calculate this using date in a shell script).

They're variable. Some last fractions of a second and are not really all that noticeable unless you happen to be paying CLOSE attention. Some last a few (5 or so) seconds.
The really bad ones last long enough that the kernel throws the message "swap_pager: indefinite wait buffer". The machine in the general sense never pages. It contains 12GB of RAM but historically (prior to ZFS being put into service) always showed 0 for a pstat -s, although it does have a 20g raw swap partition (to /dev/da0s1b, not to a zpool) allocated.

During the stalls I cannot run a pstat (I tried; it stalls), but when it unlocks I find that there is swap allocated, albeit not a ridiculous amount; ~20,000 pages or so have made it to the swap partition. This is not behavior that I had seen before on this machine prior to the stall problem, and with the two tuning tweaks discussed here I'm now up to 48 hours without any allocation to swap (or any stalls).

> * Contents of /etc/sysctl.conf and /boot/loader.conf (re: tweaking of
>   the system).

/boot/loader.conf:

    kern.ipc.semmni=256
    kern.ipc.semmns=512
    kern.ipc.semmnu=256
    geom_eli_load="YES"
    sound_load="YES"
    #
    # Limit to physical CPU count for threads
    #
    kern.geom.eli.threads=8
    #
    # ZFS prefetch does help, although you'd think it would not due to the
    # adapter doing it already. Wrong guess; it's good for 2x the performance.
    # We limit the ARC to 2GB of RAM and the TXG write limit to 1GB.
    #
    #vfs.zfs.prefetch_disable=1
    vfs.zfs.arc_max=20
    vfs.zfs.write_limit_override=102400
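Whether loader tunables like these actually took effect can be confirmed after boot; a quick sketch:

    sysctl vfs.zfs.arc_max vfs.zfs.write_limit_override kern.geom.eli.threads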
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 12:40:38AM -0500, Garrett Wollman wrote:
> In article 8c68812328e3483ba9786ef155911...@multiplay.co.uk,
> kill...@multiplay.co.uk writes:
>> Now interesting you should say that: I've seen a stall recently on a
>> ZFS-only box running on 6 x SSD RAIDZ2. The stall was caused by a fairly
>> large mysql import, with nothing else running. When it happened I
>> thought the machine had wedged, but minutes (not seconds) later,
>> everything sprung into action again.
>
> I have certainly seen what you might describe as stalls, caused, so far
> as I can tell, by kernel memory starvation. I've seen it take as much as
> half an hour to recover from these (which is too long for my users).
> Right now I have the ARC limited to 64 GB (on a 96 GB file server) and
> that has made it more stable, but it's still not behaving quite as I
> would like, and I'm looking to put more memory into the system (to be
> used for non-ARC functions).
>
> Looking at my munin graphs, I find that backups in particular put very
> heavy pressure on, doubling the UMA allocations over steady-state, and
> this takes about four or five hours to climb back down. See
> http://people.freebsd.org/~wollman/vmstat_z-day.png for an example. Some
> of the stalls are undoubtedly caused by internal fragmentation rather
> than actual data in use. (Solaris used to have this issue, and some
> hooks were added to allow some amount of garbage collection with the
> cooperation of the filesystem.)

Just as a note: there was a page I read in the past few months that pointed out that having a huge ARC may not always be in the best interests of the system. Some operation on the filesystem (I forget what, apologies) caused the system to churn through the ARC and discard most of it, while regular I/O was blocked. Unfortunately I cannot remember where I found that page now, and I don't appear to have bookmarked it.

From what has been said in this thread I'm not convinced that people are hitting this issue; however, I would like to raise it for consideration.

Regards,
Gary
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 5, 2013 at 7:22 AM, Gary Palmer gpal...@freebsd.org wrote:
> Just as a note: there was a page I read in the past few months that
> pointed out that having a huge ARC may not always be in the best
> interests of the system. Some operation on the filesystem (I forget
> what, apologies) caused the system to churn through the ARC and discard
> most of it, while regular I/O was blocked.

Huh. What timing. I've been fighting with our largest ZFS box (128 GB of RAM, 16 CPU cores, 2x SSD for SLOG, 2x SSD for L2ARC, 45x 2 TB HD for the pool in 6-drive raidz2 vdevs) for the past week, trying to figure out why ZFS send/recv just hangs after a while. Everything is stuck in D in ps ax output, and top shows the l2arc_feed_thread using 100% of one CPU. Even removing the L2ARC devices from the pool doesn't help; it just delays the hang.

The ARC was configured for 120 GB, with arc_meta_limit set to 90 GB. Yes, dedup and compression are enabled (it's a backups storage box, and we get over 5x combined dedup/compress ratio). After several hours of running, the ARC and wired would get up to 100+ GB, and the box would spend most of its time spinning, with almost 0 I/O to the pool (only a few KB/s of reads in zpool iostat 1 or gstat). ZFS send/recv would eventually complete, but what used to take 15-20 minutes would take 6-8 hours.

I've reduced the ARC to only 32 GB, with arc_meta set to 28 GB, and things are running much smoother now (50-200 MB/s writes for 3-5 seconds every 10s), and send/recv is back down to 10-15 minutes.

Who would have thought too much RAM would be an issue? Will play with this over the next couple of days with different ARC max settings to see where the problems start. All of our ZFS boxes until this one had under 64 GB of RAM. (And we had issues with dedup enabled on boxes with too little RAM, as in under 32 GB.)

--
Freddie Cash
fjwc...@gmail.com
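The caps Freddie describes are loader tunables on FreeBSD; a minimal sketch using his reduced figures (the size-suffix form is the usual convention):

    # /boot/loader.conf
    vfs.zfs.arc_max="32G"
    vfs.zfs.arc_meta_limit="28G"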
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
> Completely unrelated to the main thread:
> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>> That said, I still do not recommend ZFS for a root filesystem
> Why?

Too long a history of problems with it and weird edge cases (keep reading); the last thing an administrator wants to deal with is a system where the root filesystem won't mount/can't be used. It makes recovery or problem-solving (i.e. the server is not physically accessible given geographic distances) very difficult.

Are there still issues booting from raidzX or stripes or root pools with multiple vdevs? What about with cache or log devices?

My point/opinion: UFS for a root filesystem is guaranteed to work without any fiddling about and, barring drive failures or controller issues, is (again, my opinion) a lot more risk-free than ZFS-on-root. I say that knowing lots of people use ZFS-on-root, which is great -- I just wonder how many of them have tested all the crazy scenarios and then tried to boot from things.

>> (this biting people still happens even today)
> What exactly?

http://lists.freebsd.org/pipermail/freebsd-questions/2013-February/249363.html
http://lists.freebsd.org/pipermail/freebsd-questions/2013-February/249387.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072398.html

The last one got solved:

http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072406.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072408.html

I know factually you're aware of the zpool.cache ordeal (which may or may not be the cause of the issue shown in the 2nd URL above), but my point is that still, at this moment in time -- barring someone using a stable/9 ISO for installation -- there still seem to be issues. Things on the mailing lists which go unanswered/never provide closure of this nature are numerous, and that just adds to my concern.

>> - Disks are GPT and are *partitioned*, and ZFS refers to the partitions,
>> not the raw disk -- this matters (honest, it really does; the ZFS code
>> handles things differently with raw disks)
> Not on FreeBSD, as far as I can see.

My statement comes from here (first line in particular):

http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html

If this is wrong/false, then this furthers my point about kernel folks who are in-the-know needing to chime in and help stop the misinformation. The rest of us are just end-users, often misinformed.

--
Jeremy Chadwick j...@koitsu.org
UNIX Systems Administrator    http://jdc.koitsu.org/
Mountain View, CA, US
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 5, 2013 at 2:09 PM, Jeremy Chadwick j...@koitsu.org wrote:
> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>>> - Disks are GPT and are *partitioned*, and ZFS refers to the
>>> partitions, not the raw disk -- this matters (honest, it really does;
>>> the ZFS code handles things differently with raw disks)
>> Not on FreeBSD, as far as I can see.
>
> My statement comes from here (first line in particular):
> http://lists.freebsd.org/pipermail/freebsd-questions/2013-January/248697.html
>
> If this is wrong/false, then this furthers my point about kernel folks
> who are in-the-know needing to chime in and help stop the
> misinformation. The rest of us are just end-users, often misinformed.

This has been false from the very first import of ZFS into FreeBSD 7-STABLE. Pawel even mentions, somewhere around that time frame, that GEOM allows the use of the cache on partitions with ZFS. Considering he did the initial import of ZFS into FreeBSD, I don't think you can find a more canonical answer. :)

This is one of the biggest differences between Solaris-based ZFS and FreeBSD-based ZFS. It's too bad this misinformation has basically become a meme. :(

--
Freddie Cash
fjwc...@gmail.com
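Since a GPT partition is just another GEOM provider on FreeBSD, building a pool on labeled partitions is straightforward; a minimal sketch (device and label names are illustrative):

    gpart create -s gpt da1
    gpart add -t freebsd-zfs -a 4k -l zdisk1 da1   # 4k-aligned partition
    # ...same for da2 with label zdisk2, then:
    zpool create tank mirror gpt/zdisk1 gpt/zdisk2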
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 02:18:30PM -0800, Freddie Cash wrote:
> This has been false from the very first import of ZFS into FreeBSD
> 7-STABLE. Pawel even mentions, somewhere around that time frame, that
> GEOM allows the use of the cache on partitions with ZFS. Considering he
> did the initial import of ZFS into FreeBSD, I don't think you can find a
> more canonical answer. :)
>
> This is one of the biggest differences between Solaris-based ZFS and
> FreeBSD-based ZFS.

This is good (excellent) information to know -- thank you for clearing that up.

> It's too bad this misinformation has basically become a meme. :(

Such is the case with FreeBSD's ZFS in general, solely because the number of people who can answer the deep technical questions is so small.

--
Jeremy Chadwick j...@koitsu.org
UNIX Systems Administrator    http://jdc.koitsu.org/
Mountain View, CA, US
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:
> I say that knowing lots of people use ZFS-on-root, which is great -- I
> just wonder how many of them have tested all the crazy scenarios and
> then tried to boot from things.

I have verified that ZFS-on-root works reliably in all of the following scenarios: single disk, one mirror vdev, many mirror vdevs, raidz. Haven't found the time to test many raidz vdevs, I admit. :)

Combined with boot environments (that can be served many different ways), ZFS on root is nothing short of a miracle. ZFS on FreeBSD has some issues, mostly with huge installations and defaults/tuning, but not really with ZFS-on-root.

Of course, if, for example, you follow stable, you should be prepared with alternative boot media that supports the current zpool/zfs versions. But this is a small cost to pay.

Daniel
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Mar 5, 2013, at 8:17 PM, Freddie Cash fjwc...@gmail.com wrote:
> ZFS send/recv would eventually complete, but what used to take 15-20
> minutes would take 6-8 hours to complete.
>
> I've reduced the ARC to only 32 GB, with arc_meta set to 28 GB, and
> things are running much smoother now (50-200 MB/s writes for 3-5 seconds
> every 10s), and send/recv is back down to 10-15 minutes.
>
> Who would have thought too much RAM would be an issue? Will play with
> this over the next couple of days with different ARC max settings to see
> where the problems start. All of our ZFS boxes until this one had under
> 64 GB of RAM. (And we had issues with dedup enabled on boxes with too
> little RAM, as in under 32 GB.)

I have an archive box running a very similar setup to yours, but with 72GB of RAM. I have set both arc_max and arc_meta_limit to 64GB, with no issues. I am still doing a very complex snapshot reordering between two pools. One of the pools has dedup enabled (which prompted me to add RAM), with a dedup ratio of over 10x, and there are still no issues or any stalling. The other pool has both dedup and compression for some filesystems.

My only issue is that replacing a drive in either pool takes a few days (6-drive vdevs of 3TB drives). Perhaps the memory indexing/search algorithms are inefficient?

Daniel
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Daniel Kalchev

>> On Mar 6, 2013, at 12:09 AM, Jeremy Chadwick j...@koitsu.org wrote:
>>> I say that knowing lots of people use ZFS-on-root, which is great -- I
>>> just wonder how many of them have tested all the crazy scenarios and
>>> then tried to boot from things.
>>
>> I have verified that ZFS-on-root works reliably in all of the following
>> scenarios: single disk, one mirror vdev, many mirror vdevs, raidz.
>> Haven't found the time to test many raidz vdevs, I admit. :)

One thing to watch out for is the available BIOS boot disks. If you try to do a large RAIDZ with lots of disks as your root pool, you're likely to run into problems -- not because of any ZFS issue, but simply because the disks the BIOS sees, and hence tries to boot from, may not be what you expect.

It won't necessarily hit you when you first install, either; add more disks at a later date to a multi-controller LSI 2008 machine and you can end up not being able to specify the correct set of disks in the BIOS. Yes, learned that one the hard way :(

For larger storage boxes we've taken to using two SSDs, partitioned and used as the boot pool and ZIL; as neither requires a massive amount of space, they are a nice fit together.

> Combined with boot environments (that can be served many different
> ways), ZFS on root is nothing short of a miracle. ZFS on FreeBSD has
> some issues, mostly with huge installations and defaults/tuning, but not
> really with ZFS-on-root.
>
> Of course, if, for example, you follow stable, you should be prepared
> with alternative boot media that supports the current zpool/zfs
> versions. But this is a small cost to pay.

For anyone looking to do a ZFS-only install, I would definitely recommend they look at:

http://mfsbsd.vx.sk/

This little gem + a custom script for our environment, and it takes a few minutes from boot to installed machine. It's also our go-to rescue disk; forget messing around with the standard ISOs and their rescue option, which never worked for me when I needed it. This is a fully working OS with all the tools you'll want when things go wrong, and if something is missing it's easy to compile and build your own version.

Regards
Steve
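The layout Steven describes -- a mirrored boot pool and a mirrored ZIL sharing a pair of SSDs -- is just partitioning; a minimal sketch for one SSD of the pair (sizes and labels are illustrative):

    gpart create -s gpt ada0
    gpart add -t freebsd-boot -s 512k ada0
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
    gpart add -t freebsd-zfs -s 30g -l boot0 ada0   # root-pool half
    gpart add -t freebsd-zfs -s 8g -l log0 ada0     # ZIL half
    # after partitioning the second SSD the same way (boot1/log1):
    zpool create zroot mirror gpt/boot0 gpt/boot1
    zpool add tank log mirror gpt/log0 gpt/log1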
Re: ZFS stalls -- and maybe we should be talking about defaults?
Quoth Steven Hartland kill...@multiplay.co.uk:
> One thing to watch out for is the available BIOS boot disks. If you try
> to do a large RAIDZ with lots of disks as your root pool, you're likely
> to run into problems -- not because of any ZFS issue, but simply because
> the disks the BIOS sees, and hence tries to boot from, may not be what
> you expect.

IIRC the Sun documentation recommends keeping the root pool separate from the data pools in any case.

Ben
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 06:56:02AM -0600, Karl Denninger wrote:

{ I've snipped lots of text. For those who are reading this follow-up }
{ and wish to read the snipped portions, please see this URL: }
{ http://lists.freebsd.org/pipermail/freebsd-stable/2013-March/072696.html }

>> 1. Is compression enabled? Has it ever been enabled (on any fs) in the
>> past (barring the pool being destroyed + recreated)?
>>
>> 2. Is dedup enabled? Has it ever been enabled (on any fs) in the past
>> (barring the pool being destroyed + recreated)?

No answers to questions #1 and #2? (Edit: see below; I believe it's implied neither is used.)

>> * Describing the stall symptoms; what all does it impact? Can you
>>   switch VTYs on the console when it's happening? Network I/O (e.g.
>>   SSH'd into the same box and just holding down a letter) showing
>>   stalls then catching up? Things of this nature.
>
> When it happens on my system anything that is CPU-bound continues to
> execute. I can switch consoles and network I/O also works.

Okay, it sounds like compression and dedup aren't in use/have never been used. The stalling problem with compression and dedup (e.g. if you use either of these features, and it worsens if you use both) results in a full/hard system stall where *everything* is impacted, and has been explained in the past (the 2nd URL has the explanation):

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012718.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html

> If I have an iostat running at the time, all I/O counters go to and
> remain at zero while the stall is occurring, but the process that is
> producing the iostat continues to run and emit characters, whether it is
> a ssh session or on the physical console.

What kind of an iostat? iostat(8) or "zpool iostat"? (Edit: the last paragraph of this response says "zpool iostat", which is not the same thing as iostat.)

Why not gstat(8), e.g. gstat -I500ms, as well? This provides the I/O statistics at a deeper layer, not the ZFS layer. Do the numbers actually change **while the system is stalling**? The answer matters greatly, because it would help indicate if some kernel API requests for I/O statistics are also blocking, or if only *actual I/O (e.g. read() and write()) requests* are blocking.

> The CPUs are running and processing, but all threads block if they
> attempt access to the disk I/O subsystem, irrespective of the portion of
> the disk I/O subsystem they attempt to access (e.g. UFS, swap or ZFS).
> I therefore cannot start any new process that requires image activation.

And now you'll need to provide a full diagram of your disk and controller device tree, along with all partitions, slices, and filesystem types. It's best to draw this in ASCII in a tree-like diagram. It will take you 15-20 minutes to do.

What's even more concerning: this thread is about ZFS, yet you're saying applications block when they attempt to do I/O to a filesystem ***other than ZFS***. There must be some kind of commonality here, i.e. a single controller is driving both the ZFS and UFS disks, or something along those lines. If there isn't, then there is something within the kernel I/O subsystem that is doing this. Like I said: very deep, very knowledgeable kernel folks are the only ones who can fix this.

>> * How long the stall is in duration (e.g. if there's some way to
>>   roughly calculate this using date in a shell script).
>
> They're variable.
Some last fractions of a second and are not really all that noticeable unless you happen to be paying CLOSE attention. Some last a few (5 or so) seconds. The really bad ones last long enough that the kernel throws the "swap_pager: indefinite wait buffer" message.

The "swap_pager: indefinite wait buffer" message indicates that some part of the VM is trying to offload pages of memory to swap via standard I/O write requests, and those writes have not come back within kern.hz*20 seconds. That's a very, very long time.

The machine in the general sense never pages. It contains 12GB of RAM but historically (prior to ZFS being put into service) always showed 0 for a pstat -s, although it does have a 20GB raw swap partition (on /dev/da0s1b, not on a zpool) allocated.

The swap_pager message implies otherwise. It may be that the programs you're using poll at intervals of, say, 1 second, and the swap-out + swap-in occurs very quickly, so you never see it. (Edit: the next quoted paragraph shows that there ARE pages of memory hitting swap, so "never pages" is false.)

I do not know the VM subsystem well enough to know what the criteria are for offloading pages of memory to swap -- but it's obviously happening. It may be due to memory pressure, or it may be due to pages which have not been touched in a long while -- again, I do not know. This is where vmstat -s would be useful. Possibly Alan Cox knows.

During the stalls I cannot run a pstat (I tried; it stalls).
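On the question of roughly measuring stall length with date in a shell script, a probe is easy to sketch (my illustration, not something from the thread; the log path is invented): append a timestamp through the I/O path once a second, then look for gaps afterwards.

    #!/bin/sh
    # the append itself must traverse the I/O path, so gaps between
    # consecutive timestamps approximate stall durations
    while :; do
        date '+%s' >> /var/tmp/stall.log
        sleep 1
    done

Afterwards, something like this prints every gap longer than two seconds:

    awk 'NR > 1 && $1 - prev > 2 { print "stall of ~" ($1 - prev) "s ending at " $1 } { prev = $1 }' /var/tmp/stall.log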
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 09:08:09PM -0800, Jeremy Chadwick wrote:

* How long is the stall in duration? (E.g., is there some way to roughly calculate this using date in a shell script?)

They're variable. Some last fractions of a second and are not really all that noticeable unless you happen to be paying CLOSE attention. Some last a few (5 or so) seconds. The really bad ones last long enough that the kernel throws the "swap_pager: indefinite wait buffer" message.

The "swap_pager: indefinite wait buffer" message indicates that some part of the VM is trying to offload pages of memory to swap via standard I/O write requests, and those writes have not come back within kern.hz*20 seconds. That's a very, very long time.

Two clarification points:

1. The timeout value is passed to msleep(9) and is literally kern.hz*20. Per sys/vm/swap_pager.c:

    1216    if (msleep(mreq, VM_OBJECT_MTX(object), PSWP, "swread", hz * 20)) {
    1217            printf(
    1218    "swap_pager: indefinite wait buffer: bufobj: %p, blkno: %jd, size: %ld\n",
    1219                bp->b_bufobj, (intmax_t)bp->b_blkno, bp->b_bcount);

How that's interpreted is documented in msleep(9): "The parameter timo specifies a timeout for the sleep. If timo is not 0, then the thread will sleep for at most timo / hz seconds. If the timeout expires, then the sleep function will return EWOULDBLOCK."

2. The message appears to be for swap I/O *reads*, not writes; at least that's what the "swread" STATE string (you know, what you see in top(1)) implies.

--
Jeremy Chadwick                                   j...@koitsu.org
UNIX Systems Administrator                 http://jdc.koitsu.org/
Mountain View, CA, US
Making life hard for others since 1977. PGP 4BD6C0CB
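(For concreteness, the arithmetic implied by that msleep(9) excerpt: with the stock kern.hz of 1000, timo = hz*20 = 20000 ticks, and 20000 / 1000 = 20 seconds. So the message fires after a swap read has been outstanding for 20 seconds -- not "kern.hz*20 seconds" as stated above.)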
ZFS stalls -- and maybe we should be talking about defaults?
Well now this is interesting. I have converted a significant number of filesystems to ZFS over the last week or so and have noted a few things. A couple of them aren't so good.

The subject machine in question has 12GB of RAM and dual Xeon 5500-series processors. It also has an ARECA 1680ix in it with 2GB of local cache and the BBU for it. The ZFS spindles are all exported as JBOD drives. I set up four disks under GPT, each with a single freebsd-zfs partition; the partitions are labeled, and the resulting providers are geli-encrypted and added to the pool. When the same disks were running on UFS filesystems they were set up as a 0+1 RAID array under the ARECA adapter, exported as a single unit, GPT-labeled as a single pack, and then gpart-sliced and newfs'd under UFS+SU.

Since I previously ran UFS filesystems on this config I know what performance level I achieved with that, and the entire system had been running flawlessly set up that way for the last couple of years. Presently the machine is running 9.1-STABLE, r244942M.

Immediately after the conversion I set up a second pool to play with backup strategies to a single drive and ran into a problem. The disk I used for that testing is one that previously was in the rotation and is also known good. I began to get EXTENDED stalls with zero I/O going on, some lasting for 30 seconds or so. The system was not frozen, but anything that touched I/O would lock until it cleared. Dedup is off, incidentally.

My first thought was that I had a bad drive, cable or other physical problem. However, searching for that proved fruitless -- there was nothing being logged anywhere: not in the SMART data, not by the adapter, not by the OS. Nothing. Sticking a digital storage scope on the +5V and +12V rails didn't disclose anything interesting with the power in the chassis; it's stable. Further, swapping the only disk that had changed (the new backup volume) with a different one didn't change behavior either.

The last straw was when I was able to reproduce the stalls WITHIN the original pool, against the same four disks that had been running flawlessly for two years under UFS, and still couldn't find any evidence of a hardware problem (not even ECC-corrected data returns). All the disks involved are completely clean -- zero sector reassignments, the drive-specific log is clean, etc.

Attempting to cut back the ARECA adapter's aggressiveness (buffering, etc.) on the theory that I was tickling something in its cache management algorithm that was pissing it off proved fruitless as well, even when I shut off ALL caching and NCQ options. I also set vfs.zfs.prefetch_disable=1 to no effect. Hmmm...

Last night, after reading the ZFS Tuning wiki for FreeBSD, I went on a lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=20), set vfs.zfs.write_limit_override to 102400 (1GB), and rebooted.

*The problem instantly disappeared, and I cannot provoke its return even with multiple full-bore snapshot and rsync filesystem copies running while a scrub is being done.*

I'm pinging between being I/O- and processor- (geli-) limited now in normal operation, and slamming the I/O channel during a scrub. It appears that performance is roughly equivalent, maybe a bit less, than it was with UFS+SU -- but it's fairly close.
The operating theory I have at the moment is that the ARC cache was in some way getting into a near-deadlock situation with other memory demands on the system (there IS a Postgres server running on this hardware, although it's a replication server and not taking queries -- nonetheless it does grab a chunk of RAM), leading to the stalls. Limiting its grab of RAM appears to have resolved the contention issue.

I was unable to catch it actually running out of free memory, although it was consistently into the low five-digit free page count, and the kernel never garfed on the console about resource exhaustion -- other than a bitch about swap stalling (the infamous "more than 20 seconds" message). Page space in use near the time in question (I could not get a display while locked, as it went to I/O and froze) was not zero, but pretty close to it (a few thousand blocks). That the system was driven into light paging does appear to be significant and indicative of some sort of memory contention issue, as under operation with UFS filesystems this machine has never been observed to allocate page space.

Anyone seen anything like this before, and if so, is this a case of bad defaults or some bad behavior between various kernel memory allocation contention sources? This isn't exactly a resource-constrained machine, running x64 code with 12GB of RAM and two quad-core processors in it!

--
Karl Denninger
The Market Ticker
Cuda Systems LLC
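In /boot/loader.conf terms, the change described above looks roughly like the sketch below. Both tunables are real 9.x-era knobs set in bytes, but the exact byte values here are assumptions on my part -- the message above quotes truncated numbers, so only the 2GB/1GB intent is certain:

    # /boot/loader.conf (sketch; byte values assumed from the 2GB/1GB intent)
    vfs.zfs.arc_max="2147483648"               # cap the ARC at 2GB
    vfs.zfs.write_limit_override="1073741824"  # cap per-TXG write size at 1GB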
Re: ZFS stalls -- and maybe we should be talking about defaults?
What does zfs-stats -a show when you're having the stall issue?

You can also use zpool iostat to show per-disk statistics, which may help identify a single failing disk, e.g.: zpool iostat -v 1

Also, have you investigated which of the two sysctls you changed fixed it, or does it require both?

Regards
Steve

- Original Message - From: Karl Denninger k...@denninger.net To: freebsd-stable@freebsd.org Sent: Monday, March 04, 2013 10:48 PM Subject: ZFS stalls -- and maybe we should be talking about defaults?

{ Karl's original post was quoted here in full; snipped as a verbatim }
{ duplicate -- see the message above. }
Re: ZFS stalls -- and maybe we should be talking about defaults?
I get stalls with 256GB of RAM with arc_max=64G (my limit is usually 25%) on a 64-core system with 20 new 3TB Seagate disks under LSI 2008 chips, without much load. Interestingly, pbzip2 consistently created a problem on a volume whereas gzip does not.

Here, stalls happen across several systems; however, I have had fewer problems under 8.3 than 9.1. If I go to hardware RAID5 (LSI 2008 -- same chips, IR vs. IT firmware) I don't have a problem.

On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:

{ Karl's original post was quoted here in full; snipped as a verbatim }
{ duplicate -- see the message above. }
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/4/2013 6:33 PM, Steven Hartland wrote:

{ Steven's questions and Karl's original post were quoted here in full; }
{ snipped as verbatim duplicates -- see the two messages above. }
Re: ZFS stalls -- and maybe we should be talking about defaults?
Stick this in /boot/loader.conf and see if your lockups go away: vfs.zfs.write_limit_override=102400

I've got a sentinel running that watches for zero-bandwidth zpool iostat 5 intervals; it has been running for close to 12 hours now, and with the two tunables I changed the problem doesn't appear to be happening any more. This system always has small-ball write I/Os going to it, as it's a postgresql hot-standby mirror backing a VERY active system and is receiving streaming log data from the primary at a colocation site, so the odds of it ever experiencing an actual zero for I/O (unless there's a connectivity problem) are pretty remote.

If it turns out that the write_limit_override tunable is the one responsible for stopping the hangs, I can drop the ARC limit tunable, although I'm not sure I want to; I don't see much if any performance penalty from leaving it where it is, and if the larger cache isn't helping anything then why use it? I'm inclined to stick an SSD in the cabinet as a cache drive instead of dedicating RAM to this -- even though it's not AS fast as RAM it's still MASSIVELY quicker than getting data off a rotating plate of rust.

Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?

On 3/4/2013 8:07 PM, Dennis Glatting wrote:

{ Dennis's message and Karl's original post were quoted here in full; }
{ snipped as verbatim duplicates -- see above. }
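A zero-bandwidth sentinel of the kind described above can be sketched in a few lines of sh. This is an illustration, not the actual script from the thread; the pool name and log path are invented, and the field layout assumes the plain single-pool output of zpool iostat:

    #!/bin/sh
    # log any 5-second zpool iostat interval reporting zero read and
    # write operations (fields: pool alloc free rops wops rbw wbw);
    # header lines never match the "0"/"0" test, so they fall through
    zpool iostat tank 5 | while read name alloc free rops wops rbw wbw; do
        if [ "$rops" = "0" ] && [ "$wops" = "0" ]; then
            echo "$(date '+%F %T') zero-I/O interval on $name"
        fi
    done >> /var/log/zpool-stall.log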
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Karl Denninger k...@denninger.net

Stick this in /boot/loader.conf and see if your lockups go away: vfs.zfs.write_limit_override=102400 ... I'm inclined to stick an SSD in the cabinet as a cache drive instead of dedicating RAM to this -- even though it's not AS fast as RAM it's still MASSIVELY quicker than getting data off a rotating plate of rust.

Now, interesting you should say that: I've seen a stall recently on a ZFS-only box running on 6 x SSD RAIDZ2. The stall was caused by a fairly large mysql import, with nothing else running. When it happened I thought the machine had wedged, but minutes (not seconds) later, everything sprang into action again.

Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?

Correct

Regards
Steve
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/4/2013 9:25 PM, Steven Hartland wrote:

{ quoted text snipped -- see Steven's message above } ... When it happened I thought the machine had wedged, but minutes (not seconds) later, everything sprang into action again.

That's exactly what I can reproduce here; the stalls are anywhere from a few seconds to well north of a half-minute. It looks like the machine is hung -- but it is not.

The machine in question normally runs with zero swap allocated, but it always has 1.5GB of shared memory allocated to Postgres (shared_buffers = 1500MB in its config file). I wonder if the ARC cache management code is misbehaving when shared segments are in use?

--
Karl Denninger
The Market Ticker
Cuda Systems LLC
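One cheap way to watch for that interaction (my suggestion, not from the thread; both commands are in the base system) is to compare the live ARC counters against the SysV shared segment Postgres allocates:

    # current ARC size and its configured ceiling, in bytes
    sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
    # SysV shared memory segments (shared_buffers shows up here)
    ipcs -m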
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Mon, 2013-03-04 at 20:58 -0600, Karl Denninger wrote:

Stick this in /boot/loader.conf and see if your lockups go away: vfs.zfs.write_limit_override=102400

K.

I've got a sentinel running that watches for zero-bandwidth zpool iostat 5 intervals; it has been running for close to 12 hours now, and with the two tunables I changed the problem doesn't appear to be happening any more.

I've also done this, as well as top and systat -vmstat. Disk I/O stops, but the system lives through top, systat, and the network. However, if I try to log in, the login won't complete. All of my systems are hardware RAID1 for the OS (LSI and Areca) and typically a separate disk for swap. All other disks are ZFS.

This system always has small-ball write I/Os going to it, as it's a postgresql hot-standby mirror backing a VERY active system and is receiving streaming log data from the primary at a colocation site, so the odds of it ever experiencing an actual zero for I/O (unless there's a connectivity problem) are pretty remote.

I am doing multi-TB sorts and GB database loads.

If it turns out that the write_limit_override tunable is the one responsible for stopping the hangs, I can drop the ARC limit tunable ... even though it's not AS fast as RAM it's still MASSIVELY quicker than getting data off a rotating plate of rust.

I forgot to mention that my three 8.3 systems occasionally offline a disk (one or two a week, total). I simply online the disk, and after the resilver all is well. There are ~40 disks across those three systems. Of my 9.1 systems, three are busy but with smaller numbers of disks (about eight across two volumes, RAIDZ2 and mirror). I also have a ZFS-on-Linux (CentOS) system for play (about 12 disks). It did not exhibit problems when it was in use, but it did teach me a lesson on the evils of dedup. :)

Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?

Dunno.

On 3/4/2013 8:07 PM, Dennis Glatting wrote:

{ Dennis's earlier message and Karl's original post were quoted here in }
{ full; snipped as verbatim duplicates -- see above. }
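For the offline/online episodes just described, the recovery is the standard pair of commands (the pool and device names here are placeholders, not from the thread):

    zpool online tank da5   # re-attach the device; the resilver starts
    zpool status tank       # watch the resilver run to completion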
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, 2013-03-05 at 03:25 +0000, Steven Hartland wrote:

{ quoted exchange snipped as a verbatim duplicate -- see above } ... When it happened I thought the machine had wedged, but minutes (not seconds) later, everything sprang into action again.

I've seen this too.

--
Dennis Glatting
d...@pki2.com
Re: ZFS stalls -- and maybe we should be talking about defaults?
- Original Message - From: Karl Denninger k...@denninger.net

When it happened I thought the machine had wedged, but minutes (not seconds) later, everything sprang into action again.

That's exactly what I can reproduce here; the stalls are anywhere from a few seconds to well north of a half-minute. It looks like the machine is hung -- but it is not.

Out of interest, when this happens for you, is syncer using lots of CPU? If it's anything like my stalls, you'll need top loaded prior to the fact.

Regards
Steve
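One way to have it loaded ahead of time (my suggestion, not from the thread): leave top running with system processes shown, so kernel threads such as syncer remain visible when the stall hits:

    top -S -s 1    # -S includes system processes; -s 1 = refresh every second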
Re: ZFS stalls -- and maybe we should be talking about defaults?
On 3/4/2013 10:01 PM, Steven Hartland wrote:

Out of interest, when this happens for you, is syncer using lots of CPU? If it's anything like my stalls, you'll need top loaded prior to the fact.

Don't know, but the CPU is getting hammered when it happens, because I am geli-encrypting all my drives, and as a consequence it is not at all uncommon for the load average to be north of 10 when the system is under heavy I/O load. System response is fine right up until it stalls.

I'm going to put some effort into trying to isolate exactly what is going on here in the coming days, since I happen to have a spare box in an identical configuration that I can afford to lock up without impacting anyone doing real work :-)

--
Karl Denninger
The Market Ticker
Cuda Systems LLC
Re: ZFS stalls -- and maybe we should be talking about defaults?
Quoth Karl Denninger k...@denninger.net:

Note that the machine is not booting from ZFS -- it is booting from, and has its swap on, a UFS 2-drive mirror (handled by the disk adapter; it looks like a single da0 drive to the OS), and that drive stalls as well when it freezes. It's definitely a kernel thing when it happens, as the OS would otherwise not have locked (just I/O to the user partitions) -- but it does.

Is it still the case that mixing UFS and ZFS can cause problems, or were they all fixed? I remember a while ago (before the ARC usage monitoring code was added) there were a number of reports of serious problems running an rsync from UFS to ZFS.

If you can, it might be worth trying your scratch machine booting from ZFS. Probably the best way is to leave your swap partition where it is (IMHO it's not worth trying to swap onto a zvol) and convert the UFS partition into a separate zpool to boot from. You will also need to replace the boot blocks; assuming you're using GPT, you can do this with gpart bootcode -p /boot/gptzfsboot -i <index of the boot partition> <disk>.

Ben
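Concretely, that might look like the line below (the device name and partition index are assumptions for illustration, not from the thread); adding -b /boot/pmbr also refreshes the protective MBR in the same pass:

    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0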
Re: ZFS stalls -- and maybe we should be talking about defaults?
On Tue, Mar 05, 2013 at 05:05:47AM +0000, Ben Morrow wrote:

{ Karl's note about booting from a UFS mirror was quoted here; snipped }
{ as a verbatim duplicate -- see the message above. }

Is it still the case that mixing UFS and ZFS can cause problems, or were they all fixed? I remember a while ago (before the ARC usage monitoring code was added) there were a number of reports of serious problems running an rsync from UFS to ZFS.

This problem still exists on stable/9. The behaviour manifests itself as fairly bad performance (I cannot remember whether it was stalling or just awful throughput rates). I can only speculate as to the root cause, but my guess is that it has something to do with the two caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.

The advice I've given people in the past is: if you do a LOT of I/O between UFS and ZFS on the same box, it's time to move to 100% ZFS. That said, I still do not recommend ZFS for a root filesystem (this biting people still happens even today), and swap-on-ZFS is a huge no-no.

I will note that I myself use pure UFS+SU (not SUJ) for my main OS installation (that means /, swap, /var, /tmp, and /usr) on a dedicated SSD, while everything else is ZFS raidz1 (no dedup, no compression; I won't ever enable these until that thread-priority problem is fixed on FreeBSD). However, when I was migrating from gmirror+UFS+SU to ZFS, I witnessed what I described in my 1st and 2nd paragraphs. What userland utilities were used (rsync vs. cp) made no difference; the problem is in the kernel.

Footnote about this thread: this thread contains all sorts of random pieces of information about systems, with very little actual detail in them (barring the symptoms, which are always useful to know!). For example, just because your machine has 8 cores and 12GB of RAM doesn't mean jack squat if some software in the kernel is designed oddly. Reworded: throwing more hardware at a problem solves nothing.

The most useful thing (for me) that I found was deep within the thread, a few words along the lines of "de-dup isn't used". What about compression, and whether it's *ever* been enabled on the filesystem (even if not presently enabled)? It matters. All this matters.

I see lots of end-users talking about these problems, but (barring Steven) literally no kernel people who are in the know about ZFS mentioning how said users can get them (devs) info that can help track this down. Those devs live on freebsd-fs@ and freebsd-hackers@, and not too many read freebsd-stable@.

Step back for a moment and look at this anti-KISS configuration:

- Hardware RAID controller involved (Areca 1680ix)
- Hardware RAID controller has its own battery-backed cache (2GB)
- Therefore arcmsr(4) is involved -- revision of driver/OS build matters here, ditto with firmware version
- 4 disks are involved, models unknown
- Disks are GPT and are *partitioned*, and ZFS refers to the partitions, not the raw disks -- this matters (honest, it really does; the ZFS code handles things differently with raw disks)
- Providers are GELI-encrypted

Now ask yourself if any dev is really going to tackle this one given the above mess.
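For reference, the kind of device-tree diagram requested earlier in the thread would look something like this hypothetical sketch of the reported setup (device and label names invented for illustration):

    arcmsr0 (Areca 1680ix, 2GB BBU-backed cache)
     |- da0 (controller RAID1 pair) - GPT - UFS root + swap
     |- da1 - GPT - gpt/disk1 - gpt/disk1.eli --+
     |- da2 - GPT - gpt/disk2 - gpt/disk2.eli   +- zpool (4 geli providers)
     |- da3 - GPT - gpt/disk3 - gpt/disk3.eli   |
     `- da4 - GPT - gpt/disk4 - gpt/disk4.eli --+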
My advice would be to get rid of the hardware RAID (go with Intel ICHxx or ESBx on-board with AHCI), use raw disks for ZFS (if they are 4096-byte-sector disks, use the gnop(8) method, which is a one-time thing), and get rid of GELI. If you can reproduce the problem there 100% of the time, awesome -- it's a clean/clear setup for someone to help investigate.

--
Jeremy Chadwick                                   j...@koitsu.org
UNIX Systems Administrator                 http://jdc.koitsu.org/
Mountain View, CA, US
Making life hard for others since 1977. PGP 4BD6C0CB
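The gnop(8) method referred to is, roughly, the following one-time sequence at pool-creation time (a sketch; the disk and pool names are placeholders):

    # transparent provider advertising 4096-byte sectors
    gnop create -S 4096 /dev/ada1
    # build the pool on the .nop device so ashift=12 gets recorded
    zpool create tank /dev/ada1.nop
    # the nop layer is only needed once; the pool keeps its ashift
    zpool export tank
    gnop destroy /dev/ada1.nop
    zpool import tank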
Re: ZFS stalls -- and maybe we should be talking about defaults?
In article 8c68812328e3483ba9786ef155911...@multiplay.co.uk, kill...@multiplay.co.uk writes:

Now, interesting you should say that: I've seen a stall recently on a ZFS-only box running on 6 x SSD RAIDZ2. The stall was caused by a fairly large mysql import, with nothing else running. When it happened I thought the machine had wedged, but minutes (not seconds) later, everything sprang into action again.

I have certainly seen what you might describe as stalls, caused, so far as I can tell, by kernel memory starvation. I've seen it take as much as half an hour to recover from these (which is too long for my users). Right now I have the ARC limited to 64GB (on a 96GB file server) and that has made it more stable, but it's still not behaving quite as I would like, and I'm looking to put more memory into the system (to be used for non-ARC functions).

Looking at my munin graphs, I find that backups in particular put very heavy pressure on, doubling the UMA allocations over steady-state, and this takes about four or five hours to climb back down. See http://people.freebsd.org/~wollman/vmstat_z-day.png for an example. Some of the stalls are undoubtedly caused by internal fragmentation rather than actual data in use. (Solaris used to have this issue, and some hooks were added to allow some amount of garbage collection with the cooperation of the filesystem.)

-GAWollman
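Without munin, the same pressure history can be captured crudely (a sketch of my own; the log path is invented):

    #!/bin/sh
    # snapshot UMA zone usage every 5 minutes so ARC/UMA pressure
    # episodes can be lined up against observed stalls afterwards
    while :; do
        date '+%F %T'
        vmstat -z
        sleep 300
    done >> /var/log/uma-zones.log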