Re: Options for zfs inside a VM backed by zfs on the host

Chad J. Milios Fri, 28 Aug 2015 09:34:08 -0700

> On Aug 27, 2015, at 7:47 PM, Tenzin Lhakhang <[email protected]> 
> wrote:
> 
> On Thu, Aug 27, 2015 at 3:53 PM, Chad J. Milios <[email protected]> wrote:
> 
> Whether we are talking ffs, ntfs or zpool atop zvol, unfortunately there are 
> really no simple answers. You must consider your use case, the host and vm 
> hardware/software configuration, perform meaningful benchmarks and, if you 
> care about data integrity, thorough tests of the likely failure modes (all 
> far more easily said than done). I’m curious to hear more about your use 
> case(s) and setups so as to offer better insight on what alternatives may 
> make more/less sense for you. Performance needs? Are you striving for lower 
> individual latency or higher combined throughput? How critical are integrity 
> and availability? How do you prefer your backup routine? Do you handle that 
> in guest or host? Want features like dedup and/or L2ARC up in the mix? (Then 
> everything bears reconsideration, just about triple your research and testing 
> efforts.)
> 
> Sorry, I’m really not trying to scare anyone away from ZFS. It is awesome and 
> capable of providing amazing solutions with very reliable and sensible 
> behavior if handled with due respect, fear, monitoring and upkeep. :)
> 
> There are cases to be made for caching [meta-]data in the child, in the 
> parent, checksumming in the child/parent/both, compressing in the 
> child/parent. I believe `gstat` along with your custom-made benchmark or test 
> load will greatly help guide you.
> 
> ZFS on ZFS seems to be a hardly studied, seldom reported, never documented, 
> tedious exercise. Prepare for accelerated greying and balding of your hair. 
> The parent's volblocksize, child's ashift, alignment, interactions involving 
> raidz stripes (if used) can lead to problems from slightly decreased 
> performance and storage efficiency to pathological write amplification within 
> ZFS, performance and responsiveness crashing and sinking to the bottom of the 
> ocean. Some datasets can become veritable black holes to vfs system calls. 
> You may see ZFS reporting elusive errors, deadlocking or panicking in the 
> child or parent altogether. With diligence though, stable and performant 
> setups can be discovered for many production situations.
> 
> For example, for a zpool (whether used by a VM or not, locally, thru iscsi, 
> ggate[cd], or whatever) atop zvol which sits on parent zpool with no 
> redundancy, I would set primarycache=metadata checksum=off compression=off 
> for the zvol(s) on the host(s) and for the most part just use the same zpool 
> settings and sysctl tunings in the VM (or child zpool, whatever role it may 
> conduct) that I would otherwise use on bare cpu and bare drives (defaults + 
> compression=lz4 atime=off). However, that simple case is likely not yours.
> 
> With ufs/ffs/ntfs/ext4 and most other filesystems atop a zvol I use checksums 
> on the parent zvol, and compression too if the child doesn’t support it (as 
> ntfs can), but still caching only metadata on the host and letting the child 
> vm/fs cache real data.
> 
> My use case involves charging customers for their memory use so admittedly 
> that is one motivating factor, LOL. Plus, I certainly don’t want one rude VM 
> marching through host ARC unfairly evacuating and starving the other polite 
> neighbors.
> 
> VM’s swap space becomes another consideration and I treat it like any other 
> ‘dumb’ filesystem with compression and checksumming done by the parent but 
> recent versions of many operating systems may be paging out only already 
> compressed data, so investigate your guest OS. I’ve found lz4’s claims of an 
> almost-no-penalty early-abort to be vastly overstated when dealing with 
> zvols, small block sizes and high throughput so if you can be certain you’ll 
> be dealing with only compressed data then turn it off. For the virtual memory 
> pagers in most current-day OS’s though set compression on the swap’s backing 
> zvol to lz4.
> 
> Another factor is the ZIL. One VM can hoard your synchronous write 
> performance. Solutions are beyond the scope of this already-too-long email :) 
> but I’d be happy to elaborate if queried.
> 
> And then there’s always netbooting guests from NFS mounts served by the host 
> and giving the guest no virtual disks, don’t forget to consider that option.
> 
> Hope this provokes some fruitful ideas for you. Glad to philosophize about 
> ZFS setups with y’all :)
> 
> -chad


> That was a really awesome read!  The idea of turning metadata on at the 
> backend zpool and then data on the VM was interesting, I will give that a 
> try. Could you please elaborate more on ZILs and synchronous writes by 
> VMs? That seems like a great topic.

> I am right now exploring the question: are SSD ZILs necessary in an all SSD 
> pool? And then the question of NVMe SSD ZILs on top of an all SSD pool.  My 
> guess at the moment is that SSD ZILs are not necessary at all in an SSD pool 
> during intensive IO.  I've been told that ZILs are always there to help you, 
> but when your pool's aggregate IOPS is greater than that of a ZIL, it doesn't seem 
> to make sense.. Or is it the latency of writing to a single disk vs striping 
> across your "fast" vdevs?
> 
> Thanks,
> Tenzin

Well the ZIL (ZFS Intent Log) is basically an absolute necessity. Without it, a 
call to fsync() could take over 10 seconds on a system serving a relatively 
light load. HOWEVER, a source of confusion is the terminology people often 
throw around. See, the ZIL is basically a concept, a method, a procedure. It is 
not a device. A 'SLOG' is what most people mean when they say ZIL. That is a 
Separate Log device. (ZFS ‘log’ vdev type; documented in man 8 zpool.) When you 
aren’t using a SLOG device, your ZIL is transparently allocated by ZFS, roughly 
a little chunk of space reserved near the “middle” (at least ZFS attempts to 
locate it there physically but on SSDs or SMR HDs there’s no way to and no 
point to) of the main pool (unless you’ve gone out of your way to deliberately 
disable the ZIL entirely).

The other confusion often surrounding the ZIL is when it gets used. Most writes 
(in the world) would bypass the ZIL (built-in or SLOG) entirely anyway because 
they are asynchronous writes, not synchronous ones. Only the latter are 
candidates to clog a ZIL bottleneck. You will need to consider your workload 
specifically to know whether a SLOG will help, and if so, how much SLOG 
performance is required to not put a damper on the pool’s overall throughput 
capability. Conversely you want to know how much SLOG performance is overkill 
because NVMe and SLC SSDs are freaking expensive.

Now for many on the list this is going to be some elementary information so I 
apologize, but I come across this question all the time: sync vs async writes. 
I’m sure there are many who might find this informative, and with ZFS the 
difference becomes more profound and important than with most other filesystems.

See, ZFS is always bundling up batches of writes into transaction groups 
(TXGs). Without extraneous detail it can be understood that basically these 
happen every 5 seconds (sysctl vfs.zfs.txg.timeout). So picture ZFS typically 
has two TXGs it’s worried about at any given time, one is being filled into 
memory while the previous one is being flushed out to physical disk.

So when you write something asynchronously the operating system is going to say 
‘aye aye captain’ and send you along your merry way very quickly but if you 
lose power or crash and then reboot, ZFS only guarantees you a CONSISTENT 
state, not your most recent state. Your pool may come back online and you’ve 
lost 5-15 seconds worth of work. For your typical desktop or workstation 
workload that’s probably no big deal. You lost 15 seconds of effort, you repeat 
it, and continue about your business.

However, imagine a mail server that received many many emails in just that 
short time and has told all the senders of all those messages “got it, thumbs 
up”. You cannot retract those assurances you handed out. You have no idea who to 
contact to ask to repeat themselves. Even if you did it's likely the sending 
mail servers have long since forgotten about those particular messages. So, 
with each message you receive, after you tell the operating system to write the 
data you issue a call to fsync(new_message) and only after that call returns do 
you give the sender the thumbs up to forget the message and leave it in your 
capable hands to deliver it to its destination. Thanks to the ZIL, fsync() will 
typically return in milliseconds or less instead of the many seconds it could 
take for that write in a bundled TXG to end up physically saved. In an ideal 
world, the ZIL gets written to and never read again, data just becoming stale 
and overwritten. (The data stays in the in-memory TXG so it’s redundant in the 
ZIL once that TXG completes flushing).
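
To make the mail server pattern above concrete, here is a minimal sketch in 
Python of "write, fsync, and only then acknowledge". The function name 
store_message_durably and the spool directory layout are hypothetical, made up 
for illustration; only the os calls are real. It assumes a POSIX system 
(FreeBSD, Linux) where a directory can be opened and fsync'ed.

```python
import os


def store_message_durably(spool_dir, name, payload):
    """Write payload to spool_dir/name; return only after fsync() succeeds."""
    path = os.path.join(spool_dir, name)
    # O_EXCL: refuse to overwrite a message that is already spooled.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than asked, so loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Synchronous point: on ZFS this is where the ZIL (or SLOG) is used.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Also fsync the directory so the new file name itself is durable.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path
```

Only after this function returns would the server tell the sender "got it". 
If it raises, the sender gets no thumbs up and keeps the message.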

The email server is the typical example of the use of fsync but there are 
thousands of others. Typically applications using central databases are written 
in a simplistic way to assume the database is trustworthy and fsync is how the 
database attempts to fulfill that requirement.

To complicate matters, consider VMs, particularly uncooperative, impolite, 
selfish VMs. Synchronous write iops are a particularly scarce and expensive 
resource which hasn’t been increasing as quickly and cheaply as, say, io 
bandwidth, cpu speeds, memory capacities. To make it worse, the iops numbers 
most SSD makers advertise on their so-called spec sheets are untrustworthy: 
they have no standard benchmark or enforcement (“The PS in IOPS stands for Per 
Second so we ran our benchmark on a fresh drive for one second and got 100,000 
IOPS.” Well, good for you, that is useless to me. Tell me what you can sustain 
all day long a year down the road.) and they’re seldom accountable to anybody 
not buying 10,000 units. All this consolidation of VMs/containers/jails can 
really stress sync i/o capability of even the biggest baddest servers.
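
Since spec sheet numbers can't be trusted, one rough way to see what sync 
writes cost on your own pool is to time them yourself. The sketch below is 
only an illustration, not a proper benchmark: time_writes and its parameters 
are hypothetical names, and it writes random data so that compression on the 
dataset doesn't flatter the result.

```python
import os
import time


def time_writes(path, count=200, block_size=4096, sync=True):
    """Return a list of per-write latencies (seconds) for `count` writes."""
    block = os.urandom(block_size)  # incompressible, unlike a block of zeros
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            if sync:
                # Force this write to stable storage before timing stops.
                os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies
```

Comparing sync=True against sync=False on the same dataset gives a feel for 
how much each fsync() costs. To learn what a device can sustain, you would 
need a far larger count, a run long enough to get past any fresh-drive burst, 
and `gstat` open in another terminal.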

And FreeBSD, in all its glory, is not yet very well suited to the problem of 
multi-tenancy. (It’s great if all jails and VMs on a server are owned and 
controlled by one stakeholder who can coordinate their friendly coexistence.) 
My firm develops and supports a proprietary shim into ZFS and jails for 
enforcing the polite sharing of bandwidth, total iops and sync iops, which can 
be applied to groups whose members are arbitrary ZFS datasets. So there, 
that's my shameless plug, LOL. However there are brighter 
minds than I working on this problem and I’m hoping to maybe some time either 
participate in a more general development of such facilities with broader 
application into mainline FreeBSD or to perhaps open source my own work 
eventually. (I guess I’m being more shy than selfish with it, LOL.)

Hope that’s food for thought for some of you

-chad
_______________________________________________
[email protected] mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
To unsubscribe, send any mail to 
"[email protected]"
