Matt,
> On Dec 17, 2014, at 10:45 PM, Matthew Ahrens <[email protected]> wrote:
>
> I don't understand what this has to do with devices that support DISCARD /
> UNMAP, such as SSDs. The question is about freeing part of a ZVOL in
> response to UNMAP/DISCARD requests from HFS+, right? The end result being
> that (maybe) the ZVOL uses less space. ZFS is not issuing any UNMAP/DISCARDs
> to the underlying device (e.g. SSD).
>
> I guess I don't really understand why HFS+ is issuing all these UNMAPs on
> mount. Does it do that on other storage? E.g. HFS+ on a SSD (no ZFS
> involved). Is it sending a zillion TRIM commands every time my laptop boots,
> or I plug in an external SSD?
To answer your question: the mount function for HFS+ volumes
intentionally searches for free ranges and issues unmaps. This happens
on every mount (except when mounting a root filesystem). It may seem
like overkill, but with (Apple-approved) SSD hardware it typically isn't
even noticeable. See below for links into the xnu kernel sources on the
Apple open source website.
From the Apple open source xnu-2782.1.97 tree:
hfs_vfsutils.c: hfsplus_mount starts a thread for hfs_scan_blocks
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsutils.c
hfs_vfsops.c: hfs_scan_blocks calls ScanUnmapBlocks
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsops.c
VolumeAllocation.c: ScanUnmapBlocks issues TRIM for any unused ranges within
the volume’s physical extents
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfscommon/Misc/VolumeAllocation.c
As commented on line 96 of VolumeAllocation.c:
/*
ScanUnmapBlocks
Traverse the entire allocation bitmap. Potentially issue DKIOCUNMAPs
to the device as it tracks unallocated ranges when iterating the
volume bitmap. Additionally, build up the in-core summary table of
the allocation bitmap.
*/
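Conceptually, the scan that comment describes boils down to something like the following (a from-scratch sketch, not Apple's code; the bitmap layout and the unmap callback type are my own assumptions):

```c
#include <stdint.h>

/* Stand-in for issuing DKIOCUNMAP against the device (hypothetical). */
typedef void (*unmap_fn)(uint64_t start_block, uint64_t nblocks, void *arg);

/*
 * Walk an allocation bitmap (one bit per allocation block, 1 = in use,
 * most-significant bit first) and emit one unmap per contiguous run of
 * free blocks -- roughly what ScanUnmapBlocks does for the whole volume
 * at mount time.
 */
static void
scan_unmap_blocks(const uint8_t *bitmap, uint64_t nblocks,
    unmap_fn unmap, void *arg)
{
	uint64_t run_start = 0, run_len = 0;

	for (uint64_t b = 0; b < nblocks; b++) {
		int in_use = (bitmap[b / 8] >> (7 - (b % 8))) & 1;
		if (!in_use) {
			if (run_len++ == 0)
				run_start = b;
		} else if (run_len != 0) {
			unmap(run_start, run_len, arg);
			run_len = 0;
		}
	}
	if (run_len != 0)
		unmap(run_start, run_len, arg);
}
```

On a mostly empty volume with fragmented free space, a walk like this naturally produces the tens of thousands of unmaps we observe.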
>
> --matt
>
> On Wed, Dec 17, 2014 at 10:21 PM, Jorgen Lundman <[email protected]> wrote:
>
>
> Matthew Ahrens wrote:
> > I think you're saying you have a zvol with HFS+ on top, and that when you
> > mount the HFS+ volume, it sends a lot of unmap requests to the zvol, which
> > is slow.
> >
> > Before we get into complicated solutions, I have some stupid questions:
> >
> > - Why does it need to issue a zillion unmaps every time you mount?
>
> Ask Apple!
>
> But yes, I suppose if you have a device that supports DISCARDs, and you
> (the user) want to receive unmaps, the XNU kernel goes through all empty
> areas on mount and issues an unmap for each. Even for areas that have
> already been discarded, it seems.
>
> I don't think we can detect those unmap requests for already-unmapped
> areas. At least not easily.
>
>
> >
> > - Could you just ignore the UNMAPs? (obvious answer is yes, but does it
> > hurt anything else)
>
> Of course you can. Not buying SSDs is one way! Disabling unmap support
> in the OS also works. But I guess they added unmap to devices, and
> operating systems, for a reason, and if the user wants it enabled, 20
> minutes to mount is undesirable. So we were checking to see if we could
> do something trivial to lessen the effect. People running the
> experimental code report happiness, but that doesn't mean it's correct :)
AFAIK, HFS is the only OS X filesystem actively using unmap, though I
don't know that for certain. Either way, the freed blocks are not
guaranteed to be zeroed, only marked as free. So it *should* be safe to
drop or ignore unmaps...
>
>
> >
> > - Do you have this unmap performance fix?
> > 4873 zvol unmap calls can take a very long time for larger datasets
>
> Yes. Makes little difference. It is more the high number of commits that
> take a while to go through.
>
> I suspect only OSX and FreeBSD have this concern, as the other platforms do
> not yet fully support devices with DISCARD. But I don't know for sure.
There are several issues at play here, and I don't believe there is an
easy one-size-fits-all solution. There are a few improvements we could
make, as well as solid documentation we could provide to help users
configure zvols properly up front.
Back in May-June, I worked together with Jorgen on implementing zvol
unmap. I originally modeled it after existing illumos code for the DKIOCFREE
ioctl.
The space savings are significant, especially for uncompressed zvols.
Without unmap, HFS just marks blocks as free in metadata but does not
zero them. Zvols would grow until the Used space was close to the
Volsize, and a manual run of Disk Utility's "Erase Free Space" (or
another method) was needed to reclaim the unused space. The space used
by incremental snapshots is also significantly smaller. See this link
for some numbers from zfs list:
https://github.com/openzfsonosx/zfs/pull/199#issue-36229857
Mac OS X has issues with zvol volblocksize, or more generally, with any
block device whose blocksize is larger than 8k. Jorgen, myself, and
others have experimented with this: Disk Utility, as well as
command-line tools, can partially or completely fail to address the
device. In the past this meant that zvols created by (or send/recv'd
from) other OpenZFS implementations were incompatible (partition maps
and data went unrecognized), and vice versa. We have since worked around
that issue by setting the logical blocksize to 512b.
Ideally we would advertise 512b as the logical blocksize and also advise
the OS of the physical volblocksize. However, we have not had much
success in getting Disk Utility and other tools to recognize and use the
latter. Currently the 512b logical blocksize seems to be all we can
easily set, but at least it maximizes compatibility.
Because of that issue, certain volblocksize settings can exhibit poor
performance. On the whole, a zvol created with `zfs create -b 128k
tank/zvol` seems to perform OK. Compression ratios are often better with
larger blocksizes, which makes it tempting to create zvols with a high
volblocksize. For advanced users it may be obvious to use a smaller
blocksize, but for the average user it is very confusing (e.g. "zfs
datasets use 128k as the default recordsize, so why use 8k for a
zvol?").
Additionally, the GUI tools are not aware of the zvol's volblocksize, so
it is better to use the command-line `newfs_hfs` instead. For example,
with the zfs default of volblocksize=8k, Disk Utility will create an HFS
volume that uses a 4k blocksize. As a result, unmaps are often issued
for half a volblock, and are only sometimes aligned to the volblocksize
correctly.
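To make the mismatch concrete (my own illustration, not code from either project): an unmap only covers whole volblocks if both its byte offset and length are multiples of the volblocksize, which a 4k HFS allocation block sitting on an 8k volblock frequently violates:

```c
#include <stdint.h>

/* Nonzero if [off, off + len) starts and ends on volblock boundaries. */
static int
unmap_is_aligned(uint64_t off, uint64_t len, uint64_t volblocksize)
{
	return (off % volblocksize == 0) && (len % volblocksize == 0);
}
```

With volblocksize=8k, freeing the 4k HFS block at byte offset 4096 gives unmap_is_aligned(4096, 4096, 8192) == 0: the request touches only half of the first volblock.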
That's where the blocksize issue is intertwined with unmap performance:
as noted by ZoL, unmaps that are not aligned to the volblocksize result
in read-modify-write cycles that (slowly) zero out ranges within a
block.
Matt, does that seem accurate? I've looked at dmu_free_long_range, etc.,
but with so many layers I find the exact behavior obscured. Also, I
don't see this in 'spindump' backtraces while the unmaps are being
processed.
To alleviate this, ZoL's zvol_discard() rounds the discard range inward
to volblocksize boundaries and rejects requests that end up zero-size
after that. We probably should do the same. This alone could solve the
slow mount times myself and others are experiencing, by avoiding the
read-modify-write, though it may also reduce the space savings seen.
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L661-L673
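If we adopt the same defense, the arithmetic would look roughly like this (a sketch in terms of plain offset/length, ignoring ZoL's actual bio plumbing; align_discard is a made-up name):

```c
#include <stdint.h>

/*
 * Shrink a discard to whole volblocks: round the start up and the end
 * down to a volblocksize boundary, and drop the request entirely if no
 * whole block remains.  Returns nonzero when an aligned free should be
 * issued, passing the shrunken range back through *aoff and *alen.
 */
static int
align_discard(uint64_t off, uint64_t len, uint64_t vbs,
    uint64_t *aoff, uint64_t *alen)
{
	uint64_t start = ((off + vbs - 1) / vbs) * vbs;	/* round up */
	uint64_t end = ((off + len) / vbs) * vbs;	/* round down */

	if (start >= end)
		return (0);	/* covers no whole volblock: ignore it */
	*aoff = start;
	*alen = end - start;
	return (1);
}
```

A half-block unmap such as offset 4096, length 4096 on an 8k volblock is rejected outright, so it never reaches dmu_free_long_range at all.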
ZoL also advertises a discard_granularity that matches the
volblocksize, which should prevent unaligned unmaps from being issued in the
first place. I’m not sure that we have the ability to do so on OS X, but I’ll
have to recheck in IOKit and the xnu kernel source. I believe we will have an
‘advisement’ property, like the Physical blocksize property, that may or may
not be respected by the kernel.
Re: the original question, I should also note that the unmaps are being
issued from HFS to the zvol correctly, threaded or not (they are issued
by a background thread anyway). I have system logs showing that all
80,000 unmaps are typically issued within 1-3 minutes. The slowness is
mostly in processing them: 30 minutes each time the volume is mounted.
From spindump/dtrace and other tools, it appears that zil_commit and
dmu_free_long_range take most of that time.
Experimenting with sync=disabled, or with another branch that enables
the ZVOL_WCE flag (with sync=standard), the mount consistently completes
in only 10 minutes.
Matt, any pointers on whether or not to use the ZVOL_WCE write-cache
flag? It is disabled by default, and in our current implementation
cannot even be enabled by ioctl (the code is present in zvol.c, but we
don't create the block device in that way). As it stands, that means
zil_commit is called for every unmap, regardless of the 'sync' property:
https://github.com/openzfsonosx/zfs/blob/master/module/zfs/zvol.c#L1926-L1936
/*
 * If the write-cache is disabled or 'sync' property
 * is set to 'always' then treat this as a synchronous
 * operation (i.e. commit to zil).
 */
if (!(um->zv->zv_flags & ZVOL_WCE) ||
    (um->zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS)) {
	zil_commit(um->zv->zv_zilog, ZVOL_OBJ);
}
BTW, ZoL is currently not using a transaction or zil_commit:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L675-L683
We also do not have dmu_tx_mark_netfree(), which I see was added to
DKIOCFREE on illumos, though that would only help in a full-pool or
near-quota situation.
https://github.com/illumos/illumos-gate/commit/4bb73804952172060c9efb163b89c17f56804fe8
I’ve also been partial to adding knobs and/or zvol properties that
allow users to enable or disable unmap functionality. This would allow people
to alter the behavior to suit their needs. A system-wide setting makes sense,
but I think per-zvol would be useful, too.
Any suggestions you may have are welcome!
Thank you,
Evan Susarret
>
>
> --
> Jorgen Lundman | <[email protected]>
> Unix Administrator | +81 (0)3-5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
> Japan | +81 (0)3-3375-1767 (home)
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer