Matt,
> On Dec 17, 2014, at 10:45 PM, Matthew Ahrens <[email protected]> wrote:
>
> I don't understand what this has to do with devices that support DISCARD /
> UNMAP, such as SSDs. The question is about freeing part of a ZVOL in
> response to UNMAP/DISCARD requests from HFS+, right? The end result being
> that (maybe) the ZVOL uses less space. ZFS is not issuing any UNMAP/DISCARDs
> to the underlying device (e.g. SSD).
>
> I guess I don't really understand why HFS+ is issuing all these UNMAPs on
> mount. Does it do that on other storage? E.g. HFS+ on a SSD (no ZFS
> involved). Is it sending a zillion TRIM commands every time my laptop boots,
> or I plug in an external SSD?
To answer your question: the mount function for HFS+ volumes
intentionally searches for free ranges and issues unmaps. This happens
on every mount (except when mounting a root filesystem). It may seem
like overkill, but with (Apple-approved) SSD hardware it typically isn't
even noticeable. See below for links into the xnu kernel sources on the
Apple open source website.
From the Apple open source xnu-2782.1.97 tree:
hfs_vfsutils.c: hfsplus_mount starts a thread for hfs_scan_blocks
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsutils.c
hfs_vfsops.c: hfs_scan_blocks calls ScanUnmapBlocks
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsops.c
VolumeAllocation.c: ScanUnmapBlocks issues TRIM for any unused ranges within
the volume’s physical extents
http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfscommon/Misc/VolumeAllocation.c
As commented on line 96 of VolumeAllocation.c:
/*
ScanUnmapBlocks
Traverse the entire allocation bitmap. Potentially issue DKIOCUNMAPs
to the device as it tracks unallocated ranges when iterating the
volume bitmap. Additionally, build up the in-core summary table of
the allocation bitmap.
*/
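Conceptually, the scan that comment describes boils down to something like the following (a from-scratch sketch, not Apple's code; the bitmap layout and the unmap callback type are my own assumptions):

```c
#include <stdint.h>

/* Stand-in for issuing DKIOCUNMAP against the device (hypothetical). */
typedef void (*unmap_fn)(uint64_t start_block, uint64_t nblocks, void *arg);

/*
 * Walk an allocation bitmap (one bit per allocation block, 1 = in use,
 * most-significant bit first) and emit one unmap per contiguous run of
 * free blocks -- roughly what ScanUnmapBlocks does for the whole volume
 * at mount time.
 */
static void
scan_unmap_blocks(const uint8_t *bitmap, uint64_t nblocks,
    unmap_fn unmap, void *arg)
{
	uint64_t run_start = 0, run_len = 0;

	for (uint64_t b = 0; b < nblocks; b++) {
		int in_use = (bitmap[b / 8] >> (7 - (b % 8))) & 1;
		if (!in_use) {
			if (run_len++ == 0)
				run_start = b;
		} else if (run_len != 0) {
			unmap(run_start, run_len, arg);
			run_len = 0;
		}
	}
	if (run_len != 0)
		unmap(run_start, run_len, arg);
}
```

On a mostly empty volume with fragmented free space, a walk like this naturally produces the tens of thousands of unmaps we observe.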
>
> --matt
>
> On Wed, Dec 17, 2014 at 10:21 PM, Jorgen Lundman <[email protected]> wrote:
>
>
> Matthew Ahrens wrote:
> > I think you're saying you have a zvol with HFS+ on top, and that when you
> > mount the HFS+ volume, it sends a lot of unmap requests to the zvol, which
> > is slow.
> >
> > Before we get into complicated solutions, I have some stupid questions:
> >
> > - Why does it need to issue a zillion unmaps every time you mount?
>
> Ask Apple!
>
> But yes, I suppose if you have a device that supports DISCARDs, and you
> (the user) want to receive unmaps, the XNU kernel goes through all empty
> areas on mount and issues an unmap for each. Even for areas that have
> already been discarded, it seems.
>
> I don't think we can detect those unmap requests for already-unmapped
> areas. At least not easily.
>
>
> >
> > - Could you just ignore the UNMAPs? (obvious answer is yes, but does it
> > hurt anything else)
>
> Of course you can. Not buying SSDs is one way! Disabling unmap support
> in the OS also works. But I guess they added unmap to devices, and
> operating systems, for a reason, and if the user wants it enabled, 20
> minutes to mount is undesirable. So we were checking to see if we could
> do something trivial to lessen the effect. People running the
> experimental code report happiness, but that doesn't mean it's correct :)
AFAIK, HFS is the only OS X filesystem actively using unmap, though I
don't know that for certain. Either way, the freed blocks are not
guaranteed to be zeroed, only marked as free. So it *should* be safe to
drop or ignore unmaps...
>
>
> >
> > - Do you have this unmap performance fix?
> > 4873 zvol unmap calls can take a very long time for larger datasets
>
> Yes. Makes little difference. It is more the high number of commits that
> take a while to go through.
>
> I suspect only OSX and FreeBSD have this concern, as the other platforms do
> not yet fully support devices with DISCARD. But I don't know for sure.
There are several issues at play here, and I don't believe there is an
easy one-size-fits-all solution. There are a few improvements we could
make, as well as solid documentation we could provide to help users
configure zvols properly up front.
Back in May-June, I worked together with Jorgen on implementing zvol
unmap. I originally modeled it after existing illumos code for the DKIOCFREE
ioctl.
The space savings are significant, especially for uncompressed zvols.
Without unmap, HFS just marks blocks as free in metadata but does not
zero them. Zvols would grow until the Used space was close to the
Volsize, and a manual run of Disk Utility's "Erase Free Space" (or
another method) was needed to reclaim the unused space. The space used
by incremental snapshots is also significantly smaller. See this link
for some numbers from zfs list:
https://github.com/openzfsonosx/zfs/pull/199#issue-36229857
Mac OS X has issues with zvol volblocksize, or more generally, with any
block device whose blocksize is larger than 8k. Jorgen, myself, and
others have experimented with this: Disk Utility, as well as
command-line tools, can partially or completely fail to address the
device. In the past this meant that zvols created by (or send/recv'd
from) other OpenZFS implementations were incompatible (partition maps
and data went unrecognized), and vice versa. We have since worked around
that issue by setting the logical blocksize to 512b.
Ideally we would advertise 512b as the logical blocksize and also advise
the OS of the physical volblocksize. However, we have not had much
success in getting Disk Utility and other tools to recognize and use the
latter. Currently the 512b logical blocksize seems to be all we can
easily set, but at least it maximizes compatibility.
Because of that issue, certain volblocksize settings can exhibit poor
performance. On the whole, a zvol created with `zfs create -b 128k
tank/zvol` seems to perform OK. Compression ratios are often better with
larger blocksizes, which makes it tempting to create zvols with a high
volblocksize. For advanced users it may be obvious to use a smaller
blocksize, but for the average user it is very confusing (e.g. "zfs
datasets use 128k as the default recordsize, so why use 8k for a
zvol?").
Additionally, the GUI tools are not aware of the zvol's volblocksize, so
it is better to use the command-line `newfs_hfs` instead. For example,
with the zfs default of volblocksize=8k, Disk Utility will create an HFS
volume that uses a 4k blocksize. As a result, unmaps are often issued
for half a volblock, and are only sometimes aligned to the volblocksize
correctly.
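To make the mismatch concrete (my own illustration, not code from either project): an unmap only covers whole volblocks if both its byte offset and length are multiples of the volblocksize, which a 4k HFS allocation block sitting on an 8k volblock frequently violates:

```c
#include <stdint.h>

/* Nonzero if [off, off + len) starts and ends on volblock boundaries. */
static int
unmap_is_aligned(uint64_t off, uint64_t len, uint64_t volblocksize)
{
	return (off % volblocksize == 0) && (len % volblocksize == 0);
}
```

With volblocksize=8k, freeing the 4k HFS block at byte offset 4096 gives unmap_is_aligned(4096, 4096, 8192) == 0: the request touches only half of the first volblock.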
That's where the blocksize issue is intertwined with unmap performance:
as noted by ZoL, unmaps that are not aligned to the volblocksize result
in read-modify-write cycles that (slowly) zero out ranges within a
block.
Matt, does that seem accurate? I've looked at dmu_free_long_range, etc.,
but with so many layers I find the exact behavior obscured. Also, I
don't see this in 'spindump' backtraces while the unmaps are being
processed.
To alleviate this, ZoL's zvol_discard() rounds the discard range inward
to volblocksize boundaries and rejects requests that end up zero-size
after that. We probably should do the same. This alone could solve the
slow mount times myself and others are experiencing, by avoiding the
read-modify-write, though it may also reduce the space savings seen.
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L661-L673
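If we adopt the same defense, the arithmetic would look roughly like this (a sketch in terms of plain offset/length, ignoring ZoL's actual bio plumbing; align_discard is a made-up name):

```c
#include <stdint.h>

/*
 * Shrink a discard to whole volblocks: round the start up and the end
 * down to a volblocksize boundary, and drop the request entirely if no
 * whole block remains.  Returns nonzero when an aligned free should be
 * issued, passing the shrunken range back through *aoff and *alen.
 */
static int
align_discard(uint64_t off, uint64_t len, uint64_t vbs,
    uint64_t *aoff, uint64_t *alen)
{
	uint64_t start = ((off + vbs - 1) / vbs) * vbs;	/* round up */
	uint64_t end = ((off + len) / vbs) * vbs;	/* round down */

	if (start >= end)
		return (0);	/* covers no whole volblock: ignore it */
	*aoff = start;
	*alen = end - start;
	return (1);
}
```

A half-block unmap such as offset 4096, length 4096 on an 8k volblock is rejected outright, so it never reaches dmu_free_long_range at all.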
ZoL also advertises a discard_granularity that matches the
volblocksize, which should prevent unaligned unmaps from being issued in the
first place. I’m not sure that we have the ability to do so on OS X, but I’ll
have to recheck in IOKit and the xnu kernel source. I believe we will have an
‘advisement’ property, like the Physical blocksize property, that may or may
not be respected by the kernel.
Re: the original question, I should also note that the unmaps are being
issued from HFS to the zvol correctly, threaded or not (they are issued
by a background thread anyway). I have system logs showing that all
80,000 unmaps are typically issued within 1-3 minutes. The slowness is
mostly in processing them: 30 minutes each time the volume is mounted.
From spindump/dtrace and other tools, it appears that zil_commit and
dmu_free_long_range take most of that time.
Experimenting with sync=disabled, or with another branch that enables
the ZVOL_WCE flag (with sync=standard), the mount consistently completes
in only 10 minutes.
Matt, any pointers on whether or not to use the ZVOL_WCE write-cache
flag? It is disabled by default, and in our current implementation
cannot even be enabled by ioctl (the code is present in zvol.c, but we
don't create the block device in that way). As it stands, that means
zil_commit is called for every unmap, regardless of the 'sync' property:
https://github.com/openzfsonosx/zfs/blob/master/module/zfs/zvol.c#L1926-L1936
/*
 * If the write-cache is disabled or 'sync' property
 * is set to 'always' then treat this as a synchronous
 * operation (i.e. commit to zil).
 */
if (!(um->zv->zv_flags & ZVOL_WCE) ||
    (um->zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS)) {
	zil_commit(um->zv->zv_zilog, ZVOL_OBJ);
}
BTW, ZoL is currently not using a transaction or zil_commit:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L675-L683
We also do not have dmu_tx_mark_netfree(), which I see was added to
DKIOCFREE on illumos, though that would only help in a full-pool or
near-quota situation.
https://github.com/illumos/illumos-gate/commit/4bb73804952172060c9efb163b89c17f56804fe8
I’ve also been partial to adding knobs and/or zvol properties that
allow users to enable or disable unmap functionality. This would allow people
to alter the behavior to suit their needs. A system-wide setting makes sense,
but I think per-zvol would be useful, too.
Any suggestions you may have are welcome!
Thank you,
Evan Susarret
>
>
> --
> Jorgen Lundman | <[email protected]>
> Unix Administrator | +81 (0)3-5456-2687 ext 1017 (work)
> Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
> Japan | +81 (0)3-3375-1767 (home)
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer