Matt,
Following up on my previous email: I reviewed 4873 and related sections
again, and I see where the bzeros are performed in dnode.c. The stack traces I
was looking at recently show that we get stuck in the dmu_hold_ functions,
while the bzeros themselves are unlikely to show up in dtrace/spindump output.
I'm testing a branch using the ZoL approach (minus the 'granularity'
option, which would be helpful if we could set it), and it seems to be the
solution for OS X as well. Unaligned requests are rounded to the volblocksize
and skipped if they have zero size afterwards, and zil_commit is only performed
for sync=always.
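
Roughly, the alignment logic in that branch looks like the sketch below
(paraphrased rather than the exact diff; zvol_unmap_align is just a
placeholder name):

    /*
     * Round the start of an unmap up and the end down to volblocksize
     * boundaries so we never partially free a block, which would force
     * dnode_free_range() into a read-modify-write that zeroes part of
     * the block.
     */
    static int
    zvol_unmap_align(zvol_state_t *zv, uint64_t *off, uint64_t *len)
    {
        uint64_t start = P2ROUNDUP(*off, zv->zv_volblocksize);
        uint64_t end = P2ALIGN(*off + *len, zv->zv_volblocksize);

        if (start >= end)
            return (0);    /* nothing block-aligned left; drop the request */

        *off = start;
        *len = end - start;
        return (1);
    }

The handler then frees the aligned range with dmu_free_long_range() and only
calls zil_commit() when the dataset has sync=always.
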
Per the SCSI Unmap spec, unmapped blocks are not guaranteed to read back
as zero or any particular value. Also, HFS likes to re-trim these free ranges,
both at mount time and whenever free ranges coalesce: rather than unmapping
only the newly freed blocks, it issues an unmap across the widest range of free
blocks that includes each new block.
With this branch there is no longer a delay on mount, and it looks like
we still wind up freeing the majority of the space. It is still ideal to choose
an appropriate volblocksize and then use the command-line tools to format the
volume with a matching blocksize, but this solves the issues seen when using
the GUI Disk Utility to format a zvol with the default 8k volblocksize or
larger.
- Evan
> On Dec 18, 2014, at 2:19 PM, Evan Susarret <[email protected]> wrote:
>
> Matt,
>
>> On Dec 17, 2014, at 10:45 PM, Matthew Ahrens <[email protected]> wrote:
>>
>> I don't understand what this has to do with devices that support DISCARD /
>> UNMAP, such as SSDs. The question is about freeing part of a ZVOL in
>> response to UNMAP/DISCARD requests from HFS+, right? The end result being
>> that (maybe) the ZVOL uses less space. ZFS is not issuing any
>> UNMAP/DISCARDs to the underlying device (e.g. SSD).
>>
>> I guess I don't really understand why HFS+ is issuing all these UNMAPs on
>> mount. Does it do that on other storage? E.g. HFS+ on a SSD (no ZFS
>> involved). Is it sending a zillion TRIM commands every time my laptop
>> boots, or I plug in an external SSD?
>
> To answer your question - the mount function for HFS+ volumes
> intentionally searches for free ranges and issues Unmaps. This happens every
> mount (except when mounting a root filesystem). This may seem like overkill,
> but with (Apple-approved) SSD hardware it typically isn’t even noticeable.
> See below for links from the xnu kernel sources on the Apple open source
> website.
>
> From Apple opensource xnu-2782.1.97:
>
> hfs_vfsutils.c: hfsplus_mount starts a thread for hfs_scan_blocks
> http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsutils.c
>
> hfs_vfsops.c: hfs_scan_blocks calls ScanUnmapBlocks
> http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfs_vfsops.c
>
> VolumeAllocation.c: ScanUnmapBlocks issues TRIM for any unused ranges within
> the volume’s physical extents
> http://www.opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/hfs/hfscommon/Misc/VolumeAllocation.c
>
> As commented on line 96 of VolumeAllocation.c:
> /*
>  * ScanUnmapBlocks
>  * Traverse the entire allocation bitmap. Potentially issue DKIOCUNMAPs
>  * to the device as it tracks unallocated ranges when iterating the
>  * volume bitmap. Additionally, build up the in-core summary table of
>  * the allocation bitmap.
>  */
>
>>
>> --matt
>>
>>> On Wed, Dec 17, 2014 at 10:21 PM, Jorgen Lundman <[email protected]>
>>> wrote:
>>>
>>>
>>> Matthew Ahrens wrote:
>>> > I think you're saying you have a zvol with HFS+ on top, and that when you
>>> > mount the HFS+ volume, it sends a lot of unmap requests to the zvol, which
>>> > is slow.
>>> >
>>> > Before we get into complicated solutions, I have some stupid questions:
>>> >
>>> > - Why does it need to issue a zillion unmaps every time you mount?
>>>
>>> Ask Apple!
>>>
>>> But yes, I suppose if you have a device that supports DISCARDs, and you
>>> (the user) want to receive unmaps, the XNU kernel goes through all empty
>>> areas on mount and issues an unmap for each. Even for areas that have
>>> already been discarded, it seems.
>>>
>>> I don't think we can detect those unmap requests for already unmapped
>>> areas? Easily?
>>>
>>>
>>> >
>>> > - Could you just ignore the UNMAPs? (obvious answer is yes, but does it
>>> > hurt anything else)
>>>
>>> Of course you can. Not buying SSDs is one way! Disabling unmap support in
>>> the OS also works. But I guess they added unmap to devices, and operating
>>> systems, for a reason, and should the user want unmaps enabled, 20 minutes
>>> to mount is undesirable. So we were checking to see if we could do
>>> something trivial to lessen the effect. People running the experimental
>>> code report happiness, but that doesn't mean it's correct :)
>
> AFAIK HFS is the only OS X filesystem actively using unmap, though I honestly
> do not know whether others do. Either way, the freed blocks are not guaranteed
> to be zeroed, only marked as free. So it *should* be safe to drop or ignore
> unmaps...
>
>>>
>>>
>>> >
>>> > - Do you have this unmap performance fix?
>>> > 4873 zvol unmap calls can take a very long time for larger datasets
>>>
>>> Yes. Makes little difference. It is more the high number of commits that
>>> take a while to go through.
>>>
>>> I suspect only OSX and FreeBSD have this concern, as the other platforms do
>>> not yet fully support devices with DISCARD. But I don't know for sure.
>
> There are several issues at play here, and I don't believe there is an
> easy one-size-fits-all solution. There are a few improvements we could make,
> as well as solid documentation we could provide to help users configure zvols
> properly up front.
>
> Back in May-June, I worked together with Jorgen on implementing zvol
> unmap, originally modeled after the existing illumos code for the DKIOCFREE
> ioctl.
> The space savings are significant, especially for uncompressed zvols.
> Without unmap, HFS just marks blocks as free in its metadata but does not
> zero them, so zvols would grow until the Used space was close to the Volsize,
> and a manual run of Disk Utility's "Erase Free Space" (or another such
> method) was needed to reclaim the unused space. The space used by incremental
> snapshots is also significantly smaller. See this link for some numbers from
> zfs list:
> https://github.com/openzfsonosx/zfs/pull/199#issue-36229857
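>
> For context, the heart of that DKIOCFREE-modeled path looks roughly like the
> fragment below (simplified and from memory, not our exact code; error
> handling trimmed):
>
>     rl_t *rl;
>     dmu_tx_t *tx;
>     int error;
>
>     rl = zfs_range_lock(&zv->zv_znode, off, len, RL_WRITER);
>     tx = dmu_tx_create(zv->zv_objset);
>     error = dmu_tx_assign(tx, TXG_WAIT);
>     if (error != 0) {
>         dmu_tx_abort(tx);
>     } else {
>         /* Log the free to the ZIL, then free the range in the DMU. */
>         zvol_log_truncate(zv, tx, off, len, B_TRUE);
>         dmu_tx_commit(tx);
>         error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, off, len);
>     }
>     zfs_range_unlock(rl);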
>
> Mac OS X has issues with zvol volblocksize or, more generally, with any
> block device that has a blocksize larger than 8k. Jorgen, I, and others have
> experimented with this - Disk Utility as well as the command-line tools can
> partially or completely fail to address the device. In the distant past this
> meant that zvols created by (or send/recv'd from) other OpenZFS
> implementations were incompatible - partition maps and data were not
> recognized, and vice versa. We have since resolved/worked around that issue
> by setting the logical blocksize to 512b.
> Ideally we would advertise 512b as the logical blocksize and also advise
> the OS of the physical blocksize (the volblocksize). However, we have not had
> much success setting this in a way that Disk Utility and other tools
> recognize and use. Currently the 512b logical blocksize seems to be all we
> can easily set, but at least this maximizes compatibility.
>
> Because of that issue, certain volblocksize settings can exhibit poor
> performance. On the whole, a zvol created with `zfs create -b 128k tank/zvol`
> seems to perform OK. Compression ratios are often better with larger
> blocksizes, which makes it tempting to create zvols with large volblocksizes.
> For advanced users it may be obvious to use a smaller blocksize, but for the
> average user it is very confusing (e.g. "zfs datasets use 128k as the default
> recordsize, so why use 8k for a zvol?").
> Additionally, the GUI tools are not aware of the zvol's volblocksize, so
> it is better to use the command line `newfs_hfs` instead. For example, when
> using the zfs default of volblocksize=8k, Disk Utility will create an HFS
> volume that uses a 4k blocksize. As a result, unmaps are often issued for
> half a block and are only sometimes aligned to the volblocksize correctly.
>
> That's where the blocksize issue is intertwined with unmap performance:
> as noted by ZoL, unmaps that are not aligned to the volblocksize result in
> read-modify-write cycles that (slowly) zero out ranges within a block.
>
> Matt - does that seem accurate? I've looked at dmu_free_long_range,
> etc., but with so many layers I find the exact behavior hard to pin down.
> Also I don't see this in 'spindump' backtraces while the unmaps are being
> processed.
>
> To alleviate this, ZoL's zvol_discard() rounds the start of the request
> up and the end down to volblocksize boundaries, and rejects requests that end
> up with zero size after that. We probably should do the same. This alone
> could solve the slow mount times that I and others are experiencing, by
> avoiding the read-modify-write, though it may also reduce the space savings
> seen.
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L661-L673
>
> ZoL also advertises a discard_granularity that matches the
> volblocksize, which should prevent unaligned unmaps from being issued in the
> first place. I’m not sure that we have the ability to do so on OS X, but I’ll
> have to recheck in IOKit and the xnu kernel source. I believe we will have an
> ‘advisement’ property, like the Physical blocksize property, that may or may
> not be respected by the kernel.
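>
> For reference, my recollection of the ZoL setup (their zvol.c; treat the
> exact calls as approximate) is roughly:
>
>     /*
>      * Advertise one volblocksize as the discard granularity so the
>      * kernel only issues aligned discards in the first place (ZoL
>      * also flags the queue as discard-capable and raises the max
>      * discard size alongside this).
>      */
>     zv->zv_queue->limits.discard_granularity = zv->zv_volblocksize;
>
> On OS X the equivalent, if it exists at all, would have to go through IOKit
> properties.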
>
> Re: the original question, I should also note that the unmaps are being
> issued from HFS to the zvol correctly, whether threaded or not (they are
> issued by a background thread anyway). I have system logs showing that all
> 80,000 unmaps are typically issued within 1-3 minutes. The slowness is mostly
> in processing them - 30 minutes each time the volume is mounted. Running
> spindump/dtrace and other tools shows that zil_commit and dmu_free_long_range
> are taking most of that time. With sync=disabled, or with another branch that
> enables the ZVOL_WCE flag (and sync=standard), processing consistently takes
> only 10 minutes to complete.
>
> Matt, any pointers on whether or not to use the ZVOL_WCE write cache
> flag? It is disabled by default, and in our current implementation cannot
> even be enabled by ioctl (present in zvol.c, but we don’t create the block
> device in that way).
> This translates to zil_commit being called for every unmap, regardless of the
> ‘sync’ property:
> https://github.com/openzfsonosx/zfs/blob/master/module/zfs/zvol.c#L1926-L1936
>
> /*
>  * If the write-cache is disabled or 'sync' property
>  * is set to 'always' then treat this as a synchronous
>  * operation (i.e. commit to zil).
>  */
> if (!(um->zv->zv_flags & ZVOL_WCE) ||
>     (um->zv->zv_objset->os_sync == ZFS_SYNC_ALWAYS)) {
>     zil_commit(um->zv->zv_zilog, ZVOL_OBJ);
> }
>
> BTW, ZoL is currently not using a transaction or zil_commit:
> https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zvol.c#L675-L683
>
> We also do not have dmu_tx_mark_netfree(), which I see was added to the
> DKIOCFREE path on illumos, though it appears that would only help in a full
> pool/quota situation.
> https://github.com/illumos/illumos-gate/commit/4bb73804952172060c9efb163b89c17f56804fe8
>
> I've also been partial to adding knobs and/or zvol properties to enable
> or disable unmap functionality, so people can alter the behavior to suit
> their needs. A system-wide setting makes sense, but I think per-zvol would be
> useful, too.
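>
> As a straw man, the system-wide knob could be as simple as a tunable checked
> at the top of the unmap path (the names zvol_unmap_enabled and zvol_unmap
> here are hypothetical, not existing code):
>
>     /* Hypothetical global tunable; could also be a per-zvol property. */
>     static int zvol_unmap_enabled = 1;
>
>     static int
>     zvol_unmap(zvol_state_t *zv, uint64_t off, uint64_t len)
>     {
>         if (!zvol_unmap_enabled) {
>             /*
>              * Unmapped blocks are not required to read back as
>              * zeros, so silently succeeding is safe; we simply
>              * don't reclaim the space.
>              */
>             return (0);
>         }
>         /* ... existing alignment, free, and zil_commit logic ... */
>         return (0);
>     }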
>
> Any suggestions you may have are welcome!
> Thank you,
> Evan Susarret
>
>>>
>>>
>>> --
>>> Jorgen Lundman | <[email protected]>
>>> Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
>>> Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
>>> Japan | +81 (0)3 -3375-1767 (home)
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer