On Tuesday, August 2, 2016, Ilya Dryomov <[email protected]> wrote:

> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev <[email protected]> wrote:
> > On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin <[email protected]> wrote:
> >> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
> >>> Hi Ilya,
> >>>
> >>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov <[email protected]> wrote:
> >>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev <[email protected]> wrote:
> >>>>> RBD illustration showing RBD ignoring discard until a certain
> >>>>> threshold - why is that?  This behavior is unfortunately incompatible
> >>>>> with ESXi discard (UNMAP) behavior.
> >>>>>
> >>>>> Is there a way to lower the discard sensitivity on RBD devices?
> >>>>>
> >>> <snip>
> >>>>>
> >>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> >>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> >>>>> print SUM/1024 " KB" }'
> >>>>> 819200 KB
> >>>>>
> >>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
> >>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> >>>>> print SUM/1024 " KB" }'
> >>>>> 782336 KB
> >>>>
> >>>> Think about it in terms of underlying RADOS objects (4M by default).
> >>>> There are three cases:
> >>>>
> >>>>     discard range       | command
> >>>>     -----------------------------------------
> >>>>     whole object        | delete
> >>>>     object's tail       | truncate
> >>>>     object's head       | zero
> >>>>
> >>>> Obviously, only delete and truncate free up space.  In all of your
> >>>> examples, except the last one, you are attempting to discard the head
> >>>> of the (first) object.
> >>>>
> >>>> You can free up as little as a sector, as long as it's the tail:
> >>>>
> >>>> Offset    Length  Type
> >>>> 0         4194304 data
> >>>>
> >>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
> >>>>
> >>>> Offset    Length  Type
> >>>> 0         4193792 data
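
For completeness, a minimal sketch of the whole-object (delete) case,
assuming the same /dev/rbd28 device, spin1/testdis image and default 4M
objects as in the examples above, with offset and length both aligned to
the object size:

# blkdiscard -o $((4 << 20)) -l $((4 << 20)) /dev/rbd28
# rbd diff spin1/testdis | awk '{ SUM += $2 } END { print SUM/1024 " KB" }'

Since the second object is covered in full, it is deleted outright, so the
reported total should drop by 4096 KB (provided that object was fully
allocated to begin with).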
> >>>
> >>> Looks like ESXi is sending each discard/unmap with a fixed
> >>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
> >>> is a slight reduction in size via the rbd diff method, but now I
> >>> understand that an actual truncate only takes effect when the discard
> >>> happens to clip the tail of an object.
> >>>
> >>> So far looking at
> >>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2057513
> >>>
> >>> ...the only variable we can control is the count of 8192-sector chunks
> >>> and not their size.  Which means that most of the ESXi discard
> >>> commands will be disregarded by Ceph.
> >>>
> >>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
> >>>
> >>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
> >>> 1342099456, nr_sects 8192)
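
As a quick sanity check on that log line, plain shell arithmetic gives the
offset of this discard within its 4M (8192-sector) RADOS object:

# echo $(( (1342099456 % 8192) * 512 ))
2097152

So this particular request starts 2MB into an object; per Ilya's table
above, only the portion reaching that object's tail gets truncated, while
the portion landing on the next object's head is merely zeroed.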
> >>
> >> Yes, correct. However, to make sure that VMware is not (erroneously)
> >> being forced to do this, you need to perform one more check.
> >>
> >> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the
> >> correct granularity and alignment here (4M, I guess?)
> >
> > This seems to reflect the granularity (4194304), which matches the
> > 8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
> > value.
> >
> > Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304
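
For reference, the discard-related attributes of a mapped device can all be
read the same way (device name as in the examples above; the values depend
on the image's object size):

# cat /sys/block/rbd28/discard_alignment
# cat /sys/block/rbd28/queue/discard_granularity
# cat /sys/block/rbd28/queue/discard_max_bytes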


Is there a way to perhaps increase the discard granularity?  The way I see
it, based on the discussion so far, here is why discard/unmap is failing to
work with VMware:

- RBD provides space in 4MB objects, which must be discarded entirely, or
at least have their tails clipped, in order to free anything.

- SCST communicates to ESXi that the discard alignment is 4MB and the
discard granularity is also 4MB (this can be verified from the initiator
side; see the sg_vpd sketch after this list).

- ESXi's VMFS5 is aligned on 1MB boundaries, so 4MB discards never actually
free anything.
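
One way to double-check what the initiator is actually being told is to
read the Block Limits VPD page from a host that sees the SCST-exported LUN
(assuming sg3_utils is installed; /dev/sdX is just a placeholder for that
LUN):

# sg_vpd --page=bl /dev/sdX

Among other limits, this page carries the optimal unmap granularity and
unmap granularity alignment fields that ESXi bases its UNMAP requests on.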

What if it were possible to set the discard granularity to 6MB?

Thank you,
Alex

>
>
> Thanks,
>
>                 Ilya
>


-- 
Alex Gorbachev
Storcium
