On Mon, Sep 15, 2014 at 6:55 AM, Andriy Gapon <[email protected]> wrote:

> On 30/03/2014 02:33, Matthew Ahrens wrote:
> >
> >
> >
> > On Mon, Mar 3, 2014 at 1:17 AM, Andriy Gapon <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >
> >     I noticed that on some of our systems we were getting a clearly
> abnormal number
> >     of l2arc checksum errors accounted in l2_cksum_bad.  The hardware
> appeared to be
> >     in good health.  Using DTrace I noticed that the data seemed to be
> overwritten
> >     with other data.  After more DTrace analysis I observed that
> sometimes
> >     l2arc_write_buffers() would advance l2ad_hand by more than target_sz.
> >     This meant that l2arc_write_buffers() would write beyond a region
> cleared by
> >     l2arc_evict() and thus overwrite data belonging to non-evicted
> buffers.  Havoc
> >     ensues.
> >
> >     The cache devices in question are all SSDs with logical sector size
> of 4KB.
> >     I am not sure about other ZFS platforms, but on FreeBSD this fact is
> detected
> >     and ashift of 12 is used for the cache vdevs.
> >
> >     Looking at l2arc_write_buffers() code you can see that it properly
> accounts for
> >     a vdev ashift when actually writing buffers and advancing l2ad_hand:
> >                             /*
> >                              * Keep the clock hand suitably
> device-aligned.
> >                              */
> >                             buf_p_sz =
> vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
> >                             write_psize += buf_p_sz;
> >                             dev->l2ad_hand += buf_p_sz;
> >
> >     But the same is not done when selecting buffers to be written and
> checking that
> >     target_sz is not exceeded.
> >     So, if ARC contains a lot of buffers smaller than 4K that means that
> on-disk
> >     size of the L2ARC buffers could be quite larger than their logical
> size.
> >
> >     I propose the following patch which has been tested and seems to fix
> the problem
> >     without introducing any side effects:
> >
> https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size.diff
> >
> https://github.com/avg-I/freebsd/compare/review;l2arc-write-target-size
> >
> >
> > Your analysis and fix look correct to me.
> >
> > I think you have misnamed "buf_p_sz".  It is the asize (Allocation
> Size), not
> > the psize.  Thus a more appropriate name would be "buf_asize" or
> "buf_a_sz".
> >  Likewise, you should create a new variable "write_asize" to use instead
> of
> > "write_psize" here.  (looks like you've copied the existing incorrect
> name from
> > line 5133; that should be renamed too)
> >
>
> Sorry for taking so long to get back.
> I looked at the code and it seems that there is a potential for a small
> confusion with "asize" there.  In some cases it means "allocation size" as
> you
> point out, but in other cases it seems to stand for something like "actual
> size".  For example, b_asize of l2arc_buf_hdr_t stores a size of
> b_tmp_cdata,
> which could be equal to b_size if the buffer is not compressed or to some
> smaller value otherwise.
> Also, arcstat_l2_size accounts logical size of data in L2ARC,
> arcstat_l2_asize
> accounts "actual" size of data adjusted for compression  -- in a sense,
> should
> this be called physical size instead?  No kstat currently accounts for
> allocated / on-disk size of data, which could be greater than
> arcstat_l2_asize
> because of rounding up to a sector size.
>
> Could you please clarify the canonical nomenclature of size, psize and
> asize?
> I will the adjust the code to have proper variable names.
> Perhaps, arcstat_l2_asize should actually account allocated size as well?
> I
> think that space usage for an L2 device should definitely reflect allocated
> size.  That is, integral of vdev_psize_to_asize(b_asize) as opposed to
> just b_asize.
>
>
Here is the logic as it applies to block pointers.  The L2ARC doesn't use
that code path, but I think it should be clear how the concepts apply to
the L2ARC blocks.

size -- I would assume "logical size" (aka "lsize") -- the amount of space
as seen by the consumer, i.e. before compression or any rounding.  A
multiple of SPA_MINBLOCKSIZE (512 bytes), and typically (but not
necessarily) a power of 2.  E.g. 128KB

psize -- "physical" size -- the size after compression.  A multiple of
SPA_MINBLOCKSIZE (512 bytes).  If compression was enabled, typically not a
power of 2.  E.g. 34.5KB

asize -- "allocated" size -- the amount of space that was allocated from
the top-level VDEV.  Includes parity and rounding for RAID-Z, and rounding
up to the sector size ("ashift"), which is often 4KB.  A multiple of the
sector size (typically either 512B or 4KB).  Block Pointers' asize includes
the "ditto blocks" (i.e. it is the sum of the asize of its (up to 3)
constituent DVAs).  E.g. 36KB.

--matt
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to