On Wed, Dec 4, 2013 at 6:58 PM, Saso Kiselkov <[email protected]> wrote:

> On 12/4/13, 11:25 PM, Matthew Ahrens wrote:
> > Saso, thanks for your continued work to clean this up.  Here are some
> > more comments:
>
> Hi Matt,
>
> Thanks for the review, responses below:
>
> >  - terminology: let's use similar terms as with the intent log: there
> > are not several logs on each l2arc device.  There is one log composed of
> > a linked list of log blocks.  (So, s/log/log block/ in the code/comments)
>
> Ok, will build this in.
>
> >  - why do we need to persist the L2ARC_PERSIST_FLAGS? (e.g. L2CACHE -
> > aren't they all in l2arc)
>
> I'm not sure I understand. Do you mean that we should *not* rebuild the
> buffer flags when rebuilding buffers? While it is possible that we can
> infer two of the flag values (ARC_IN_HASH_TABLE and ARC_L2CACHE), we
> can't certainly say what the right value for ARC_L2COMPRESS and
> ARC_PREFETCH should be (as these relate to the state the pool was in
> when the buffer got into ARC in the first place).
>

I'm suggesting that we rebuild the flags; I'm just not sure what the benefit
of storing them on disk is.  It seems like ARC_IN_HASH_TABLE and ARC_L2CACHE
should always be set on a reconstructed buffer, so they don't need to be
persisted.  Why should ARC_PREFETCH ever be set when we reconstruct?  And
can't we set ARC_L2COMPRESS based on LE_GET_COMPRESS()?
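
Roughly what I have in mind, as a standalone sketch (the flag values, the
log-entry layout and the LE_GET_COMPRESS() unpacking below are placeholders
of mine, not the webrev's actual definitions):

#include <stdint.h>

/* Placeholder flag values; the real ones live in arc.c. */
#define ARC_IN_HASH_TABLE       (1 << 0)
#define ARC_L2CACHE             (1 << 1)
#define ARC_L2COMPRESS          (1 << 2)

/* Placeholder compression constant and log-entry layout. */
#define ZIO_COMPRESS_OFF        0
typedef struct l2arc_log_ent {
        uint64_t        le_prop;        /* packed buffer properties */
} l2arc_log_ent_t;

/* Assumed to unpack the compression algorithm from le_prop. */
#define LE_GET_COMPRESS(le)     ((le)->le_prop & 0xff)

/*
 * Derive the ARC flags for a reconstructed buffer instead of reading
 * them back from disk: every rebuilt buffer is by definition hashed
 * and L2-cached, and compression can be inferred from the log entry.
 * ARC_PREFETCH is deliberately left clear on reconstruct.
 */
static uint32_t
l2arc_rebuild_flags(const l2arc_log_ent_t *le)
{
        uint32_t flags = ARC_IN_HASH_TABLE | ARC_L2CACHE;

        if (LE_GET_COMPRESS(le) != ZIO_COMPRESS_OFF)
                flags |= ARC_L2COMPRESS;
        return (flags);
}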


>
> >  - L2ARC_CHECK_REBUILD_TIMEOUT - unused macro
>
> Oops, a bit of obsolete code that I forgot to cut out after the first
> round of refactoring.
>
> >  - l2arc_rebuild_timeout: comment implies that rebuild happens before
> > import completes; i thought it rebuilt in background?
>
> While rebuild does happen in the background, it is quite a heavy
> operation that can make the system rather busy, so I wanted to make sure
> it is bounded. I've coded a non-trivial amount of real-time code in the
> past and it's just become kind of a reflex for me.
>

Thankfully this is not realtime code :-)


>
> >  - calling it an uberblock is confusing (e.g. line 4415); maybe
> > l2arc_dev_hdr_t?
>
> Ok.
>
> >  - reconstruct time is not that critical. We could simplify the code and
> > on-disk format, and make it easier to convince ourselves that the code
> > is correct, if we remove the prefetch & double pointer stuff from
> > reconstruct code.  It should be sufficient to issue one i/o at a time
> > (rather than 2), perhaps issuing the i/o for the next zio while creating
> > l2arc_hdr_t’s for this block's entries.
>
> Unfortunately, I did experiments with that and rebuild suffers quite a
> bit (roughly an order of magnitude slower) unless you get the
> prefetching *just* right (i.e. keep the device busy all the time).
> Reconstructing one log block's buffers in memory takes a minuscule
> amount of time compared to access latencies even on SSDs, so without
> prefetching the rebuild will be mostly read-latency limited. My tests
> seemed to indicate that an average-size L2ARC rebuild (with the
> prefetching that's in the webrev) should take on the order of 10-20
> seconds. Without that, you're looking at probably several minutes' worth
> of churn on the L2ARC device.
>

I must not be understanding the code then (which is why I would prefer it
to be simpler).  It looks like there are at most 2 concurrent i/os going
on, one for each of the 2 linked lists.  So how can it be more than 2x
faster than a single linked list with 1 concurrent i/o?
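
To make the single-list alternative concrete, here is a rough standalone
sketch of the kind of pipelining I mean; the read/wait primitives and the
log-block layout are placeholders, not the real zio interfaces:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical fixed-size log block: entries plus a back pointer. */
typedef struct log_blk {
        uint64_t        lb_prev_daddr;  /* previous log block, 0 = none */
        /* ... buffer entries ... */
} log_blk_t;

/* Hypothetical async-read primitives standing in for the zio calls. */
typedef struct read_ctx read_ctx_t;
extern read_ctx_t *dev_read_start(uint64_t daddr, void *buf, size_t len);
extern int dev_read_wait(read_ctx_t *rc);       /* returns 0 on success */
extern void rebuild_one_log_blk(const log_blk_t *lb); /* make l2arc_hdr_t's */

/*
 * Walk the single backwards-linked list of log blocks with at most one
 * read outstanding: the read for block N+1 is issued before block N's
 * entries are turned into headers, so the device stays busy while the
 * CPU does the (comparatively cheap) reconstruction work.
 */
static void
l2arc_rebuild_walk(uint64_t head_daddr)
{
        log_blk_t buf[2];       /* double buffer: being processed + in flight */
        int cur = 0;
        read_ctx_t *rc = dev_read_start(head_daddr, &buf[cur],
            sizeof (log_blk_t));

        while (rc != NULL && dev_read_wait(rc) == 0) {
                log_blk_t *lb = &buf[cur];

                /* Kick off the next read before processing this block. */
                rc = (lb->lb_prev_daddr != 0) ?
                    dev_read_start(lb->lb_prev_daddr, &buf[!cur],
                    sizeof (log_blk_t)) : NULL;

                rebuild_one_log_blk(lb);        /* overlaps with the read above */
                cur = !cur;
        }
}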


>
> >  - how can there be infinite loops of logs?  validity should be
> > determined by not straddling invalid range (between end and begin)
>
> Aside from obvious manually fabricated infinite log block loops meant to
> effect a hang or DoS, loops can arise from the design itself. The main
> reason is that log blocks point backwards, while our writing goes
> forwards, so we have no way to go back and amend "outdated" blocks. If
> we get seriously unlucky, we could potentially write a log block at
> exactly the sector offset pointed to by an ancient previously written
> log block, which would then create a loop. It's extremely unlikely, but
> it's a possibility.
>

But can't you catch that by validating that the pointer does not cross into
the invalid range of the device (between l2u_start_lps[0].l2lp_daddr and
l2u_evict_tail)?
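
As a standalone sketch of the check I have in mind (the field names come
from the webrev, but the orientation of the invalid range is an assumption
on my part):

#include <stdint.h>
#include <stdbool.h>

/*
 * The cache device is written circularly, so at any point in time there
 * is one contiguous "hole" of invalidated space between the newest log
 * block (l2u_start_lps[0].l2lp_daddr in the webrev) and the evict tail
 * (l2u_evict_tail).  A prev-block pointer that lands inside that hole is
 * stale and terminates the walk.
 */
static bool
l2arc_log_ptr_valid(uint64_t daddr, uint64_t hole_start, uint64_t hole_end)
{
        if (hole_start <= hole_end)     /* hole does not wrap */
                return (daddr < hole_start || daddr >= hole_end);
        /* hole wraps past the end of the device back to its start */
        return (daddr >= hole_end && daddr < hole_start);
}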


>
> >  - Consider storing the “uberblock” in the MOS - then we don’t have to
> > worry about atomic overwrite, self-checksum, magic number.  Update the
> > in-memory version as we do now, and just write it into the MOS every txg
> > (if it's dirty).
>
> While it would remove the aspects you mention, I suspect it wouldn't
> really simplify the code. First, we'd need to add a bunch of callbacks
> in the txg commit path to make sure we check for changes. Next, we'd need
> to construct an even more complicated data structure in the MOS, simply
> because there can be multiple L2ARC devices. Finally, I'd have to hook
> into the places where "zpool add" and "zpool remove" add and remove
> L2ARC vdevs. While none of these are particularly hard, I'm
> not convinced it'd be worthwhile simply to replace two 50-odd lines long
> functions. We already store a bit of metadata in the vdev config tree
> that tells us whether we're supposed to be looking for an uberblock or not.
>
> Anyway, gonna give it some thought and try to estimate the complexity -
> I might be completely off here.
>

You might be right.  Also, it may be that keeping it in the MOS would
increase complexity, because the MOS copy of the l2uberblock (aka
l2arc_dev_hdr_t) would lag behind new writes that had already happened, and
those writes could overwrite data that the MOS copy still considers valid.


>
> Appreciate you taking the time to review and looking forward to your
> responses.
>
> Cheers,
> --
> Saso
>