Looking through the comment in zfs_zone.c:

51: you start talking about an "ongoing average" but don't say what it is an average of. This sentence might make more sense after you explain the utilization metric (which I assume is what the average is of).
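For concreteness, here is a minimal sketch of what I assume "utilization" and its ongoing average mean -- (syscalls in an interval) * (their average latency), folded into a decayed average. None of this is from the webrev; the struct, the names, the decay factor, and the once-per-interval update are all my guesses:

#include <stdint.h>

/*
 * Hypothetical per-zone state, purely for illustration; these are not
 * the structures or names from the webrev.
 */
typedef struct zone_io_stats {
	uint64_t zs_ops;	/* I/O syscalls completed this interval */
	uint64_t zs_lat_total;	/* sum of their latencies (usec) */
	uint64_t zs_util_avg;	/* ongoing average of utilization (usec) */
} zone_io_stats_t;

/*
 * Assumed to run once per interval (whether that is every second or
 * every syscall is my question at line 63 below).
 */
static void
zone_util_update(zone_io_stats_t *zs)
{
	/*
	 * ops * average latency is just the total latency, i.e. the
	 * time this zone spent in I/O syscalls during the interval.
	 */
	uint64_t util = zs->zs_lat_total;

	/* Fold into the ongoing average; the 3/4 decay is an arbitrary guess. */
	zs->zs_util_avg = (zs->zs_util_avg * 3 + util) / 4;

	zs->zs_ops = 0;
	zs->zs_lat_total = 0;
}

If that is roughly right, saying so explicitly in the comment at 51 would clear up most of my questions below.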
63: when does this increment or decrement of the delay happen? every second? every syscall?

72: usage vs utilization -- if they are the same, use the same word throughout

86: "not taken into account" meaning their delay is zero? If so, say that.

131: why ARGSUSED here but not on the other funcs?

146: "real" isn't very informative. perhaps "kernel"?

165: "recalculated" = incremented or decremented by zfs_zone_delay_step? (See the sketch at the bottom of this mail for the behavior I have in mind.)

175: does this apply to both reads and writes? Seems that writes would not be as affected by random vs sequential, because often the syscall doesn't issue any reads (e.g. a block-aligned write where the indirect blocks are cached)

177: what does "system" mean here? is there also "user" latency?

178: The utilization-based algorithm you outlined above doesn't work? This comment is very confusing to me.

182: I continue to be confused. reads vs writes is a non sequitur here. You're talking about latency here, not utilization? I thought the idea behind the throttle was that individual syscall latency isn't important, it's utilization (number of syscalls * average latency)?

182: "system" -> "system call"?

184: how does this tunable prevent it? what is being limited?

194: Are you saying that the definition of utilization from line 54 corresponds closely with the %b of "the disk"? Assuming a single-disk pool? How do async writes factor into that? seems like they would not count much towards "utilization" (def. line 54), but they do count towards disk %b. How would multi-disk pools factor into this? Are you talking about the pool's %b? Or all disks in the pool's %b > this?

197: system average of what?

198: zone count of what?

207: this comment seems at odds with line 164. What variable controls how often the delays are recalculated?

242-257: it seems like these should all be per-pool, at least. Can you explain how this works if I have multiple pools?

243: still seems like you're assuming there's one disk in the system. How do you account for multiple pools? If zone A is hammering pool X, then its writes to pool Y will also be delayed? What if I have 2 zones each hammering their own pool?

252: What if I have 2 pools, and they are being synced at different rates? will this average the rates?

268: The comment described the behavior without mentioning this variable. So what does it do?

281: above normal what? sync time? I assume normal is defined by past observed behavior?

286: what does this variable do?

On Thu, Oct 3, 2013 at 1:01 PM, Jerry Jelinek <[email protected]> wrote:

> Matt,
>
> Thanks for your initial comments on this. I finally had a chance to go through these and address them. I know you wanted to spend more time looking at the code, but in the meantime I put up a new webrev which has a lot more comments and hopefully makes things easier to understand. The major changes are all in zfs_zone.c. The new webrev is at:
>
> http://us-east.manta.joyent.com/jjelinek/public/webrevs/zfs-zone-io-throttle-2/index.html
>
> The previous webrev is still in place if needed.
>
> I have a few specific responses in-line below as well. For everything where you asked for more/better comments, that has been done.
>
> On Sat, Sep 14, 2013 at 9:17 PM, Matthew Ahrens <[email protected]> wrote:
>
>> I think that zfs_zone.c could benefit from a "big theory statement" explaining what the throttle & queueing modifications attempt to achieve and how it goes about doing that. E.g. how the delay is calculated, and how the priority of zones is calculated and how it affects queueing.
>> See the comment about the existing (per-pool) write throttle in dsl_pool.c for an example.
>
> There is a new big theory comment now.
>
>> I am curious if this could be adapted to work with the zone-like capabilities of the other platforms that ZFS runs on (FreeBSD, Linux). CC'ing devs from those platforms.
>>
>> The individual tuneables seem well documented, but some of their relevance is hard to discern without an explanation of the whole system. For example, the comment above zfs_zone_rw_lat_limit -- it is unclear why such a problem would arise, without having some idea of how the delay is calculated.
>>
>> Another example: it seems that in several places you may be assuming that there is only one resource being contended (which you call "the disk" or "the queue"). In reality there may be many disks, each with their own (per-io-type) queues, which are part of several pools. A comment could explain how your algorithms behave in the presence of multiple disks and pools.
>>
>> Some specific comments follow, but I will need more time to understand the code.
>>
>> zio.h: can you put all the timestamps next to each other in the struct (i.e. io_timestamp next to the ones you're adding). How is io_start different from io_timestamp?
>
> Done. I looked at this and I think they are equivalent so I switched the code to use io_timestamp.
>
>> zfs_zone.c:
>> 2: you may wish to use the "version 1.0 only" copyright text, with the link to illumos.org rather than a website which no longer exists. See mdb_typedef.c (copyright joyent!) for an example.
>>
>> 103: what are the units for zfs_zone_delay_step and zfs_zone_delay_ceiling?
>>
>> 143: comment for zfs_zone_adjust_time?
>>
>> 166: are you assuming that there is one disk in the system?
>
> no, added comments to clarify all of these values
>
>> 183: "zone scheduling" is the delay that this code inserts?
>
> no, this is for enabling/disabling the scheduler. this should be clearer now with the "big theory" comment
>
>> 195: anomalous capitalization of TXG. Assume you mean "sync"; we don't use the term "flush" elsewhere.
>>
>> 197: comment explaining what the "scale" is and when to change it?
>>
>> 217: what do you mean by "number of ops"? currently pending ops? average ops per second?
>>
>> 229: comment says returns true/false; code returns a time delta (in microseconds in a hrtime_t? good chance for confusion there.)
>
> Corrected.
>
>> 321: fix formatting
>>
>> 330: should return boolean_t and use B_TRUE/B_FALSE.
>
> done
>
>> 723: magic number 20?
>
> changed to a #define with a comment explaining
>
>> 771: each zone knows how many ops it has in "the" queue? You mean in "each" queue? There is a queue for each leaf vdev (and each type).
>
> Right. This is implied in the first paragraph of the block comment but I added some more text to that paragraph to try to clarify that.
>
>> zone.h: 539: the comment implies that zone_wr_ops tracks "writes and logical writes", but I think it just tracks write i/os, and zone_lwr_ops tracks logical writes.
>
> Thanks again,
> Jerry
>
>> On Fri, Aug 30, 2013 at 1:09 PM, Jerry Jelinek <[email protected]> wrote:
>>
>>> I think we put this out here for code review a couple of years ago but it fell through the cracks.
>>> Matt's recent changes for:
>>>
>>> 4045 zfs write throttle & i/o scheduler performance work
>>>
>>> prompted us to put a new illumos-gate repo together for our throttle changes. There is a webrev at:
>>>
>>> http://us-east.manta.joyent.com/jjelinek/public/webrevs/zfs-zone-io-throttle/index.html
>>>
>>> This is our latest code which has been merged and updated to work with Matt's changes for 4045.
>>>
>>> Let me know if you have any comments and if this looks like something that could get upstreamed.
>>>
>>> Thanks,
>>> Jerry
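P.S. To make my questions at 63, 165, and 184 concrete, below is the per-zone delay recalculation I *think* your comments describe. This is only my reading, not code from the webrev: zfs_zone_delay_step and zfs_zone_delay_ceiling are the real tunable names, but the placeholder values, the units, and the over/under-utilization test are my assumptions.

#include <stdbool.h>
#include <stdint.h>

/*
 * Not the webrev's code -- just the recalculation I think the comments
 * at 63/165/184 describe. The tunable names are real; the values here
 * are placeholders (units are my question at 103).
 */
uint64_t zfs_zone_delay_step = 5;
uint64_t zfs_zone_delay_ceiling = 100;

static uint64_t
zone_delay_recalc(uint64_t cur_delay, bool over_utilized)
{
	if (over_utilized) {
		/* Zone is over its share: increase its delay, up to the cap. */
		cur_delay += zfs_zone_delay_step;
		if (cur_delay > zfs_zone_delay_ceiling)
			cur_delay = zfs_zone_delay_ceiling;
	} else if (cur_delay >= zfs_zone_delay_step) {
		/* Under its share: back the delay off by one step. */
		cur_delay -= zfs_zone_delay_step;
	} else {
		cur_delay = 0;
	}
	return (cur_delay);
}

If the real logic differs (e.g. the ceiling applies elsewhere, or the decrement isn't symmetric with the increment), that is exactly the sort of thing I'd like the comments to spell out.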
