Looking through the comment in zfs_zone.c:

51: you start talking about an "ongoing average" but don't say what it is an average of. This sentence might make more sense after you explain the utilization metric (which I assume is what the average is of).
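For concreteness, here is a minimal sketch of what I assume "utilization" and its ongoing average mean -- (syscalls in an interval) * (their average latency), folded into a decayed average. None of this is from the webrev; the struct, the names, the decay factor, and the once-per-interval update are all my guesses:

#include <stdint.h>

/*
 * Hypothetical per-zone state, purely for illustration; these are not
 * the structures or names from the webrev.
 */
typedef struct zone_io_stats {
	uint64_t zs_ops;	/* I/O syscalls completed this interval */
	uint64_t zs_lat_total;	/* sum of their latencies (usec) */
	uint64_t zs_util_avg;	/* ongoing average of utilization (usec) */
} zone_io_stats_t;

/*
 * Assumed to run once per interval (whether that is every second or
 * every syscall is my question at line 63 below).
 */
static void
zone_util_update(zone_io_stats_t *zs)
{
	/*
	 * ops * average latency is just the total latency, i.e. the
	 * time this zone spent in I/O syscalls during the interval.
	 */
	uint64_t util = zs->zs_lat_total;

	/* Fold into the ongoing average; the 3/4 decay is an arbitrary guess. */
	zs->zs_util_avg = (zs->zs_util_avg * 3 + util) / 4;

	zs->zs_ops = 0;
	zs->zs_lat_total = 0;
}

If that is roughly right, saying so explicitly in the comment at 51 would clear up most of my questions below.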
63: when does this increment or decrement of the delay happen? every second? every syscall?

72: usage vs utilization -- if they are the same, use the same word throughout

86: "not taken into account" meaning their delay is zero? If so, say that.

131: why ARGSUSED here but not on the other funcs?

146: "real" isn't very informative. perhaps "kernel"?

165: "recalculated" = incremented or decremented by zfs_zone_delay_step? (See the sketch at the bottom of this mail for the behavior I have in mind.)

175: does this apply to both reads and writes? Seems that writes would not be as affected by random vs sequential, because often the syscall doesn't issue any reads (e.g. a block-aligned write where the indirect blocks are cached)

177: what does "system" mean here? is there also "user" latency?

178: The utilization-based algorithm you outlined above doesn't work? This comment is very confusing to me.

182: I continue to be confused. reads vs writes is a non sequitur here. You're talking about latency here, not utilization? I thought the idea behind the throttle was that individual syscall latency isn't important, it's utilization (number of syscalls * average latency)?

182: "system" -> "system call"?

184: how does this tunable prevent it? what is being limited?

194: Are you saying that the definition of utilization from line 54 corresponds closely with the %b of "the disk"? Assuming a single-disk pool? How do async writes factor into that? seems like they would not count much towards "utilization" (def. line 54), but they do count towards disk %b. How would multi-disk pools factor into this? Are you talking about the pool's %b? Or all disks in the pool's %b > this?

197: system average of what?

198: zone count of what?

207: this comment seems at odds with line 164. What variable controls how often the delays are recalculated?

242-257: it seems like these should all be per-pool, at least. Can you explain how this works if I have multiple pools?

243: still seems like you're assuming there's one disk in the system. How do you account for multiple pools? If zone A is hammering pool X, then its writes to pool Y will also be delayed? What if I have 2 zones each hammering their own pool?

252: What if I have 2 pools, and they are being synced at different rates? will this average the rates?

268: The comment described the behavior without mentioning this variable. So what does it do?

281: above normal what? sync time? I assume normal is defined by past observed behavior?

286: what does this variable do?

On Thu, Oct 3, 2013 at 1:01 PM, Jerry Jelinek <[email protected]> wrote:

> Matt,
>
> Thanks for your initial comments on this. I finally had a chance to go through these and address them. I know you wanted to spend more time looking at the code, but in the meantime I put up a new webrev which has a lot more comments and hopefully makes things easier to understand. The major changes are all in zfs_zone.c. The new webrev is at:
>
> http://us-east.manta.joyent.com/jjelinek/public/webrevs/zfs-zone-io-throttle-2/index.html
>
> The previous webrev is still in place if needed.
>
> I have a few specific responses in-line below as well. For everything where you asked for more/better comments, that has been done.
>
> On Sat, Sep 14, 2013 at 9:17 PM, Matthew Ahrens <[email protected]> wrote:
>
>> I think that zfs_zone.c could benefit from a "big theory statement" explaining what the throttle & queueing modifications attempt to achieve and how it goes about doing that. E.g. how the delay is calculated, and how the priority of zones is calculated and how it affects queueing.
>> See the comment about the existing (per-pool) write throttle in dsl_pool.c for an example.
>
> There is a new big theory comment now.
>
>> I am curious if this could be adapted to work with the zone-like capabilities of the other platforms that ZFS runs on (FreeBSD, Linux). CC'ing devs from those platforms.
>>
>> The individual tuneables seem well documented, but some of their relevance is hard to discern without an explanation of the whole system. For example, the comment above zfs_zone_rw_lat_limit -- it is unclear why such a problem would arise, without having some idea of how the delay is calculated.
>>
>> Another example: it seems that in several places you may be assuming that there is only one resource being contended (which you call "the disk" or "the queue"). In reality there may be many disks, each with their own (per-io-type) queues, which are part of several pools. A comment could explain how your algorithms behave in the presence of multiple disks and pools.
>>
>> Some specific comments follow, but I will need more time to understand the code.
>>
>> zio.h: can you put all the timestamps next to each other in the struct (i.e. io_timestamp next to the ones you're adding). How is io_start different from io_timestamp?
>
> Done. I looked at this and I think they are equivalent so I switched the code to use io_timestamp.
>
>> zfs_zone.c:
>> 2: you may wish to use the "version 1.0 only" copyright text, with the link to illumos.org rather than a website which no longer exists. See mdb_typedef.c (copyright joyent!) for an example.
>>
>> 103: what are the units for zfs_zone_delay_step and zfs_zone_delay_ceiling?
>>
>> 143: comment for zfs_zone_adjust_time?
>>
>> 166: are you assuming that there is one disk in the system?
>
> no, added comments to clarify all of these values
>
>> 183: "zone scheduling" is the delay that this code inserts?
>
> no, this is for enabling/disabling the scheduler. this should be clearer now with the "big theory" comment
>
>> 195: anomalous capitalization of TXG. Assume you mean "sync"; we don't use the term "flush" elsewhere.
>>
>> 197: comment explaining what the "scale" is and when to change it?
>>
>> 217: what do you mean by "number of ops"? currently pending ops? average ops per second?
>>
>> 229: comment says returns true/false; code returns a time delta (in microseconds in a hrtime_t? good chance for confusion there.)
>
> Corrected.
>
>> 321: fix formatting
>>
>> 330: should return boolean_t and use B_TRUE/B_FALSE.
>
> done
>
>> 723: magic number 20?
>
> changed to a #define with a comment explaining
>
>> 771: each zone knows how many ops it has in "the" queue? You mean in "each" queue? There is a queue for each leaf vdev (and each type).
>
> Right. This is implied in the first paragraph of the block comment but I added some more text to that paragraph to try to clarify that.
>
>> zone.h: 539: the comment implies that zone_wr_ops tracks "writes and logical writes", but I think it just tracks write i/os, and zone_lwr_ops tracks logical writes.
>
> Thanks again,
> Jerry
>
>> On Fri, Aug 30, 2013 at 1:09 PM, Jerry Jelinek <[email protected]> wrote:
>>
>>> I think we put this out here for code review a couple of years ago but it fell through the cracks.
>>> Matt's recent changes for:
>>>
>>> 4045 zfs write throttle & i/o scheduler performance work
>>>
>>> prompted us to put a new illumos-gate repo together for our throttle changes. There is a webrev at:
>>>
>>> http://us-east.manta.joyent.com/jjelinek/public/webrevs/zfs-zone-io-throttle/index.html
>>>
>>> This is our latest code which has been merged and updated to work with Matt's changes for 4045.
>>>
>>> Let me know if you have any comments and if this looks like something that could get upstreamed.
>>>
>>> Thanks,
>>> Jerry
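P.S. To make my questions at 63, 165, and 184 concrete, below is the per-zone delay recalculation I *think* your comments describe. This is only my reading, not code from the webrev: zfs_zone_delay_step and zfs_zone_delay_ceiling are the real tunable names, but the placeholder values, the units, and the over/under-utilization test are my assumptions.

#include <stdbool.h>
#include <stdint.h>

/*
 * Not the webrev's code -- just the recalculation I think the comments
 * at 63/165/184 describe. The tunable names are real; the values here
 * are placeholders (units are my question at 103).
 */
uint64_t zfs_zone_delay_step = 5;
uint64_t zfs_zone_delay_ceiling = 100;

static uint64_t
zone_delay_recalc(uint64_t cur_delay, bool over_utilized)
{
	if (over_utilized) {
		/* Zone is over its share: increase its delay, up to the cap. */
		cur_delay += zfs_zone_delay_step;
		if (cur_delay > zfs_zone_delay_ceiling)
			cur_delay = zfs_zone_delay_ceiling;
	} else if (cur_delay >= zfs_zone_delay_step) {
		/* Under its share: back the delay off by one step. */
		cur_delay -= zfs_zone_delay_step;
	} else {
		cur_delay = 0;
	}
	return (cur_delay);
}

If the real logic differs (e.g. the ceiling applies elsewhere, or the decrement isn't symmetric with the increment), that is exactly the sort of thing I'd like the comments to spell out.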
