On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney <[email protected]> wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
>> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <[email protected]> wrote:
>> >
>> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> >> i would like to submit the attached bioq patch for review and
>> >> comments. this is a proof of concept. it helps with smoothing disk read
>> >> service times and appears to eliminate outliers. please see the
>> >> attached pictures (about a week's worth of data)
>> >>
>> >> - c034 "control" unmodified system
>> >> - c044 patched system
>> >
>> > Can you describe how you got this data? Were you using the gstat
>> > code or some other code?
>>
>> Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think you are... It accumulates time for a set
> of operations and then divides by the count... So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of it. however, i'm not aware of "better" tools. we also
use dtrace and PCM/PMC. ktrace is not particularly usable for us because
it does not really work well when we push the system above 5 Gbps. in
order to actually see any "issues" we need to push the system to the
10 Gbps range at least.
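to make the concern concrete: the number gstat-style tools report is
basically an accumulated duration divided by an operation count per
interval, so a single slow request disappears into the average. a toy
example (field names are made up here, this is not the actual devstat(9)
layout):

/*
 * toy illustration of the "accumulate and divide" service time above.
 * one 500ms outlier buried in an interval of 5ms reads barely moves
 * the per-interval average.
 */
#include <stdio.h>

struct snap {
	double		read_secs;	/* cumulative time spent in reads */
	unsigned long	reads;		/* cumulative completed reads */
};

static double
ms_per_read(const struct snap *prev, const struct snap *now)
{
	unsigned long n = now->reads - prev->reads;

	if (n == 0)
		return (0.0);
	return ((now->read_secs - prev->read_secs) / n * 1000.0);
}

int
main(void)
{
	struct snap prev = { 0.0, 0 };
	/* 99 reads at 5ms each plus a single 500ms outlier */
	struct snap now = { 99 * 0.005 + 0.500, 100 };

	/* prints roughly 10 ms/read; the 500ms outlier is invisible */
	printf("%.1f ms/read\n", ms_per_read(&prev, &now));
	return (0);
}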
>> >> graphs show max/avg disk read service times for both systems across 36
>> >> spinning drives. both systems are relatively busy serving production
>> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> >> represent time when the systems are refreshing their content, i.e.
>> >> disks are both reading and writing at the same time.
>> >
>> > Can you describe why you think this change makes an improvement? Unless
>> > you're running 10k or 15k RPM drives, 128 seems like a large number..
>> > as that's about half the number of IOPs that a normal HD handles in a
>> > second..
>>
>> Our (Netflix) load is basically random disk io. We have tweaked the
>> system to ensure that our io path is "wide" enough, i.e. we read 1MB per
>> disk io for the majority of the requests. However, the offsets we read
>> from are all over the place. It appears that we are getting into a
>> situation where larger offsets are getting delayed because smaller
>> offsets are "jumping" ahead of them. Forcing a bioq insert tail
>> operation and effectively moving the insertion point seems to help us
>> avoid getting into this situation. And, no, we don't use 10k or 15k
>> drives. Just regular enterprise 7200 SATA drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads?  so it's more like every 16 reads in your work load that you
> insert the "ordered" io...

i'm not sure where 128kb comes from. are you referring to
MAXPHYS/DFLTPHYS? if so, then no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps. i've
experimented with different numbers on our system and i did not see much
of a difference on our workload. i'm at a value of 1024 now. higher
numbers seem to produce a slightly bigger difference between average and
max time, but i do not think it's statistically meaningful. the general
shape of the curve remains smooth for all values tried so far.

[...]

>> > Also, do you see a similar throughput of the system?
>>
>> Yes. We do see almost identical throughput from both systems. I have
>> not pushed the system to its limit yet, but having much smoother disk
>> read service time is important for us because we use it as one of the
>> components of our system health metrics. We also need to ensure that a
>> disk io request is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads?  Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with a
dtrace probe or something. our application (or rather, its clients) is
_very_ sensitive to latency. having read service time outliers is not
very good for us.
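a rough sketch of what measuring "time between read and return" at the
application layer could look like (just an illustration, not our actual
tooling; the file argument, read size and read count are made up):

/*
 * sketch: time random 1MB pread()s against a file and report avg/max
 * service time as seen by the application.
 */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define READ_SIZE	(1024 * 1024)	/* 1MB per io, as in the workload above */
#define NREADS		1000

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	struct stat sb;
	double ms, total_ms = 0.0, max_ms = 0.0;
	char *buf;
	off_t off, nblocks;
	int fd, i;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return (1);
	if (fstat(fd, &sb) < 0 || sb.st_size < READ_SIZE)
		return (1);
	if ((buf = malloc(READ_SIZE)) == NULL)
		return (1);
	nblocks = sb.st_size / READ_SIZE;

	for (i = 0; i < NREADS; i++) {
		/* random 1MB-aligned offset somewhere in the file */
		off = (arc4random() % nblocks) * (off_t)READ_SIZE;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (pread(fd, buf, READ_SIZE, off) != READ_SIZE)
			break;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
		total_ms += ms;
		if (ms > max_ms)
			max_ms = ms;
	}
	printf("reads %d  avg %.2f ms  max %.2f ms\n",
	    i, i ? total_ms / i : 0.0, max_ms);
	free(buf);
	close(fd);
	return (0);
}

and, for reference, the shape of the change being discussed is roughly
the following. this is a sketch, not the actual patch: the wrapper
struct, the "batched" counter and the BIOQ_BATCH threshold are made-up
names. the idea, as described above, is that every Nth request is forced
through bioq_insert_tail(), which moves the queue's insertion point so
that far-away offsets cannot keep being jumped ahead of:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>

#define BIOQ_BATCH	1024	/* values from 128 to 1024 tried above */

/* in a real patch the counter would live in struct bio_queue_head */
struct bioq_capped {
	struct bio_queue_head	head;
	int			batched;	/* sorted inserts since last forced tail */
};

static void
bioq_disksort_capped(struct bioq_capped *q, struct bio *bp)
{
	if (++q->batched >= BIOQ_BATCH) {
		/*
		 * Force a tail insert.  This also moves the queue's
		 * insertion point to this bio, so requests that sort
		 * "before" it can no longer jump ahead of it.
		 */
		bioq_insert_tail(&q->head, bp);
		q->batched = 0;
	} else
		bioq_disksort(&q->head, bp);	/* normal sorted insert */
}

(with 1MB ios and < 100 IOPs per disk, a threshold of 128..1024 requests
is only reached on the order of every few seconds per queue, which may be
why the exact value makes little difference in our measurements above.)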
> Have you looked at the geom disk schedulers work that Luigi did a few
> years back?  There have been known issues w/ our io scheduler for a
> long time...  If you search the mailing lists, you'll see lots of
> reports of some processes starving out others, probably due to a
> similar issue...  I've seen similar unfair behavior between processes,
> but didn't spend time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :)

i'm interested to hear from people who have a different workload
profile, for example lots of iops, i.e. very small file reads or
something like that.

thanks,
max

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[email protected]"
