On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney <[email protected]> wrote:
> Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
>> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <[email protected]> wrote:
>> >
>> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
>> >> i would like to submit the attached bioq patch for review and
>> >> comments. this is a proof of concept. it helps with smoothing disk read
>> >> service times and appears to eliminate outliers. please see the
>> >> attached pictures (about a week's worth of data)
>> >>
>> >> - c034 "control" unmodified system
>> >> - c044 patched system
>> >
>> > Can you describe how you got this data? Were you using the gstat
>> > code or some other code?
>>
>> Yes, it's basically gstat data.
>
> The reason I ask this is that I don't think the data you are getting
> from gstat is what you think you are... It accumulates time for a set
> of operations and then divides by the count... So I'm not sure if the
> stat improvements you are seeing are as meaningful as you might think
> they are...

yes, i'm aware of it. however, i'm not aware of "better" tools. we also
use dtrace and PCM/PMC. ktrace is not particularly usable for us because
it does not really work well when we push the system above 5 Gbps. in
order to actually see any "issues" we need to push the system to the
10 Gbps range at least.
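to make the concern concrete: the number gstat-style tools report is
basically an accumulated duration divided by an operation count per
interval, so a single slow request disappears into the average. a toy
example (field names are made up here, this is not the actual devstat(9)
layout):

/*
 * toy illustration of the "accumulate and divide" service time above.
 * one 500ms outlier buried in an interval of 5ms reads barely moves
 * the per-interval average.
 */
#include <stdio.h>

struct snap {
	double		read_secs;	/* cumulative time spent in reads */
	unsigned long	reads;		/* cumulative completed reads */
};

static double
ms_per_read(const struct snap *prev, const struct snap *now)
{
	unsigned long n = now->reads - prev->reads;

	if (n == 0)
		return (0.0);
	return ((now->read_secs - prev->read_secs) / n * 1000.0);
}

int
main(void)
{
	struct snap prev = { 0.0, 0 };
	/* 99 reads at 5ms each plus a single 500ms outlier */
	struct snap now = { 99 * 0.005 + 0.500, 100 };

	/* prints roughly 10 ms/read; the 500ms outlier is invisible */
	printf("%.1f ms/read\n", ms_per_read(&prev, &now));
	return (0);
}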
>> >> graphs show max/avg disk read service times for both systems across 36
>> >> spinning drives. both systems are relatively busy serving production
>> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
>> >> represent time when the systems are refreshing their content, i.e.
>> >> disks are both reading and writing at the same time.
>> >
>> > Can you describe why you think this change makes an improvement? Unless
>> > you're running 10k or 15k RPM drives, 128 seems like a large number..
>> > as that's about half the number of IOPs that a normal HD handles in a
>> > second..
>>
>> Our (Netflix) load is basically random disk io. We have tweaked the
>> system to ensure that our io path is "wide" enough, i.e. we read 1MB per
>> disk io for the majority of the requests. However, the offsets we read
>> from are all over the place. It appears that we are getting into a
>> situation where larger offsets are getting delayed because smaller
>> offsets are "jumping" ahead of them. Forcing a bioq insert tail
>> operation and effectively moving the insertion point seems to help us
>> avoid getting into this situation. And, no, we don't use 10k or 15k
>> drives. Just regular enterprise 7200 SATA drives.
>
> I assume that the 1mb reads are then further broken up into 8 128kb
> reads?  so it's more like every 16 reads in your work load that you
> insert the "ordered" io...

i'm not sure where 128kb comes from. are you referring to
MAXPHYS/DFLTPHYS? if so, then no, we have increased *PHYS to 1MB.

> I want to make sure that we choose the right value for this number..
> What number of IOPs are you seeing?

generally we see < 100 IOPs per disk on a system pushing 10+ Gbps. i've
experimented with different numbers on our system and i did not see much
of a difference on our workload. i'm at a value of 1024 now. higher
numbers seem to produce a slightly bigger difference between average and
max time, but i do not think it's statistically meaningful. the general
shape of the curve remains smooth for all values tried so far.

[...]

>> > Also, do you see a similar throughput of the system?
>>
>> Yes. We do see almost identical throughput from both systems. I have
>> not pushed the system to its limit yet, but having much smoother disk
>> read service time is important for us because we use it as one of the
>> components of our system health metrics. We also need to ensure that a
>> disk io request is actually dispatched to the disk in a timely manner.
>
> Per above, have you measured at the application layer that you are
> getting better latency times on your reads?  Maybe by doing a ktrace
> of the io, and calculating times between read and return or something
> like that...

ktrace is not particularly useful. i can see if i can come up with a
dtrace probe or something. our application (or rather, its clients) is
_very_ sensitive to latency. having read service time outliers is not
very good for us.
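a rough sketch of what measuring "time between read and return" at the
application layer could look like (just an illustration, not our actual
tooling; the file argument, read size and read count are made up):

/*
 * sketch: time random 1MB pread()s against a file and report avg/max
 * service time as seen by the application.
 */
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define READ_SIZE	(1024 * 1024)	/* 1MB per io, as in the workload above */
#define NREADS		1000

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	struct stat sb;
	double ms, total_ms = 0.0, max_ms = 0.0;
	char *buf;
	off_t off, nblocks;
	int fd, i;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return (1);
	if (fstat(fd, &sb) < 0 || sb.st_size < READ_SIZE)
		return (1);
	if ((buf = malloc(READ_SIZE)) == NULL)
		return (1);
	nblocks = sb.st_size / READ_SIZE;

	for (i = 0; i < NREADS; i++) {
		/* random 1MB-aligned offset somewhere in the file */
		off = (arc4random() % nblocks) * (off_t)READ_SIZE;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (pread(fd, buf, READ_SIZE, off) != READ_SIZE)
			break;
		clock_gettime(CLOCK_MONOTONIC, &t1);
		ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
		total_ms += ms;
		if (ms > max_ms)
			max_ms = ms;
	}
	printf("reads %d  avg %.2f ms  max %.2f ms\n",
	    i, i ? total_ms / i : 0.0, max_ms);
	free(buf);
	close(fd);
	return (0);
}

and, for reference, the shape of the change being discussed is roughly
the following. this is a sketch, not the actual patch: the wrapper
struct, the "batched" counter and the BIOQ_BATCH threshold are made-up
names. the idea, as described above, is that every Nth request is forced
through bioq_insert_tail(), which moves the queue's insertion point so
that far-away offsets cannot keep being jumped ahead of:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bio.h>

#define BIOQ_BATCH	1024	/* values from 128 to 1024 tried above */

/* in a real patch the counter would live in struct bio_queue_head */
struct bioq_capped {
	struct bio_queue_head	head;
	int			batched;	/* sorted inserts since last forced tail */
};

static void
bioq_disksort_capped(struct bioq_capped *q, struct bio *bp)
{
	if (++q->batched >= BIOQ_BATCH) {
		/*
		 * Force a tail insert.  This also moves the queue's
		 * insertion point to this bio, so requests that sort
		 * "before" it can no longer jump ahead of it.
		 */
		bioq_insert_tail(&q->head, bp);
		q->batched = 0;
	} else
		bioq_disksort(&q->head, bp);	/* normal sorted insert */
}

(with 1MB ios and < 100 IOPs per disk, a threshold of 128..1024 requests
is only reached on the order of every few seconds per queue, which may be
why the exact value makes little difference in our measurements above.)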
> Have you looked at the geom disk schedulers work that Luigi did a few
> years back?  There have been known issues w/ our io scheduler for a
> long time...  If you search the mailing lists, you'll see lots of
> reports of some processes starving out others, probably due to a
> similar issue...  I've seen similar unfair behavior between processes,
> but didn't spend time tracking it down...

yes, we have looked at it. it makes things worse for us, unfortunately.

> It does look like a good improvement though...
>
> Thanks for the work!

ok :)

i'm interested to hear from people who have a different workload
profile, for example lots of iops, i.e. very small file reads or
something like that.

thanks,
max

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[email protected]"
