Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <[email protected]> wrote:
> >
> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> >> i would like to submit the attached bioq patch for review and
> >> comments. this is a proof of concept. it helps with smoothing disk read
> >> service times and appears to eliminate outliers. please see the attached
> >> pictures (about a week's worth of data)
> >>
> >> - c034 "control" unmodified system
> >> - c044 patched system
> >
> > Can you describe how you got this data? Were you using the gstat
> > code or some other code?
>
> Yes, it's basically gstat data.
The reason I ask is that I don't think the data you are getting
from gstat is what you think it is... It accumulates time for a set
of operations and then divides by the count... So I'm not sure the
stat improvements you are seeing are as meaningful as you might think
they are...
> > Also, was your control system w/ the patch, but w/ the sysctl set to
> > zero to possibly eliminate any code alignment issues?
>
> Both systems use the same code base and build. Patched system has patch
> included, "control" system does not have the patch. I can rerun my tests with
> sysctl set to zero and use it as "control". So, the answer to your question
> is "no".
I don't believe the code would make a difference; I mostly wanted to
know what the control was...
> >> graphs show max/avg disk read service times for both systems across 36
> >> spinning drives. both systems are relatively busy serving production
> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
> >> represent time when systems are refreshing their content, i.e. disks
> >> are both reading and writing at the same time.
> >
> > Can you describe why you think this change makes an improvement? Unless
> > you're running 10k or 15k RPM drives, 128 seems like a large number... as
> > that's about half the number of IOPs that a normal HD handles in a second...
>
> Our (Netflix) load is basically random disk io. We have tweaked the system to
> ensure that our io path is "wide" enough, i.e. we read 1mb per disk io for the
> majority of requests. However, the offsets we read from are all over the
> place. It appears that we are getting into a situation where larger offsets are
> getting delayed because smaller offsets are "jumping" ahead of them. Forcing a
> bioq insert tail operation, and effectively moving the insertion point, seems
> to help avoid this situation. And, no, we don't use 10k or 15k
> drives. Just regular enterprise 7200 sata drives.
I assume that the 1mb reads are then further broken up into 8 128kb
reads? So it's more like every 16 reads in your workload that you
insert the "ordered" io...
I want to make sure that we choose the right value for this number..
What number of IOPs are you seeing?
> > I assume you must be regularly seeing queue depths of 128+ for this
> > code to make a difference, do you see that w/ gstat?
>
> No, we don't see large (128+) queue sizes in the gstat data. The way I see it,
> we don't need a deep queue here. We could just have a steady stream of io
> requests where new, smaller offsets consistently "jump" ahead of older,
> larger ones. In fact, the gstat data show a shallow queue of 5 or fewer items.
Sorry, I misread the patch the first time... After rereading it,
the short summary is that if there hasn't been an ordered bio
(bioq_insert_tail) in the last 128 requests, the next request will be
"ordered"...
> > Also, do you see a similar throughput of the system?
>
> Yes, we see almost identical throughput from both systems. I have not
> pushed the system to its limit yet, but having a much smoother disk read
> service time is important for us because we use it as one of the components
> of our system health metrics. We also need to ensure that disk io requests
> are actually dispatched to the disk in a timely manner.
Per the above, have you measured at the application layer that you are
getting better latency on your reads? Maybe by doing a ktrace
of the io and calculating the time between the read and its return, or
something like that...
Have you looked at the GEOM disk scheduler work that Luigi did a few
years back? There have been known issues w/ our io scheduler for a
long time... If you search the mailing lists, you'll see lots of
reports of some processes starving out others, probably due to a
similar issue... I've seen similar unfair behavior between processes,
but haven't spent the time to track it down...
It does look like a good improvement though...
Thanks for the work!
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "[email protected]"