Re: [Drizzle-discuss] New Low Hanging Fruit Blueprint - Fast Timers

Stewart Smith Tue, 26 Jan 2010 18:45:31 -0800

On Tue, Jan 26, 2010 at 08:41:02PM -0500, Jay Pipes wrote:
> The work from PeterG and Marc Alff on the my_rdtsc.c file is pretty
> darn good and looks ripe for "libying", C++-ifying, and possibly
> adding to the drizzled tree.

I think rdtsc is not at all a good idea in this modern world.

Here's why:

It's relatively expensive. It's reported to be 150-200 cycles! (which
sounds like nothing until you execute it a *lot*)

It's also inaccurate in SMP (i.e. all current CPUs) when threads
migrate between CPUs (and you can't really detect that).

It's also x86 specific.

http://msdn.microsoft.com/en-us/library/ee417693(VS.85,printer).aspx

Is a pretty good article outlining the problems.

The worst problem? You're always running it. So even when you don't
want to look at the performance data, you're still spending a lot of
cycles getting the time and accounting it.

By their own benchmarks, simply compiling in PERFORMANCE_SCHEMA and
*never* using it gets you a ~4% decrease in SQL performance.

This is pretty easily reproducible if you have a program running a
tight loop and put rdtsc calls around it. You easily notice the
performance degradation.
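
A minimal sketch of that experiment, assuming GCC or Clang (for
__rdtsc() from <x86intrin.h>) and a POSIX clock_gettime(). The names
now_ns, read_ticks, loop_plain and loop_timed are made up for
illustration, and off x86 the sketch falls back to the OS clock, which
is an assumption of the sketch rather than anything from this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Wall-clock nanoseconds, used only to time each loop as a whole. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>              /* __rdtsc() with GCC and Clang */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: no rdtsc off x86, so use the OS clock as a stand-in. */
static inline uint64_t read_ticks(void) { return now_ns(); }
#endif

/* Baseline: a tight loop with no instrumentation. */
static uint64_t loop_plain(uint64_t iters, uint64_t *elapsed_ns)
{
    assert(elapsed_ns != NULL);
    volatile uint64_t acc = 0;      /* volatile keeps the loop alive */
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++)
        acc += i;
    *elapsed_ns = now_ns() - t0;
    return acc;
}

/* Same loop, with every iteration bracketed by two tick reads. */
static uint64_t loop_timed(uint64_t iters, uint64_t *elapsed_ns,
                           uint64_t *ticks)
{
    assert(elapsed_ns != NULL && ticks != NULL);
    volatile uint64_t acc = 0;
    uint64_t total = 0;
    uint64_t t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t before = read_ticks();
        acc += i;
        total += read_ticks() - before;
    }
    *elapsed_ns = now_ns() - t0;
    *ticks = total;
    return acc;
}
```

Calling both with a large iteration count and comparing the two
elapsed_ns values gives a rough feel for the per-iteration overhead on
a given machine.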

So... you think "aha! I'll only enable the calls if my performance
thing is enabled." Turns out the if() statement is *incredibly*
expensive and gains you exactly nothing.

So you certainly don't want to do this a lot around any code that is
performance critical (e.g. mutexes). This is my big disagreement with
something like the MySQL PERFORMANCE_SCHEMA patch.

So I started experimenting. What if we just ran a bunch of
NOOPs instead? Turns out on modern x86 processors, NOOPs are free. I
could not measure any difference in performance when inserting no-ops
into a tight loop.

So you could binary patch in at runtime the rdtsc calls... which gets
messy (but workable).

However, there is a better way! The Linux perf_events subsystem. This
lets userspace have a cross platform interface to both the PMU
(Performance Monitoring Unit) of the CPU (and the kernel will share it
amongst all the processes wanting to use it, as well as tell you how
long you had it for so you can extrapolate results) as well as access
to software events (such as page faults, context switches).
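
As a rough illustration of that interface, here is a sketch that counts
one event around a function call using the perf_event_open(2) syscall
on Linux. The helper names (perf_open_counter, count_event, fake_query)
are hypothetical and this is not the code from the blog post linked
further down. It assumes kernel headers are installed and that the
process is permitted to open counters; if it is not, the work still
runs and the function just reports failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc has no wrapper for perf_event_open, so go through syscall(2). */
static int perf_open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;          /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;    /* userspace only */
    attr.exclude_hv = 1;
    /* pid 0 = calling thread, cpu -1 = any CPU, no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Run fn(arg) and store the event count in *count.
 * Returns 0 on success, -1 if the counter was unavailable.
 * fn always runs, so a measurement failure never skips the work. */
static int count_event(uint32_t type, uint64_t config,
                       void (*fn)(void *), void *arg, uint64_t *count)
{
    assert(fn != NULL && count != NULL);
    *count = 0;
    int fd = perf_open_counter(type, config);
    if (fd < 0) {
        fn(arg);
        return -1;
    }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ssize_t n = read(fd, count, sizeof(*count));
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}

/* Hypothetical stand-in for "execute the query". */
static void fake_query(void *arg)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < 100000; i++)
        acc += i;
    *(int *)arg += 1;           /* record that the work ran */
}
```

Passing PERF_TYPE_HARDWARE with PERF_COUNT_HW_CPU_CYCLES asks for CPU
cycles; PERF_TYPE_SOFTWARE with PERF_COUNT_SW_PAGE_FAULTS or
PERF_COUNT_SW_CONTEXT_SWITCHES asks for the software events mentioned
above through the same call.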

This is what I used (in its infancy) to write my "how many CPU cycles
were spent executing this query" function:

http://www.flamingspork.com/blog/2009/10/10/how-many-cpu-cycles-does-a-sql-query-take-or-pagefaults-caused-or-l2-cache-misses-or-cpu-migrations/

perf_events also gives you the tools to make a profiler, a
profiler that we can turn on and off in software at runtime and which
also can reach down into the kernel and tell you *exactly* where all
the time is being spent.

So instead of wrapping everything we think may take time with rdtsc
calls (and slowing everything down all of the time), we just turn on a
sampling profiler when we want it. If we want to present it a bit
nicer than a backtrace, we can easily map symbols to nice things.

So with a more integrated system of this, we'd probably have an if()
both before and after the query (setup and cleanup)... which I think
we can take the hit of. If the performance stuff is not enabled for
this query, then the impact is only two compares per query (i.e.
effectively nothing).
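
The shape of that could look something like the sketch below. The
struct and function names are purely hypothetical (they are not Drizzle
APIs); the start and stop hooks stand in for whatever would enable and
read the counters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-query profiling hooks; not Drizzle APIs. */
struct query_profiler {
    int enabled;                  /* flipped at runtime, e.g. per session */
    void (*start)(void *state);   /* e.g. enable a perf_events counter */
    void (*stop)(void *state);    /* e.g. disable it and read the result */
    void *state;
};

static void execute_query(const struct query_profiler *prof,
                          void (*run)(void *), void *query)
{
    assert(prof != NULL && run != NULL);
    const int on = prof->enabled; /* snapshot so start/stop always pair up */

    if (on)                       /* compare #1: setup */
        prof->start(prof->state);
    run(query);
    if (on)                       /* compare #2: cleanup */
        prof->stop(prof->state);
}

/* Demo callback: increments the int it is handed. */
static void bump(void *counter)
{
    *(int *)counter += 1;
}
```

With enabled set to 0 neither hook is called and the only added work is
the two branches; nothing is instrumented inside the query itself.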

If it is enabled, then there is (of course) a cost. But by using the
PMU and existing infrastructure, it's likely to a) cost a lot less and
b) get better for everyone as the shared infrastructure improves.

What about non-Linux platforms? Well... on OS X and Solaris you have
DTrace, which should get us most of the same things... and who cares
about other platforms yet?

RDTSC is a "time elapsed" function, not a "time spent executing" one,
so if the load on your machine increases, the time elapsed for queries
likely does too.

The symptoms for "there is huge load on my server" (response time
going up) and "somebody dropped the index that makes this query
fast" would be the same.

We already get wall time for a query. With perf_events (like my
function above) we would then have two numbers that could be used to
differentiate between "machine is under load" and "the way my query is
being executed has changed as it now takes 100x the cycles to complete".
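
To make the two-number idea concrete, here is a small sketch. Note the
swap: instead of a perf_events cycle count it uses the POSIX per-thread
CPU-time clock (CLOCK_THREAD_CPUTIME_ID) as the "time spent executing"
number, which is an assumption of this sketch and not what the function
above does. All names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

static uint64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

struct query_cost {
    uint64_t wall_ns;   /* time elapsed */
    uint64_t cpu_ns;    /* time this thread actually spent on a CPU */
};

static struct query_cost measure(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    struct query_cost c;
    uint64_t wall0 = clock_ns(CLOCK_MONOTONIC);
    uint64_t cpu0 = clock_ns(CLOCK_THREAD_CPUTIME_ID);
    fn(arg);
    c.cpu_ns = clock_ns(CLOCK_THREAD_CPUTIME_ID) - cpu0;
    c.wall_ns = clock_ns(CLOCK_MONOTONIC) - wall0;
    return c;
}

/* Hypothetical "query" that mostly waits, imitating a thread that is
 * kept off the CPU (busy machine, blocked on I/O, and so on). */
static void waiting_query(void *arg)
{
    (void)arg;
    struct timespec nap = { 0, 20 * 1000 * 1000 };  /* 20 ms */
    nanosleep(&nap, NULL);
}
```

A query whose wall_ns grows while cpu_ns stays flat points at load or
waiting; one where both grow together points at the query doing more
work than it used to.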

perf_events is also CPU agnostic. You can get at CPU specific
features, but you also have generic interfaces. So it's easy to be
cross architecture. No #ifdef x86!!!!

--
Stewart Smith

