On Tue, Jan 26, 2010 at 08:41:02PM -0500, Jay Pipes wrote:
> The work from PeterG and Marc Alff on the my_rdtsc.c file is pretty
> darn good and looks ripe for "libying", C++-ifying, and possibly
> adding to the drizzled tree.

I think rdtsc is not at all a good idea in this modern world. Here's why:

It's relatively expensive. It's reported to be 150-200 cycles! (which sounds like nothing until you execute it a *lot*)

It's also inaccurate on SMP (i.e. all current CPUs) when threads migrate between CPUs (and you can't really detect that).

It's also x86-specific.

http://msdn.microsoft.com/en-us/library/ee417693(VS.85,printer).aspx
is a pretty good article outlining the problems.

The worst problem? You're always running it. So even when you don't want to look at the performance data, you're still spending a lot of cycles getting the time and accounting for it. By their own benchmarks, simply compiling in PERFORMANCE_SCHEMA and *never* using it gets you a ~4% decrease in SQL performance.

This is pretty easily reproducible if you have a program running a tight loop and put rdtsc calls around it. You easily notice the performance degradation.

So... you think "aha! I'll only enable the calls if my performance thing is enabled." Turns out the if() statement is *incredibly* expensive and gains you exactly nothing. So you certainly don't want to do this a lot around any code that is performance critical (e.g. mutexes). This is my big disagreement with something like the MySQL PERFORMANCE_SCHEMA patch.

So I started experimenting. What if we just ran a bunch of NOOPs instead? Turns out on modern x86 processors, NOOPs are free. I could not measure any difference in performance when inserting no-ops into a tight loop. So you could binary patch in the rdtsc calls at runtime... which gets messy (but workable).

However, there is a better way! The Linux perf_events subsystem.

This gives userspace a cross platform interface to the PMU (Performance Monitoring Unit) of the CPU; the kernel shares it amongst all the processes wanting to use it and tells you how long you had it for, so you can extrapolate results. It also gives access to software events (such as page faults and context switches).
This is what I used (in its infancy) to build the "how many CPU cycles were spent executing this query" function:
http://www.flamingspork.com/blog/2009/10/10/how-many-cpu-cycles-does-a-sql-query-take-or-pagefaults-caused-or-l2-cache-misses-or-cpu-migrations/

perf_events also gives you the tools to make a profiler, one that we can turn on and off in software at runtime and which can also reach down into the kernel and tell you *exactly* where all the time is being spent.

So instead of wrapping everything we think may take time with rdtsc calls (and slowing everything down all of the time), we just turn on a sampling profiler when we want it. If we want to present it a bit nicer than a backtrace, we can easily map symbols to nice things.

So with a more integrated system of this, we'd probably have an if() both before and after the query (setup and cleanup)... which I think we can take the hit of. If the performance stuff is not enabled for this query, then the impact is only two compares per query (i.e. effectively nothing). If it is enabled, then there is (of course) a cost. But by using the PMU and existing infrastructure, it's likely to a) be a lot less and b) get better for everyone as that shared infrastructure improves.

What about non-Linux platforms? Well... on OS X and Solaris you have DTrace, which should get us most of the same things... and who cares about other platforms yet?

RDTSC is a "time elapsed" function, not a "time spent executing" one, so if the load on your machine increases, so likely does the time elapsed for queries. The symptoms for "there is huge load on my server" (response time getting worse) and "somebody dropped the index that makes this query fast" would be the same.

We already get wall time for a query; with perf_events (like my function above) we could then get 2 numbers that could be used to differentiate between "machine is under load" and "the way my query is being executed has changed as it now takes 100x the cycles to complete".

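To make the "if() before and after the query" idea concrete, here is a minimal sketch in C of counting user-space CPU cycles for one query with perf_events. It is an illustration under assumptions, not the code from the blog post above: the names query_counter, query_counter_start and query_counter_stop are hypothetical, and it assumes a Linux kernel with perf_events where the process is permitted to open a counter. glibc has no wrapper for perf_event_open, so the sketch goes through syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-query state; fd is -1 whenever nothing is being counted. */
struct query_counter {
    int fd;
};

/* Setup, called before the query. Costs one compare when disabled. */
static void query_counter_start(struct query_counter *qc, int enabled)
{
    struct perf_event_attr attr;

    assert(qc != NULL);
    qc->fd = -1;
    if (!enabled)                       /* compare #1 */
        return;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;     /* generic, not CPU-model specific */
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                  /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;            /* count user-space cycles only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: follow the calling thread across CPUs. */
    qc->fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (qc->fd < 0) {                   /* no PMU, or not permitted */
        qc->fd = -1;
        return;
    }
    ioctl(qc->fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(qc->fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Cleanup, called after the query. Returns 0 and fills *cycles if a
 * counter was running, -1 otherwise. Costs one compare when disabled. */
static int query_counter_stop(struct query_counter *qc, uint64_t *cycles)
{
    ssize_t n;

    assert(qc != NULL && cycles != NULL);
    if (qc->fd < 0)                     /* compare #2 */
        return -1;

    ioctl(qc->fd, PERF_EVENT_IOC_DISABLE, 0);
    n = read(qc->fd, cycles, sizeof(*cycles));
    close(qc->fd);
    qc->fd = -1;
    return n == (ssize_t)sizeof(*cycles) ? 0 : -1;
}
```

The intended use would be to call query_counter_start() just before executing a query and query_counter_stop() just after, then report the cycle count next to the wall time we already have. Opening a counter per query is the simplest thing that works for a sketch; a real integration would more likely keep the fd around per thread and only reset and enable it.
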
perf_events is also CPU agnostic. You can get at CPU specific features, but you also have generic interfaces. So it's easy to be cross architecture. No #ifdef x86!!!!

-- 
Stewart Smith

