Paul McCullagh wrote:
On Aug 7, 2009, at 10:39 AM, Stewart Smith wrote:

On Wed, Aug 05, 2009 at 03:13:34PM +1000, Arjen Lentz wrote:
The issue in MySQL has been the overhead of such instrumentation, particularly when it is not being used. Some instrumentation causes a 5-20% performance loss, which is unacceptable.

110% agree.

If you're not doing analysis of anything, it shouldn't cost you.

You also shouldn't have to restart, rebuild or anything like that.

I think I know how to do this too.


I have this inkling that it's the "if(profiling_enabled)" inserted
everywhere that kills us.

This is pretty easy to check. Say we have some function f() that is
going to do some counting for us (e.g. number of rows fetched, number of
times mutex X was taken). If profiling is disabled, we want this to use
0 CPU.

Calling an empty function int f(int) a billion times in a loop is roughly equivalent to just running through the loop (yes, I built with gcc -O0 and checked the produced code). By roughly, I mean the difference is next to impossible to measure.

If you add a simple "if(x) something;" to the function f(), it is noticeably slower (roughly 20% in my tests)!

So we really don't want to do that compare.
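
For concreteness, a rough sketch of the kind of microbenchmark being described here (the function names and loop are illustrative, not Stewart's actual test): build with -O0, time the loop with f_empty(), then swap in f_checked() and time it again.

#include <cstdio>

static int profiling_enabled= 0;
static long counter= 0;

/* The "empty" hook: at -O0, a billion calls are barely measurable. */
__attribute__((noinline)) void f_empty(void)
{ }

/* The same hook with the check added: noticeably slower in a tight loop. */
__attribute__((noinline)) void f_checked(void)
{
  if (profiling_enabled)
    counter++;
}

int main(void)
{
  for (long i= 0; i < 1000000000L; i++)
    f_empty();                      /* swap in f_checked() to compare */
  printf("%ld\n", counter);
  return 0;
}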

Now... about this time somebody is going to jump up and suggest using DTrace to insert code at runtime. It's not available on Linux, so it's worse than useless here.

But we can do some cool self modifying code tricks.

The same do-nothing f() does not take any longer to run if we insert a few no-ops. (I tried inserting 4 NOP instructions, which are single byte... I do wonder if the multi-byte NOP instruction could help here too.)

So... when a profile hook is enabled, we just modify f() to call the
real profiling function. This can either be done with an atomic
instruction writing out the appropriate CALL instruction, or we can put
in a small JMP around the NOPs as we fill it out.
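
A minimal sketch of that idea, assuming x86-64 Linux and GCC (the hook names, the 5-byte NOP pad, and the mprotect() dance are my illustration, not actual Drizzle code). The always-present hook is a tiny function containing only NOPs; enabling profiling overwrites its first five bytes with a JMP to the real counting function, whose own RET returns straight to the original caller.

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

static uint64_t rows_fetched= 0;

static void real_hook(void)            /* the real profiling function */
{
  rows_fetched++;
}

/* The hook that is always called: just NOPs and a return, so it costs
   almost nothing while profiling is disabled. */
extern "C" __attribute__((noinline)) void hook_rows_fetched(void)
{
  asm volatile("nop; nop; nop; nop; nop");
}

void enable_hook(void)
{
  unsigned char *target= reinterpret_cast<unsigned char *>(&hook_rows_fetched);
  long page_size= sysconf(_SC_PAGESIZE);
  void *page= reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(target) &
                                       ~static_cast<uintptr_t>(page_size - 1));

  /* Code pages are read+execute; make them writable while we patch
     (two pages, in case the five bytes straddle a page boundary). */
  mprotect(page, 2 * page_size, PROT_READ | PROT_WRITE | PROT_EXEC);

  /* Overwrite the start of the hook with "JMP rel32" to real_hook().
     A production version would make this write atomic, as noted above. */
  int32_t rel= static_cast<int32_t>(reinterpret_cast<uintptr_t>(&real_hook) -
                                    (reinterpret_cast<uintptr_t>(target) + 5));
  unsigned char jmp[5]= { 0xE9 };
  memcpy(jmp + 1, &rel, 4);
  memcpy(target, jmp, 5);

  mprotect(page, 2 * page_size, PROT_READ | PROT_EXEC);
}

Calls to hook_rows_fetched() cost only the NOPs until enable_hook() runs; after that, each call lands in real_hook() and bumps the counter. Note that the opcodes, patch size and atomicity rules are all architecture-specific, which is exactly the objection raised below.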


And there are a number of tricks to do this pretty easily for all the possible points to hook in profiling stuff.

Modifying code is an option, but at the same time it is quite a hack. A major disadvantage is that it has to be done for each supported hardware architecture.

Exactly, and I believe it's a non-starter approach for that reason alone.

I have another suggestion, which I have found works well for PBXT (http://pbxt.blogspot.com/2008/12/xtstat-tells-you-exactly-what-pbxt-is.html). A simple increment is a very cheap operation, as long as it can be done without requiring a lock.

This is essentially what we already have with the current sys_var system for thread-local data, which is "merged" upon Session::cleanup().

(And, if you are just doing an increment, then you don't have to bother with an if(profiling_enabled); you just do the increment all the time.)

++

To avoid locking, each thread needs a complete set of tracking variables (counters) as part of its THD structure.

s/THD/Session

Also, you must understand that there is no one-to-one thread-to-Session guarantee.

Because Sessions may be executed in a thread pool, there must be a way of either:

a) Merging Session-local stats into the global system variables structure upon Session destruction, or upon rescheduling via a scheduling thread. Currently this operation does not acquire a lock around the global system variables in the Session destructor:

Session::~Session()
{
...
  add_to_status(&global_status_var, &status_var);
...
}

void add_to_status(STATUS_VAR *to_var, STATUS_VAR *from_var)
{
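  /* Sum every ulong counter in from_var, from the start of STATUS_VAR up to
     and including last_system_status_var, into the matching field of to_var. */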
  ulong *end= (ulong*) ((unsigned char*) to_var +
                        offsetof(STATUS_VAR, last_system_status_var) +
                        sizeof(ulong));
  ulong *to= (ulong*) to_var, *from= (ulong*) from_var;

  while (to != end)
    *(to++)+= *(from++);
}

I don't know if this critical section was deliberately left unprotected by LOCK_status or not... still looking into this. Also, MontyT is completely redesigning the system variables system, so the above "bookmarking" code will likely not look the same in a few weeks.

b) Alternatively, the Session's local status variables need to be persisted to a system table in a row-level-locking storage engine, using the standard write_row() call of the storage engine interface. Stewart is currently working on this (see his i_s storage engine branches...)

Either way, you incur locking and instruction costs. These costs have been deemed too high by MySQL engineering for the hundreds (thousands?) of metrics that the MySQL performance schema monitors (or is able to monitor), presumably because the frequency of certain events in the performance schema is quite high.

The profiling code pays the price for this. In order to get the current state of all counters it goes through the list of THDs and accumulates the THD related counters.

But, this is OK, because this price is only paid when you are actually profiling.

Agreed in principle, yes.
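
As a sketch of the shape this takes (hypothetical names; the real structures are Session and STATUS_VAR as quoted above): each session owns its own counter block and increments it unconditionally on the hot path, and only a profiling client pays for walking the session list.

#include <cstdint>
#include <list>
#include <mutex>

struct SessionCounters              /* stand-in for STATUS_VAR */
{
  uint64_t rows_fetched;
  uint64_t bytes_written;
};

struct MockSession                  /* stand-in for Session */
{
  SessionCounters counters;         /* written only by the owning thread */
};

static std::list<MockSession *> session_list;
static std::mutex session_list_lock;

/* Hot path: no lock, no if(profiling_enabled), just an increment. */
inline void count_row_fetched(MockSession *s)
{
  s->counters.rows_fetched++;
}

/* Cold path: only the profiler pays this cost.  Reads of other threads'
   counters are plain (non-atomic) loads; slightly stale values are fine
   for statistics, which is the trade-off being discussed here. */
SessionCounters snapshot_counters(void)
{
  SessionCounters total= { 0, 0 };
  std::lock_guard<std::mutex> guard(session_list_lock);
  for (MockSession *s : session_list)
  {
    total.rows_fetched+= s->counters.rows_fetched;
    total.bytes_written+= s->counters.bytes_written;
  }
  return total;
}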

This method not only works for things like "number of bytes written", but can also be used to measure time. There is a little trick involved here, but the result is that you can see, for example, if the server is hanging in a fsync() call in real time.
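
Paul doesn't spell the trick out here, but one plausible scheme (roughly how I understand xtstat to work; the names below are mine) is to record a start timestamp when a thread enters a timed call, and have the reader add the elapsed time of any call still in progress. That is what makes a currently-hanging fsync() visible before it returns.

#include <cstdint>
#include <sys/time.h>

struct FsyncTimer                   /* hypothetical per-thread timing slot */
{
  uint64_t accumulated_us;          /* time spent in completed fsync() calls */
  uint64_t start_us;                /* 0 when idle, entry time while inside */
};

static uint64_t now_us(void)
{
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return static_cast<uint64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
}

/* Owning thread, wrapped around the real fsync() call: */
inline void fsync_begin(FsyncTimer *t) { t->start_us= now_us(); }
inline void fsync_end(FsyncTimer *t)
{
  t->accumulated_us+= now_us() - t->start_us;
  t->start_us= 0;
}

/* Reader (e.g. a "drizzlestat"-style client): include the in-progress call
   so a hung fsync() shows up immediately, not only after it completes. */
inline uint64_t fsync_time_us(const FsyncTimer *t)
{
  uint64_t total= t->accumulated_us;
  if (t->start_us != 0)
    total+= now_us() - t->start_us;
  return total;
}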

Then we should create a kind of "drizzlestat" program which SELECTs the current counter values, and displays the statistics in columns.

Before this is possible, an API into the performance data counters must be written. I don't want programs willy-nilly accessing internal kernel and storage engine data without going through a proper interface...we're trying to move away from that sort of thing :)

This is much better than dumping loads of performance schema tables on a user and saying, "the data is there if you need it."

Agreed.

I am also not a believer in gathering statistics on everything (for example, every semaphore), and letting the user figure out what is important.

OK, sure, but what if you don't already know whether the cause of your slowdown is a mutex or semaphore, and want to find that out?

As the developers, we need to decide which parameters are performance-critical, and just provide those statistics. Of course, statistics can be added later if we see we have missed something. But better that than a whole bunch of irrelevant values that make finding a problem like looking for a needle in a haystack.

Agreed, but see point above...

Marc Alff took an approach that causes almost no overhead if the performance schema is not *compiled in*. There is an overhead if the performance schema is compiled in and the DBA is not careful to specify only those things she is interested in.

I'd love to find a happy medium between Marc's approach (which nicely NOOPs the performance schema code behind #define macros when it is not compiled in) and your point above about not automatically gathering statistics on every piece of data.
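
For reference, a minimal sketch of the compile-time NOOPing being described (the build flag and macro names are illustrative, not the actual performance schema macros):

#ifdef WITH_PERFSCHEMA
#  define COUNT_ROW_FETCHED(session) count_row_fetched(session)
#else
#  define COUNT_ROW_FETCHED(session) do {} while (0)
#endif

When the performance schema is compiled out, every instrumentation point collapses to nothing at compile time; when it is compiled in, the cost depends on what the DBA enables, which is the trade-off described above.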

Cheers,

-jay
