[dtrace-discuss] Struggling with dynvar space: how big should it become?

Paul van den Bogaard Thu, 28 Feb 2008 06:00:42 -0800

Trying to get a script to run for an hour. Up till now every attempt failed 
before the hour was over. Latest command was:


dtrace  -q -x dynvarsize=16g -x cleanrate=400  -s /DTrace/newstr2.d

growing that dynvarsize from 128m, 25, 512 and doubling. My system "only" has 
64GB of RAM and still dtrace produces many of the following:

dtrace: 1255 dynamic variable drops with non-empty rinsing list
dtrace: 37907 dynamic variable drops with non-empty dirty list
dtrace: 701859 dynamic variable drops
dtrace: 1279 dynamic variable drops with non-empty rinsing list
dtrace: 43404 dynamic variable drops with non-empty dirty list
dtrace: 711953 dynamic variable drops
dtrace: 3990 dynamic variable drops with non-empty rinsing list
dtrace: 42116 dynamic variable drops with non-empty dirty list
dtrace: 170163 dynamic variable drops
dtrace: 9 dynamic variable drops with non-empty rinsing list
dtrace: 8591 dynamic variable drops with non-empty dirty list

With the 16g setting it takes about 20 minutes before the first of these 
messages is produced. After that it is producing them continuously.

I am "only" running a 384 process setup with PostgreSQL and hope to be abel to 
do one with 1280 processes. The DTrace script in use references a couple of 
"home grown" USDT probes. The provider definition is as follows:

provider postgresql {

probe transaction__start(int);
probe transaction__commit(int);
probe transaction__abort(int);
probe lock__startwait(int, int);
probe lock__endwait(int, int);

probe lwlock__acquire(int, int);
probe lwlock__acquire__enter(int, int);
probe lwlock__acquire__return(int, int);
probe lwlock__release(int);
probe lwlock__release__enter(int);
probe lwlock__release__return(int, int);
probe lwlock__startwait(int, int, int, int);
probe lwlock__endwait(int, int, int);
probe lwlock__condacquire(int, int);
probe lwlock__condacquire__fail(int, int);

probe getsnapshotdata(int, int, int, int, int);
probe ready_for_query_start();
probe ready_for_query_end();
probe ignore_till_sync();
probe handle_command_start(int);
probe handle_command_end(int);

};

The failing script:

self int handleCommand;
self int readyForQuery;
self char commandType;
self int commandTypeInitialized;

BEGIN { started = timestamp; }

postgresql*:postgres::ready_for_query_start
{
        self->readyForQuery = timestamp;
}
 
postgresql*:postgres::ready_for_query_end
/self->readyForQuery /
{
        @count["readyForQuery", '-'] = count();
        @elapsed["readyForQuery", '-'] = sum(timestamp - self->readyForQuery);
        self->readyForQuery = 0;
}
 
postgresql*:postgres::ignore_till_sync
{
        @count["ignore_till_sync", '-'] = count();
}
 
postgresql*:postgres::handle_command_start
{
        self->commandTypeInitialized = 1;
}
 
postgresql*:postgres::handle_command_start
/arg0 == 0/
{
        self->handleCommand = 0;
        @count["zero",'-'] = count();
}
 
postgresql*:postgres::handle_command_start
/arg0 != 0/
{
        self->handleCommand = timestamp;
        self->commandType = (char)arg0;
}
 
postgresql*:postgres::handle_command_end
/self->handleCommand/
{[/b]
        @count["handleCommand", arg0] = count();
        @elapsed["handleCommand", arg0] = sum(timestamp - self->handleCommand);
        self->handleCommand = 0;
        self->commandType = '-';
}


postgresql*:postgres:LWLockAcquire:lwlock-acquire
{       
        self->ts[arg0,arg1]=timestamp;
        self->mode[arg0] = arg1;
}       
        
postgresql*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 < 25 && self->ts[arg0,self->mode[arg0] ] /
{       
        this->mode = self->mode[arg0];
        @lockheldCount[arg0,this->mode,self->commandType] = count();
        @lockheldTime[arg0,this->mode,self->commandType] = sum (timestamp - 
self->ts[arg0,this->mode]);
        self->ts[arg0,this->mode] = 0;
}       
        
postgresql*:postgres:LWLockRelease:lwlock-release
/self->commandTypeInitialized && arg0 >= 25 && self->ts[arg0,self->mode[arg0]] /
{       
        this->mode = self->mode[arg0];
        @lockheldCount[25,this->mode,self->commandType] = count();
        @lockheldTime[25,this->mode,self->commandType] = sum (timestamp - 
self->ts[arg0,this->mode]);
        self->ts[arg0,this->mode] = 0;
}       
        
postgresql*:postgres:LWLockAcquire:lwlock-startwait
/arg0 < 25/
{       
        self->tsw[arg0,arg1] = timestamp;
        @queueLengthStartwaitMax[ arg0,arg1 ] = max(arg2);
        @queueLengthStartwaitAvg[ arg0,arg1 ] = avg(arg2);
        @queueLengt2StartwaitMax[ arg0,arg1 ] = max(arg3);
        @queueLengt2StartwaitAvg[ arg0,arg1 ] = avg(arg3);
}       
        
postgresql*:postgres:LWLockAcquire:lwlock-startwait
/arg0 >= 25/
{       
        self->tsw[25,arg1] = timestamp;
        @queueLengthStartwaitMax[25,arg1 ] = max(arg2);
        @queueLengthStartwaitAvg[25,arg1 ] = avg(arg2);
        @queueLengt2StartwaitMax[25,arg1 ] = max(arg3);
        @queueLengt2StartwaitAvg[25,arg1 ] = avg(arg3);
}       
postgresql*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 < 25)/
{      
        @lockwaitCount[arg0,arg1,self->commandType] = count();
        @lockwaitTime[arg0,arg1,self->commandType] = sum(timestamp - 
self->tsw[arg0,arg1]);
        @cycleCountMax[arg0, arg1] = max(arg2);
        @cycleCountAvg[arg0, arg1] = avg(arg2);
        self->tsw[arg0,arg1] = 0;
}                  
                   
postgresql*:postgres:LWLockAcquire:lwlock-endwait
/self->commandTypeInitialized && self->tsw[arg0,arg1] && (arg0 >= 25)/
{      
        @lockwaitCount[25,arg1,self->commandType] = count();
        @lockwaitTime[25,arg1,self->commandType] = sum(timestamp - 
self->tsw[arg0,arg1]);
        self->tsw[arg0,arg1] = 0;
        @cycleCountMax[25, arg1] = max(arg2);
        @cycleCountAvg[25, arg1] = avg(arg2);
}      
       
postgresql*:postgres:LWLockAcquire:lwlock-acquire-enter
{      
        self->acqtime[arg0,arg1] = timestamp;
}

postgresql*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0,arg1]/
{      
        @lockAcquireCount[ arg0,arg1,self->commandType ] = count();
        @lockAcquireTime[arg0,arg1,self->commandType ] = sum( timestamp - 
self->acqtime[arg0,arg1]);
        self->acqtime[arg0,arg1] = 0;
}

postgresql*:postgres:LWLockAcquire:lwlock-acquire-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[arg0,arg1]/
{      
        @lockAcquireCount[ 25,arg1,self->commandType ] = count();
        @lockAcquireTime[25,arg1,self->commandType ] = sum( timestamp - 
self->acqtime[arg0,arg1]);
        self->acqtime[arg0,arg1] = 0;
}      
       
postgresql*:postgres:LWLockRelease:lwlock-release-enter
/arg0 < 25/
{
        self->acqtime[arg0,self->mode[arg0] ] = timestamp;
}      
       
postgresql*:postgres:LWLockRelease:lwlock-release-enter
/arg0 >= 25/
{      
        self->acqtime[arg0,self->mode[25] ] = timestamp;
}
 
postgresql*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 < 25 && self->acqtime[arg0, 
self->mode[arg0] ]/
{
        this->mode = self->mode[arg0];
        @lockReleaseCount[ arg0, this->mode,self->commandType ] = count();
        @lockReleaseTime[arg0, this->mode,self->commandType ] = sum( timestamp 
- self->acqtime[arg0, this-
>mode]);
        self->acqtime[arg0, this->mode] = 0;
        self->mode[arg0] = 0;
}

postgresql*:postgres:LWLockRelease:lwlock-release-return
/self->commandTypeInitialized && arg0 >= 25 && self->acqtime[25, 
self->mode[25]]/
{
        this->mode = self->mode[25];
        @lockReleaseCount[25, this->mode,self->commandType] = count();
        @lockReleaseTime[25, this->mode,self->commandType ] = sum( timestamp - 
self->acqtime[25, this->mod
e]);
        self->acqtime[25, this->mode] = 0;
        self->mode[25] = 0;
}


:::tick-3600s { exit(0); }


:::END {
        printf("WALL: %d millisec\n", ((timestamp - started)/1000000) );

                                              printa("CH %2d %1d %2c [EMAIL 
PROTECTED]",@lockheldCount);
        normalize(@lockheldTime,1000000);     printa("TH %2d %1d %2c [EMAIL 
PROTECTED]",@lockheldTime);

                                              printa("CA %2d %1d %2c [EMAIL 
PROTECTED]", @lockAcquireCount);
        normalize(@lockAcquireTime, 1000000); printa("TA %2d %1d %2c [EMAIL 
PROTECTED]", @lockAcquireTime);

                                              printa("CR %2d %1d %2c [EMAIL 
PROTECTED]", @lockReleaseCount);
        normalize(@lockReleaseTime, 1000000); printa("TR %2d %1d %2c [EMAIL 
PROTECTED]", @lockReleaseTime);

                                              printa("CW %2d %1d %2c [EMAIL 
PROTECTED]",@lockwaitCount);
        normalize(@lockwaitTime,1000000);     printa("TW %2d %1d %2c [EMAIL 
PROTECTED]",@lockwaitTime);

        printa("QM %2d %1d [EMAIL PROTECTED]", @queueLengthStartwaitMax);
        printa("QA %2d %1d [EMAIL PROTECTED]", @queueLengthStartwaitAvg);

        printa("QM2 %2d %1d [EMAIL PROTECTED]", @queueLengt2StartwaitMax);
        printa("QA2 %2d %1d [EMAIL PROTECTED]", @queueLengt2StartwaitAvg);

        printa("CM %2d %1d [EMAIL PROTECTED]", @cycleCountMax);
        printa("CV %2d %1d [EMAIL PROTECTED]", @cycleCountAvg);

        normalize(@elapsed, 1000000);
        printa("C %10s  %2c  [EMAIL PROTECTED]", @count);
        printa("T %10s  %2c  [EMAIL PROTECTED]", @elapsed);
}


this might seem a little overwhelming but once some things are  known it should 
not be too bad :-)

The following lwlock xxxxx probes 


probe lwlock__acquire(int, int);
probe lwlock__acquire__enter(int, int);
probe lwlock__acquire__return(int, int);
probe lwlock__startwait(int, int, int, int);
probe lwlock__endwait(int, int, int);

all have at least two parameters: in all these cases arg0 is an id for a lock 
being used and arg1 is a mode: a value of either 0 or 1 (2 different values).
Lockid's < 25 are special. All others are only of interest when it comes to 
their combined "strength". 

The following probes fire during a release function call where the lock with 
given id (arg0) is freed.


probe lwlock__release(int);
probe lwlock__release__enter(int);
probe lwlock__release__return(int, int);

Since the mode is not passed we need a self->mode in the script to recall what 
it was during the lifetime of request/wait-for-it/use/release. BTW the lw part 
stands for LeightWeight . And indeed that is wat these locks are supposed to 
be. Very comparable to what Oracle calls latches.

The other probes used are to set context. Each database session waits on a 
socket for a new request to arrive. This is handled in a state machine.  The 
normal parts of SQL processing can be seen. In the handleCommand a  character 
representing the next command/state is passed: E:xecute; P:arse; B:ind; are the 
ones that I see 99.9999999% of the time. The others are (to me) of no 
importance at this moment. 
I would like to see how many lwlock waits are out there (count and time) for 
each state/command and ofcourse also know how long a parse/bind/execute takes.

I went over the script over and over and do not see any vars that should have 
been null-ed. commandType is always set: it defines the context in which a 
measurement should be placed.

Any clues how to make this work are very welcome.

Thanks
Paul

BTW would also appreciate some feedback on how to format things so it would be 
a little more readable. The indentation is in the source, but does not show up 
in the preview and the final post.


--
This message posted from opensolaris.org
_______________________________________________
dtrace-discuss mailing list
[email protected]

[dtrace-discuss] Struggling with dynvar space: how big should it become?

Reply via email to