Is it possible that speculation buffers are per CPU? I ran a program
that reads data from a socket and mostly juts sits in pollsys. When I
ran the program I enclosed, it often did not print out anything, even
though I know it went into pollsys. I even changed the tests to print
out all of the speculation buffers without regard to the setting of
the ts variable, and I still did not see anything, but I almost always
had indices where the ts variable was non-zero but the commit didn't
print anything. I thought maybe the speculation buffer might need to
be on the samne CPU or something.
This is really a problem, I have to be sure I print out the hung
threads. It takes a while for the processes and I can't keep going
back and saying, "well, that script didn't get it, try this one."
Brian Utterback wrote:
> A while back, I asked how to detect speculations that didn't complete.
> At that time, I was told that the return value of speculation() is
> simply an integer in the range from 1 up to the value of nspec, and so
> you could tap into any speculation by using an integer.
>
> Here was and is my problem. I have a customer who have a call to the
> connect syscall that hangs and never returns. This is using UNIX
> domain sockets, so that really shouldn't happen at all. I am trying to
> find out what is different about the threads that hang. So, I create a
> speculation and trace the thread. The only thing is, how do I commit
> the speculation based on something that doesn't happen?
>
> The suggestion at the time was to place a timestamp into an array when
> the call starts, and then use the tick-100hz trick to iterate over the
> array, looking for timestamps above some threshold in the past. When I
> found one, use the index as the speculation number and commit it.
>
> That sounded good, but it didn't work. I thought maybe I was getting
> too clever with the tick trick and all that, so based on a message
> earlier this week, I though about simplifying the whole thing using
> END probes and unrolling the loop. You see, the customer can see the
> hung processes, so I figured if I committed all of the outstanding
> speculations when the DTrace script is interrupted, that would be just
> as good and much simpler, less chance something could go wrong.
>
> So, just to test it out before giving it to my customer, I wrote a
> script that uses the pollsys command and traces the kernel threads
> that are in pollsys at any given moment, and also it prints a stack
> trace when it goes into the resume call (which is the last kernel
> function called in pollsys when it blocks).
>
> I wrote a macro to print out each speculation, along with how many
> milliseconds it had been in the pollsys call. I stored the timestamp
> for each speculation for when it was created, and then printed out the
> difference when the script is interrupted. I used a this variable to
> iterate through the list of speculations.
>
> It seems to work, except for some reason I get output for speculations
> with no speculation buffer. Here is the script:
>
> #!/usr/sbin/dtrace -Cs
>
> #pragma D option nspec=16
>
> #define printspec END \
> /ts[this->spec]/ \
> { \
> speculate(this->spec); \
> trace(this->spec); \
> trace((timestamp-ts[this->spec])/1000000); \
>
> trace("\n===================================================\n\n"); \
> } \
> END \
> /ts[this->spec]/ \
> { \
> commit(this->spec); \
> this->spec++; \
> }
>
> long ts[17];
>
> syscall::pollsys:entry
> {
> self->spec = speculation();
> speculate(self->spec);
> ts[self->spec] = timestamp;
> trace(self->spec);
> trace(tid);
> trace(pid);
> trace(execname);
> }
>
> fbt::resume:entry
> /self->spec/
> {
> speculate(self->spec);
> stack();
> }
>
> syscall::pollsys:return
> /self->spec/
> {
> ts[self->spec] = 0;
> discard(self->spec);
> self->spec = 0;
> }
>
> END{
> trace("end");
> trace(timestamp);
> this->spec = 1;
> }
>
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
> printspec
>
>
> And here is the output after 15 seconds:
>
> dtrace: script './test1.d' matched 36 probes
> ^C
> CPU ID FUNCTION:NAME
> 1 2 :END end 2071629828966800
> 1 7359 pollsys:entry 1 1 19878 snmpd
> 1 11509 resume:entry
> genunix`cv_timedwait_sig+0x160
> genunix`cv_waituntil_sig+0x84
> genunix`poll_common+0x498
> genunix`pollsys+0xcc
> genunix`dtrace_systrace_syscall+0xa8
> unix`syscall_trap+0xac
>
> 1 2 :END 1 675
> ===================================================
>
>
> 1 7359 pollsys:entry 2 1 550 sendmail
> 1 11509 resume:entry
> genunix`cv_timedwait_sig+0x160
> genunix`cv_waituntil_sig+0x84
> genunix`poll_common+0x498
> genunix`pollsys+0xcc
> genunix`dtrace_systrace_syscall32+0xa8
> unix`syscall_trap32+0xcc
>
> 1 2 :END 2 4748
> ===================================================
>
>
> 1 7359 pollsys:entry 3 1 456 dhcpagent
> 1 11509 resume:entry
> genunix`cv_timedwait_sig+0x160
> genunix`cv_waituntil_sig+0x84
> genunix`poll_common+0x498
> genunix`pollsys+0xcc
> genunix`dtrace_systrace_syscall32+0xa8
> unix`syscall_trap32+0xcc
>
> 1 2 :END 3 818
> ===================================================
>
>
> 1 7359 pollsys:entry 4 12 911 java
> 1 11509 resume:entry
> genunix`cv_timedwait_sig+0x160
> genunix`cv_waituntil_sig+0x84
> genunix`poll_common+0x13c
> genunix`pollsys+0xcc
> genunix`dtrace_systrace_syscall32+0xa8
> unix`syscall_trap32+0xcc
>
> 1 2 :END 4 29
> ===================================================
>
>
> 1 7359 pollsys:entry 5 14 911 java
> 1 11509 resume:entry
> genunix`cv_timedwait_sig+0x160
> genunix`cv_waituntil_sig+0x84
> genunix`poll_common+0x13c
> genunix`pollsys+0xcc
> genunix`dtrace_systrace_syscall32+0xa8
> unix`syscall_trap32+0xcc
>
> 1 2 :END 5 9163
> ===================================================
>
>
> 1 2 :END 6 1686
> ===================================================
>
>
>
> See the problem? In this case speculations 5 and 6 both have non-zero
> ts array entries (or they would not have printed) but there is
> nothing in the speculation buffer (otherwise there would have been the
> first set of traces printed). Any ideas why this is happening?
--
blu
There are two rules in life:
Rule 1- Don't tell people everything you know
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
_______________________________________________
dtrace-discuss mailing list
[email protected]