> On Fri, Nov 16, 2007 at 06:29:45AM -0800, Paul van den Bogaard wrote:
> > Result: had to reboot my T2000 by using the system console and powering
> > it off and on again, after waiting for more than 12 hours to see if it
> > came back.
>
> That's quite surprising -- and distressing. Did you get a dump? Do you
> have any more data from which we might be able to debug the problem?
>
> > Sure, too many probes enabled. However this result is, let's say,
> > "surprising". Indeed very dangerous, since the database can be used in
> > production environments. In my view this is not good.
>
> That doesn't sound like too many probes enabled at all. We've enabled
> millions of probes without incident.
>
> > dtrace -x dynvarsize=64m -x cleanrate=203 -qs lw.d >LW/lw.out
> > dtrace: failed to compile script lw.d: line 374: failed to grab process
> > 14510
>
> This may be that you're hitting the file descriptor limit. Try taking a
> look with truss on the dtrace(1M) process.
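One quick way to check the descriptor-limit theory from a shell (a sketch, not from the original thread; plimit and pfiles are the standard Solaris proc tools, and the pid is the one from the error message above):

```shell
# Soft limit on open file descriptors for the current shell; a dtrace
# invocation that grabs many target processes holds descriptors for each.
ulimit -n

# On Solaris, for a running dtrace(1M) process -- pid 14510 is the one
# from the "failed to grab" error above -- the proc tools show the limit
# and what is actually open:
#   plimit 14510   # per-process resource limits, including nofiles
#   pfiles 14510   # enumerate the open descriptors
```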
I think the processes that needed to be traced exited because they were done. It just took too long to compile (something that includes "attaching to probes").

> > Question: how do I make this workable, meaning speedier in startup,
> > without reducing my attention to just one or two processes?
>
> We'd need to figure out where the system is spending its time -- DTrace
> is a great place to start.
>
> > Question: why is my system so unresponsive (I waited more than 12 hours
> > to see if it came back), forcing me to power off/on to get it alive and
> > kicking again?
>
> Let's work offline to try to sort this out.
>
> > And please note that we are planning to add many, many more probes in
> > the PG source. However if the above is the result I fear great fears ...
>
> You're not enabling that many probes; I don't think that's the cause of
> either problem.
>
> Adam
>
> --
> Adam Leventhal, FishWorks
> http://blogs.sun.com/ahl
> _______________________________________________
> dtrace-discuss mailing list
> [email protected]

Did some more tests, and it looks like there is a relation between the number of probes, the size of shared memory, and time-to-compile.

Using a PostgreSQL 8.3 beta 1 database on a T2000 (32 GB) I varied the size of the shared memory, the number of processes, and the number of probes. The shared memory is sized by changing the number of so-called shared_buffers; each buffer is 8 KB.

Timing is done in a simple way: when the BEGIN probe fires, its action printf's a string, and output is redirected to a file. With a

  while ls -l dtrace.outputfile; do sleep 5; done

I watch the file size; when it changes from 0 to non-zero, I know the BEGIN probe fired. Sure, there are some other things that influence this, but since I expect to be looking at minutes of total time, this is good enough to get a better understanding of what is happening.
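The polling idea above can be sketched as follows (a stand-in writer replaces the actual dtrace run, since that needs a privileged Solaris box; `dtrace.out` is a hypothetical file name):

```shell
# Stand-in for the dtrace run: a background job whose first output appears
# after a delay, like the printf in a D script's BEGIN clause. The
# redirection creates the file empty right away.
rm -f dtrace.out
( sleep 2; echo "BEGIN fired" ) > dtrace.out &

# Poll until the output file goes from empty to non-empty; that moment
# approximates when the BEGIN probe fired.
start=$(date +%s)
while [ ! -s dtrace.out ]; do
  sleep 1
done
echo "BEGIN output seen after $(( $(date +%s) - start ))s"
wait
```

Testing `-s` (file has non-zero size) also terminates cleanly, unlike the bare `while ls -l` loop, which runs until interrupted.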
Results (all times in seconds):

50 probes enabled (5 probes for 10 different processes):

<code>
 N users    100    400   1200
BUFFERS
-----------------------------
 100000      25     30     30
 400000      45     50     50
1200000      95    105    105
</code>

Changing the number of users has very little effect. Changing the size of shared memory definitely has an effect.

300 probes enabled (30 probes for 10 different processes). (Beware: the 400000-buffer and 400-user tests are not present; here I only looked at the "extremes", and added a little to these by increasing the number of buffers too.)

<code>
 N users    100   1200
BUFFERS
----------------------
 100000     125    155
1200000     545    595
1600000     695    730
2000000  memory problems; scan rate shows up in vmstat -p
</code>

1200 probes (30 probes for 40 different processes):

<code>
 N users    100   1200
BUFFERS
----------------------
 100000     475    600
1600000    2710   3055
</code>

These numbers speak for themselves: the long time to compile is related to the shared memory size and the number of probes that are enabled. The remaining question is why.

--
This message posted from opensolaris.org
_______________________________________________
dtrace-discuss mailing list
[email protected]
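P.S. A quick back-of-the-envelope check on the 100-user columns (my own arithmetic on the tables above, not a claim from the thread): the extra compile time per additional shared buffer, divided by the number of enabled probes, is nearly the same in all three tables.

```shell
# Slope of compile time vs. buffer count for the 100-user runs, divided by
# the number of enabled probes. Numbers are copied from the tables above.
awk 'BEGIN {
  printf "  50 probes: %.2e s per probe per buffer\n", (95 - 25)    / (1200000 - 100000) / 50;
  printf " 300 probes: %.2e s per probe per buffer\n", (545 - 125)  / (1200000 - 100000) / 300;
  printf "1200 probes: %.2e s per probe per buffer\n", (2710 - 475) / (1600000 - 100000) / 1200;
}'
```

All three come out at roughly 1.3e-06 seconds, which suggests the compile/attach phase is doing work proportional to the product of enabled probes and shared buffers.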
