Re: [dtrace-discuss] CPU dispatcher and buffer questions
I'm not sure I understand what is being asked here, but I'll take a shot... Note it is virtually impossible to write a piece of software that is guaranteed to have sufficient space to buffer a given amount of data when the rate and size of the data flow is unknown. This is one of the robustness features of DTrace - it's smart enough to know that, and smart enough to let the user know when data cannot be buffered.

Yes, buffers are allocated per-CPU. There are several buffer types, depending on the dtrace invocation. Minimally, principal buffers are allocated per CPU when a DTrace consumer (dtrace(1M)) is executed. Read:

http://wikis.sun.com/display/DTrace/Buffers+and+Buffering

The self->read syntax describes a thread-local variable, one of several variable types available in DTrace. It defines the variable scope - each kernel thread that's on the CPU when the probe(s) fires will have its own copy of a self-> variable.

There is only one kernel dispatcher, not one per CPU. There are per-CPU run queues managed by the dispatcher.

As for running a DTrace script for hours/days/weeks, I have never been down that road. It is theoretically possible of course, and seems to be a good use of speculative buffers or a ring buffer policy. We cannot guarantee it will execute without errors (dynamic variable drops, etc). We can guarantee you'll know when errors occur. How can such guarantees be made with a dynamic tool like DTrace? Does your customer know up-front how much data will be traced/processed/consumed, and at what rate?

Read this:

http://blogs.oracle.com/bmc/resource/dtrace_tips.pdf

Thanks
/jim

On Jul 1, 2011, at 9:30 AM, Scott Shurr wrote:

Hello, I have a customer who has some dtrace questions. I am guessing that someone knows the answer to these, so I am asking here. Here are the questions:

**
In this document, we will describe how we assume that DTrace uses its memory. Most assumptions result from [1]. We want these assumptions to be validated by a DTrace expert from Oracle. This validation is necessary to provide us confidence that DTrace can execute for a long period of time (on the order of weeks) along with our software, without introducing errors due to e.g. "dynamic variable drops". In addition, we describe a problem we experience with our DTrace script, for which we want to have support from you.

[1] Sun Microsystems Inc., "Solaris Dynamic Tracing Guide", September 2008.

Quotes from the Solaris Dynamic Tracing Guide [1], with interpretation:

• "Each time the variable self->read is referenced in your D program, the data object referenced is the one associated with the operating system thread that was executing when the corresponding DTrace probe fired."
  Interpretation: Per CPU there is a dispatcher that has its own thread, when it executes the sched:::on-cpu and sched:::off probes.

• "At that time, the ring buffer is consumed and processed. dtrace processes each ring buffer in CPU order. Within a CPU's buffer, trace records will be displayed in order from oldest to youngest."
  Interpretation: There is a principal buffer per CPU.

1) Impact on Business
We have a number of assumptions that we would like to verify about DTrace.

2) What is the OS version and the kernel patch level of the system?
SunOS nlvdhe321 5.10 Generic_141444-09 sun4v sparc SUNW,T5240

3) What is the Firmware level of the system?
SP firmware 3.0.10.2.b
SP firmware build number: 56134
SP firmware date: Tue May 25 13:02:56 PDT 2010
SP filesystem version: 0.1.22
**

Thanks

Scott Shurr | Solaris and Network Domain, Global Systems Support
Email: scott.sh...@oracle.com
Phone: 781-442-1352
Oracle Global Customer Services
Log, update, and monitor your Service Request online using My Oracle Support
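[To make the self-> variable scope discussion above concrete, here is a minimal illustrative sketch - my own construction, not from the thread. Each thread that goes on-CPU gets its own private copy of self->ts, so threads firing the probes concurrently on different CPUs never collide:

#!/usr/sbin/dtrace -s
/*
 * Illustrative only: every thread gets a private copy of self->ts,
 * so per-thread on-CPU time can be summed without collisions.
 */
sched:::on-cpu
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @oncpu[pid, tid] = sum(timestamp - self->ts);
        self->ts = 0;
}
]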
[dtrace-discuss] DTrace Book - Deal Of The Day
(I know, I know - shameless). FYI - ebook deal of the day on InformIT:

http://www.informit.com/store/index.aspx

Also FYI, the printing process is completed and boxes are being shipped to retailers. Amazon should be shipping next week. www.dtracebook.com will be up and running very, very soon, making all the scripts in the book available for download.

Working on a feedback alias, but in the meantime please feel free to send feedback, comments, errata, etc. to jim.ma...@gmail.com

Feel free to post positive feedback publicly. :^)

Thanks
/jim
Re: [dtrace-discuss] pid probes fail on OSX
28728 pthread_key_delete:entry 0
126037 __tcf_0:entry 0
29716 _cleanup:entry 0
29717 _fwalk:entry 0
29643 __sflush:entry 0
29643 __sflush:entry 0
29643 __sflush:entry 0
28729 _exit:entry 2
50753 cerror:entry 2
28752 cthread_set_errno_self:entry 2
28614 __error:entry 2
28614 __error:entry 2
28463 stub helpers:entry 2
51054 dyld_stub_binder:entry 2
28697 _dyld_fast_stub_entry(void*, long):entry 2
27717 dyld::fastBindLazySymbol(ImageLoader**, unsigned long):entry 2
28106 ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&):entry 2
28010 ImageLoaderMachO::segActualLoadAddress(unsigned int) const:entry 2
28074 ImageLoaderMachOCompressed::segmentCommandOffsets() const:entry 2
28090 ImageLoaderMachOCompressed::resolve(ImageLoader::LinkContext const&, char const*, unsigned char, int, ImageLoader const**, Imag:entry 2
28075 ImageLoaderMachOCompressed::libImage(unsigned int) const:entry 2
28088 ImageLoaderMachOCompressed::resolveTwolevel(ImageLoader::LinkContext const&, ImageLoader const*, bool, char const*, ImageLoader:entry 2
28002 ImageLoaderMachO::findExportedSymbol(char const*, bool, ImageLoader const**) const:entry 2
28102 ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const:entry

On Dec 7, 2010, at 5:09 PM, Hans Stimer wrote:

Thank you for the quick response. I tried a number of things and got some interesting results. In short:

1) pid doesn't work with -c date, but does seem to work with other executables. On OS X, date is an executable and not a script or builtin.
2) pid doesn't let you instrument all entries; I created a hello world program and dtrace didn't return for over 5 minutes.

* Modify to point to hello world program:

$ sudo dtrace -n 'pid$target:::entry { trace(arg0); }' -c ./out
dtrace: description 'pid$target:::entry ' matched 7947 probes
^C

I gave it over 5 minutes and it didn't return.

* Trace everything connected to date:

$ sudo dtrace -n 'pid$target:::entry { trace(arg0); }' -c date
Tue Dec 7 13:49:34 PST 2010
dtrace: invalid probe specifier pid$target:::entry { trace(arg0); }: failed to create entry probe for 'stub_binding_helper': No such process

* Find entry points for date:

$ sudo dtrace -ln 'pid$target:::entry,pid$target:::return' -c date
Tue Dec 7 13:52:55 PST 2010
dtrace: invalid probe specifier pid$target:::entry,pid$target:::return: failed to create entry probe for 'stub_binding_helper': No such process

* Find entry points for hello world program:

$ sudo dtrace -ln 'pid$target:::entry,pid$target:::return' -c ./out
ID      PROVIDER   MODULE  FUNCTION                        NAME
144623  pid95634   dyld    stub_binding_helper             entry
144624  pid95634   dyld    dyld_func_lookup                entry
144625  pid95634   dyld    offset_to_dyld_all_image_infos  entry
144626  pid95634   dyld    _dyld_start                     entry
144627  pid95634   dyld    dyld_fatal_error                entry
. . .

* Find entry points for top:

$ sudo dtrace -ln 'pid$target:::entry,pid$target:::return' -c top
ID      PROVIDER   MODULE  FUNCTION                        NAME
36684   pid95691   dyld    _dyld_start                     entry
36685   pid95691   dyld    std::vector<dyld_image_info, std::allocator<dyld_image_info> >::erase(__gnu_cxx::__normal_iterator<dyld_image_info*, std::vecto  entry
36686   pid95691   dyld    std::_Vector_base<dyld_image_info, std::allocator<dyld_image_info> >::~_Vector_base()  entry
36687   pid95691   dyld    dyld_fatal_error                entry
. . .

On Tue, Dec 7, 2010 at 12:27 PM, Jim Mauro james.ma...@oracle.com wrote:

When you say "straight from the dtrace book", I assume you mean the soon-to-be-published book? If that is the case, please allow me to clarify.
Not every invocation of dtrace - one-liners and scripts - in the book (and there are hundreds) is tested and guaranteed to work under all three operating systems (Solaris, OS X, FreeBSD). In fact, some will not work across different Solaris releases, depending on (for example) which provider is used. We are very clear in the book with our examples about the differences in the underlying implementations. There are a great many scripts and one-liners in the book that will only work on Solaris.

The good news is that, in addition to the actual examples, there is a lot of emphasis on methodology in the book, such that you can use DTrace to figure out which probes/modules, etc., you should be instrumenting based on the problem
Re: [dtrace-discuss] Solaris Internals Resource Threshold being hit
Mike is correct. Pretty much every time I've seen this, it's VM (VM = virtual memory = swap) related.

There's a DTrace script below you can run when you hit this problem that will show us which system call is failing with an EAGAIN error. It is most likely fork(2) (and yes, I know printing the errno in the return action is superfluous given we use it in the predicate - it's me being OCD and sanity checking). A second DTrace script further down should provide a kernel stack trace if it is a fork(2) failure.

Or (disk is cheap) swap -a (add swap space) and see if the problem goes away.

Thanks
/jim

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall:::entry
{
        self->flag[probefunc] = 1;
}

syscall:::return
/self->flag[probefunc] && errno == 11/
{
        printf("syscall: %s, arg0: %d, arg1: %d, errno: %d\n\n", probefunc, arg0, arg1, errno);
        self->flag[probefunc] = 0;
}

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::forksys:entry
{
        self->flag = 1;
        @ks[stack(), ustack()] = count();
}

syscall::forksys:return
/self->flag && arg0 == -1 && errno != 0/
{
        printf("fork failed, errno: %d\n", errno);
        printa(@ks);
        clear(@ks);
        exit(0);
}

On Oct 29, 2010, at 12:00 PM, Robin Cotgrove wrote:

I need some assistance and guidance in writing a DTrace script, or even better, finding an example one which would help me identify what's going on in our system. Intermittently, and we think it might be happening after about 60 days, on an E2900, 192GB, 24 core, Solaris 10 11/06 system with a fairly new patch cluster (Generic_142900-13), we suddenly hit a problem which results in processes failing to start with a 'resource temporarily unavailable' error. This is leading to Oracle crash/startup issues. I ran a simple du command at the time it was happening and got the following response:

'du: No more processes: Resource temporarily unavailable'

Approximately 6500 TCP connections on the server at the time. 6000 unix processes. The max UNIX processes per user is set to 29995. 60GB free physical memory and no swap being used. Absolutely baffling us at the mo. Not managed to truss a failing command when it happened yet because it's so intermittent in its nature. We've checked all the usual suspects including max processes per user and cannot find the cause. Need a way to monitor all the internal kernel resources to see what we're hitting. Suggestions please on a postcard. All welcome.

Robin Cotgrove
--
This message posted from opensolaris.org
Re: [dtrace-discuss] Solaris Internals Resource Threshold being hit
Thanks Mike. Good point on the script. Indeed, use of speculative tracing would be a better fit here. I'll see if I can get something together and send it out.

Thanks,
/jim

On Oct 29, 2010, at 4:45 PM, Mike Gerdts wrote:

On Fri, Oct 29, 2010 at 2:50 PM, Robin Cotgrove ro...@rjcnet.co.uk wrote:

Sorry guys. Swap is not the issue. We've had this confirmed by Oracle and I can clearly see there is 96GB of swap available on the system and ~50GB of main memory.

By who at Oracle? Not everyone is equally qualified. I would tend to trust Jim Mauro (who co-wrote the books[1] on Solaris internals, performance, and dtrace) over most of the people you will get to through normal support channels.

1. http://www.amazon.com/Jim-Mauro/e/B001ILM8NC/

How do you know that available swap doesn't momentarily drop? I've run into plenty of instances where a system has tens of gigabytes of free memory but is woefully short on reservable swap (virtual memory, as Jim approximates). Usually vmstat 1 is helpful in observing spikes, but as I said before this could miss very short spikes. If you've already done this to see that swap is unlikely to be an issue, knowing that would be useful. If you are measuring the amount of reservable swap with swap -l, you are doing it wrong.

I do agree that there can be other shortfalls that can cause this. This may call for speculative tracing of stacks across the fork entry and return calls, displaying results only when the fork fails with EAGAIN. Jim's second script is similar to what I suggest, except that it doesn't show the code path taken between syscall::forksys:entry and syscall::forksys:return. Also, I would be a little careful running the second script as-is for long periods of time if you have a lot of forksys activity with unique stacks. I think that, as it is, @ks may grow rather large over time because the successful forks are not cleared.

--
Mike Gerdts
http://mgerdts.blogspot.com/
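[For the archives, a rough sketch of the speculative approach Mike describes might look like the following. This is my own untested construction, not a script from the thread; speculatively tracing every fbt entry between forksys entry and return is expensive and will likely need a larger specsize, so treat it as a starting point only:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option specsize=8m

syscall::forksys:entry
{
        self->spec = speculation();
}

/* Speculatively record every kernel function entered during the fork */
fbt:::entry
/self->spec/
{
        speculate(self->spec);
        printf("%s\n", probefunc);
}

/* fork failed with EAGAIN - keep the speculative trace */
syscall::forksys:return
/self->spec && arg0 == -1 && errno == 11/
{
        commit(self->spec);
        self->spec = 0;
}

/* fork succeeded (or failed differently) - throw the trace away */
syscall::forksys:return
/self->spec/
{
        discard(self->spec);
        self->spec = 0;
}
]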
Re: [dtrace-discuss] Distinguish filesystem from UNIX domain socket activity in syscall provider
You could first filter out the target file system for the file IO by doing a count() aggregation on:

@[fds[arg0].fi_fs] = count();

NOTE - this will only work for those system calls that take a file descriptor as arg0.

Once you know the FS target for the file IO (ufs? zfs? whatever), use it in a predicate to capture the system calls of interest:

/ fds[arg0].fi_fs == "ufs" /

(for example - replace "ufs" with whatever came up in the previous result). And, again, this will only work for the system calls that take a file descriptor as arg0.

HTH
/jim

On Jul 13, 2010, at 7:10 AM, Ralph Böhme wrote:

Dtracers, I'm investigating some performance issues using a simple dtrace script which collects read, write, stat, ... syscall data. This works fine and gives me some good data. Unfortunately the app I'm tracing is concurrently doing file IO and IO on a UNIX domain socket. As I'm only interested in data for file IO, I'm investigating how I can modify my syscall probes. I'm currently doing stuff like this:

/*
 * Save syscall entry info
 */
syscall::*stat*:entry,
syscall::*open*:entry,
syscall::*read*:entry,
syscall::*write*:entry,
syscall::getdents*:entry,
syscall::*sync*:entry
/((OPT_command || OPT_pid) && pid == $target) || (OPT_name && execname == NAME)/
{
        /* set start details */
        self->start = timestamp;
        self->vstart = vtimestamp;
}

/*
 * Print return data
 */
syscall::*stat*:return,
syscall::*open*:return,
syscall::*read*:return,
syscall::*write*:return,
syscall::getdents*:return,
syscall::*sync*:return
/self->start/
{
        /* calculate elapsed time */
        this->elapsed = (timestamp - self->start) / 1000;
        self->start = 0;
        this->cpu = (vtimestamp - self->vstart) / 1000;
        self->vstart = 0;
        @count[probefunc] = count();
        @sum[probefunc] = sum(this->elapsed);
        @elapsed[probefunc] = quantize(this->elapsed);
        @oncpu[probefunc] = quantize(this->cpu);
        @avg[probefunc] = avg(this->elapsed);
}

Any hint for me? Thanks!
-Ralph
--
This message posted from opensolaris.org
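[Putting Jim's first step into a runnable form, a discovery pass might look like this sketch; "myapp" is a placeholder execname, not from the thread:

#!/usr/sbin/dtrace -s
/*
 * Step 1: discover which file systems the app's fd-based IO hits.
 * "myapp" is a placeholder execname.
 */
syscall::read:entry,
syscall::write:entry
/execname == "myapp"/
{
        @fs[fds[arg0].fi_fs] = count();
}

UNIX domain socket IO typically shows up under its own fi_fs value (sockfs), so once the counts confirm the target file system, a predicate like / fds[arg0].fi_fs == "zfs" / on Ralph's entry probes would exclude the socket traffic.]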
Re: [dtrace-discuss] ?: DTrace of connection establishment attempts for destination address/port
Hi Steffen - I actually think that code did change quite a bit from s10 to nv. I'm not sure what you need to do, but you may want to grab Brendan's DTraceToolkit and have a look at tcptop and tcpsnoop, and have a look at how Brendan did it for s10.

Thanks,
/jim

On May 5, 2010, at 8:21 AM, Steffen Weiberle wrote:

Using Solaris 10, so no IP provider, I am trying to figure out when a connection attempt is failing in a client. The connect() statement takes a file descriptor, which is an integer and has no connection details. Since I don't know if the connect is closely preceded by a bind() - a new connection may be attempted with the same socket, and thus I won't be capturing the bind - I can't rely on it to get the sockaddr structure. (The event happens very infrequently, maybe once a day.)

So I am trying the tcp_connect FBT probe:

fbt::tcp_connect:entry
{
        printf("execname: %s \n", execname);
        self->sockaddr = args[1];
        printf("family: %x \n", self->sockaddr->sa_family);
}

However, I get

# /var/tmp/connect.d
dtrace: failed to compile script /var/tmp/connect.d: line 32: sa_family is not a member of struct msgb

Note it says 'msgb', not 'sockaddr'. I got the args[1] from the OpenSolaris source code (and maybe that is my problem, since I am running on Solaris 10 - did the code change that much? seems unlikely):

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp_socket.c#232

Any good way to grab the IPv4 address and port number on an entry so I can have it in the situation the connect() fails?

Thanks
Steffen
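[One s10-friendly alternative, since connect(3SOCKET) passes the sockaddr pointer as arg1, is to copy it in at the syscall layer rather than fighting the fbt argument types. A hedged, untested sketch of my own - it assumes AF_INET, and the inet_ntoa()/ntohs() subroutines, which need a reasonably recent Solaris 10 update:

#!/usr/sbin/dtrace -s
/*
 * Sketch: stash the peer address at connect() entry so it can be
 * reported if the call fails. AF_INET only; untested.
 */
syscall::connect:entry
{
        self->sin = (struct sockaddr_in *)copyin(arg1, sizeof (struct sockaddr_in));
}

syscall::connect:return
/self->sin && errno != 0/
{
        printf("%s: connect to %s:%d failed, errno %d\n", execname,
            inet_ntoa(&self->sin->sin_addr.s_addr),
            ntohs(self->sin->sin_port), errno);
}

syscall::connect:return
{
        self->sin = 0;
}
]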
Re: [dtrace-discuss] pid$target: Not enough space
Check out slides 17 and 18 of DTrace Tips, which you can find here:

http://blogs.sun.com/bmc/entry/dtrace_tips_tricks_and_gotchas

On May 4, 2010, at 6:23 PM, Jianhua Yang wrote:

I wanted to find out what a single-threaded process was doing with dtrace, but it returned a "Not enough space" error:

# dtrace -ln 'pid$target:::entry,pid$target:::return {trace(timestamp);}' -p 25050
dtrace: invalid probe specifier pid$target:::entry,pid$target:::return {trace(timestamp);}: failed to create return probe for 'kdstf1011000km': Not enough space

The questions are: what does "Not enough space" mean? How do I avoid such errors?

Thanks for your help!

Jianhua Yang
--
This message posted from opensolaris.org
Re: [dtrace-discuss] matching the -module-
Interrupts not being serviced? Do you mean device interrupts - NIC and HBA? Or higher-level interrupts, like errors? That seems a wee bit extreme, but intrstat should help here. Or use dtrace and determine which interrupt service routines are running. If you're referring to device interrupts, the service routines all have "intr" in the function name.

j...@pae-g4-nv:~# dtrace -n 'fbt::*intr*:entry { @[probefunc] = count(); }'
dtrace: description 'fbt::*intr*:entry ' matched 339 probes
^C

The above probably seems extreme, but just run it for a few seconds and see which probes actually fire. Fine-tune your probe spec from there.

On Apr 28, 2010, at 10:14 AM, Steve Gonczi wrote:

Do these posts have some connection to the thread topic?

BTW, the mdb settings recommended by Jim Mauro actually did not make a difference; I jumped to the wrong conclusion, based on an incorrect test setup.

My sense is that interrupts are not being serviced. Assume that most cpus have been given a cli instruction, and are just ignoring interrupts most of the time. Does anyone have a suggestion on how to prove/disprove this?
--
This message posted from opensolaris.org
Re: [dtrace-discuss] Very sluggish system, yet mpstat/vmstat shows little
Groan. I'm such an idiot. I should have been more precise in terms of where and when you can set this. Sorry folks.

On Apr 22, 2010, at 11:11 AM, Richard Skelton wrote:

Hi Jim,
If I set "set idle_cpu_prefer_mwait = 0" in /etc/system on an X2270 running Solaris 10 10/09 s10x_u8wos_08a X86, I get a kernel panic upon reboot :-(
Is this setting only for recent versions of OpenSolaris?
Cheers
Richard.
--
This message posted from opensolaris.org
Re: [dtrace-discuss] __lwp_park() showing up as most time-consuming when dtrace an application
It's used to put threads to sleep that are blocking on user locks (at least that's my recollection). Run prstat -Lmp PID_OF_APP - what does the LCK column look like? Try running plockstat -A -p PID_OF_APP.

Thanks,
/jim

Dtrace Email wrote:

Hi, when doing dtrace on an application, __lwp_park() seems to be taking a lot of time. What does it really do? Is it waiting for threads?

Thanks,
Re: [dtrace-discuss] about sched interrupt
PID 0 isn't really a process. It's the kernel - that's something of an oversimplification, but for the purposes of this discussion, just know that pid 0 is not a process.

Interrupts do not target processes. Devices bind interrupts to CPUs at initialization time. Use 'echo ::interrupts | mdb -k' to determine which devices have interrupts bound to which CPUs, and/or intrstat. mpstat will show you interrupt rates per CPU as well.

As I said, if you think interrupts are getting in the way, create a processor set with the CPU(s) handling high rates of interrupts, thereby fencing them off from running application code.

Thanks,
/jim

Qihua Wu wrote:

Hi Max/Jim,

Based on your script, I created the following script to print which pid is interrupted how many times every 2 seconds. The result shows pid 0 is interrupted most: about 50k every 2 seconds, that's 25k per second. How could it be interrupted so frequently? For the clock interrupt, it only interrupts 100 times per second, which is much less than 25k per second. Where could it come from? The box has lots of network traffic from mysql; I'd think the network interrupts would cause the mysqld (pid 10490) to be interrupted instead of pid 0.

#!/usr/sbin/dtrace -s

interrupt-start
{
        self->intr = curthread->t_intr;
}

interrupt-start
/self->intr/
{
        @inter[self->intr->t_procp->p_pidp->pid_id, cpu] = count();
}

interrupt-complete
/self->intr/
{
        self->intr = 0;
}

profile:::tick-2sec
{
        printf("report completed at time: %Y\n", walltimestamp);
        printf("***\n");
        printa("%d-%d: %@d\n", @inter);
        trunc(@inter, 0);
}

0 89152 :tick-2sec report completed at time: 2010 Jan 24 02:48:56
***
0-0: 1
0-2: 1
0-8: 1
0-12: 1
0-14: 1
0-16: 1
0-20: 1
0-21: 1
0-22: 1
1095-7: 4
0-6: 5
0-5: 8
264-7: 8
10490-7: 4962
0-7: 49131

0 89152 :tick-2sec report completed at time: 2010 Jan 24 02:48:58
***
0-0: 1
0-5: 4
1095-7: 4
10490-7: 4200
0-7: 50305

0 89152 :tick-2sec report completed at time: 2010 Jan 24 02:49:00
***
0-3: 4
1095-7: 4
0-5: 8
25388-7: 818
10490-7: 1582
0-7: 51507

On Fri, Jan 22, 2010 at 4:21 PM, m...@bruningsystems.com wrote:

Hi Qihua Wu,
Here is a script that shows the process args for the process that was interrupted, and the cpu. It is easily modified to add the pid.

#!/usr/sbin/dtrace -s

interrupt-start
{
        self->intr = curthread->t_intr;
}

interrupt-start
/self->intr/
{
        @[stringof(self->intr->t_procp->p_user.u_psargs), cpu] = count();
}

interrupt-complete
/self->intr/
{
        self->intr = 0;
}

max

Qihua Wu wrote:

So do we have a way to know what causes the interrupt (this may be got by intrstat) and which pid/execname is interrupted?

Thanks
Daniel

On Thu, Jan 21, 2010 at 10:43 PM, Jim Mauro james.ma...@sun.com wrote:

sched is the execname of the PID 0 process (run ps -e). The string sched gets plugged into the DTrace execname variable if the CPU is in an interrupt handler when a probe fires. CPU 0 is very likely taking the clock interrupts, which by default occur every 10 milliseconds.

HTH,
/jim

Qihua Wu wrote:

Why is sched so often interrupted on cpu 0 instead of being distributed evenly across all cpus?

dtrace -n 'sdt:::interrupt-start { @num[execname,cpu] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C
sched  4 355
sched 18 358
sched  8 369
sched 10 370
sched 11 370
sched 12 408
sched 13 408
sched 15 408
sched
Re: [dtrace-discuss] about sched interrupt
According to your intrstat data, it looks like an nge network device and a disk sata controller both have interrupts bound to cpu 7. Max provided a DTrace script to determine which process is having its thread(s) pinned when an interrupt comes in.

If you're concerned about the possibility of a CPU handling a high rate of interrupts while also running application threads, causing a performance issue, just fence the CPU off. It's easy, and dynamic. For example, if cpu7 is handling most of the interrupts, just put it in a processor set:

psrset -c 7

Once you've done that, the kernel will not schedule user threads on cpu7 unless you explicitly bind processes/threads to the processor set. By NOT doing any binding, you've effectively fenced off cpu7 - all it will do is handle interrupts. If it doesn't help, or makes things worse, just run "psrset -d 1" to delete the set.

/jim

Qihua Wu wrote:

So do we have a way to know what causes the interrupt (this may be got by intrstat) and which pid/execname is interrupted?

Thanks
Daniel

On Thu, Jan 21, 2010 at 10:43 PM, Jim Mauro james.ma...@sun.com wrote:

sched is the execname of the PID 0 process (run ps -e). The string sched gets plugged into the DTrace execname variable if the CPU is in an interrupt handler when a probe fires. CPU 0 is very likely taking the clock interrupts, which by default occur every 10 milliseconds.

HTH,
/jim

Qihua Wu wrote:

Why is sched so often interrupted on cpu 0 instead of being distributed evenly across all cpus?

dtrace -n 'sdt:::interrupt-start { @num[execname,cpu] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C
sched  4 355
sched 18 358
sched  8 369
sched 10 370
sched 11 370
sched 12 408
sched 13 408
sched 15 408
sched 19 408
sched 20 408
sched 21 408
sched 22 408
sched 23 408
sched 14 409
sched  9 410
sched  2 411
sched 16 411
sched  6 422
sched 17 456
sched  3 469
sched  5 510
sched  7 564
sched  1 3079
sched  0 535071

Thanks,
Daniel
Re: [dtrace-discuss] about sched interrupt
sched is the execname of the PID 0 process (run ps -e). The string sched gets plugged into the DTrace execname variable if the CPU is in an interrupt handler when a probe fires. CPU 0 is very likely taking the clock interrupts, which by default occur every 10 milliseconds.

HTH,
/jim

Qihua Wu wrote:

Why is sched so often interrupted on cpu 0 instead of being distributed evenly across all cpus?

dtrace -n 'sdt:::interrupt-start { @num[execname,cpu] = count(); }'
dtrace: description 'sdt:::interrupt-start ' matched 1 probe
^C
sched  4 355
sched 18 358
sched  8 369
sched 10 370
sched 11 370
sched 12 408
sched 13 408
sched 15 408
sched 19 408
sched 20 408
sched 21 408
sched 22 408
sched 23 408
sched 14 409
sched  9 410
sched  2 411
sched 16 411
sched  6 422
sched 17 456
sched  3 469
sched  5 510
sched  7 564
sched  1 3079
sched  0 535071

Thanks,
Daniel
Re: [dtrace-discuss] Looking for help on 2 items...
iostat -Cx 1 is your friend. The -C flag will provide a rollup per controller (c1, c2, etc.) so you can determine the IO rate on a per-controller basis (IOPS and bandwidth). I'd start there. DTrace rocks, but you should be able to answer this question with iostat.

/jim

Michael Brian - IL wrote:

I am pretty new to DTrace but use the DTrace Toolkit when trying to troubleshoot I/O issues on Oracle. I am looking for help on how to do the following: I am trying to answer whether adding more HBA cards/ports would be effective. To do this, I need to know the I/Os per second as well as total bandwidth per second. Has anyone done this before? Does anyone have any other ideas on how to attack this problem? I have been tuning Oracle for quite some time now, and I am continually asked to prove what I tend to know naturally: that the classic 1 HBA, 2 port card isn't cutting it. I also have similar discussions on whether I am saturating the bus on a particular box.

Brian
--
Brian P Michael
Technical Management Consultant
Rolta TUSC, Inc.
micha...@tusc.com
630-960-2909 x1181
http://www.tusc.com
[dtrace-discuss] DTrace Book - rough cuts
(shameless plug). We have a DTrace book underway, and while the final product won't appear until summer 2010, we're leveraging the Safari Books Online Rough Cuts facility to make early drafts of chapters generally available. Right now, there are 3 chapters available, with more coming over the next few months. You can arrange for automatic email notification when there are updates.

We welcome any and all feedback.

Thanks,
Jim Mauro
Brendan Gregg
Chad Mynhier
Tariq Magdon-Ismail
Re: [dtrace-discuss] DTrace Book - rough cuts
Hi Dave - yes, we plan on covering Java extensively, perhaps even in a dedicated chapter. That bit is not underway yet, but it's in the plan.

Thanks,
/jim

David Blasingame, Area Systems Support Engineer wrote:

Great plug. Are you considering having information on DTrace and Java in the book?

Dave

On 11/23/09 10:43, Jim Mauro wrote:

(shameless plug). We have a DTrace book underway, and while the final product won't appear until summer 2010, we're leveraging the Safari Books Online Rough Cuts facility to make early drafts of chapters generally available. Right now, there are 3 chapters available, with more coming over the next few months. You can arrange for automatic email notification when there are updates. We welcome any and all feedback.

Thanks,
Jim Mauro
Brendan Gregg
Chad Mynhier
Tariq Magdon-Ismail
Re: [dtrace-discuss] Who're stealing memory ?
Have you grabbed periodic snapshots with ps or prstat? These will give you a sense of which processes have large physical memory footprints, and you can pmap from there.

Thanks,
/jim

Simon wrote:

Hi Jim,

"In order to be sure, you need to do some additional memory accounting and determine how much RAM you need to support the shared segments for Sybase, and the JVMs."

It's difficult for me now since I don't know what the real troublemaker causing this issue is. I guess the JVMs, and I will suggest reducing the shared memory allocation for Sybase - that's controlled in the Sybase configuration file, not in /etc/system. Thanks.

Best Regards,
Simon

On Sun, Nov 22, 2009 at 11:33 AM, Jim Mauro james.ma...@sun.com wrote:

You have about 9GB of shared memory (on a 16GB machine).

"From the prstat output, we found 3 sybase processes, and each process created 12 threads; the java process (launched by the customer application) created 370 threads in total. I think it's too many threads (especially of the java program) that generate excessive stacks/heaps, and finally used up the RAM?"

Java can consume a lot of memory. Need to see the memory sizes, but it's certainly a possibility.

"So I think decreasing the shared memory used by Sybase (defined at the Sybase configuration layer, not in the /etc/system file) would be helpful?"

Sure. If you take memory away from one consumer, it leaves more for the others. Whether or not it actually solves your problem - meaning after such a change the system has sufficient memory to run without paging - remains to be seen. In order to be sure, you need to do some additional memory accounting and determine how much RAM you need to support the shared segments for Sybase, and the JVMs.

Thanks,
/jim
Re: [dtrace-discuss] Who're stealing memory ?
You have about 9GB of shared memory (on a 16GB machine).

"From the prstat output, we found 3 sybase processes, and each process created 12 threads; the java process (launched by the customer application) created 370 threads in total. I think it's too many threads (especially of the java program) that generate excessive stacks/heaps, and finally used up the RAM?"

Java can consume a lot of memory. Need to see the memory sizes, but it's certainly a possibility.

"So I think decreasing the shared memory used by Sybase (defined at the Sybase configuration layer, not in the /etc/system file) would be helpful?"

Sure. If you take memory away from one consumer, it leaves more for the others. Whether or not it actually solves your problem - meaning after such a change the system has sufficient memory to run without paging - remains to be seen. In order to be sure, you need to do some additional memory accounting and determine how much RAM you need to support the shared segments for Sybase, and the JVMs.

Thanks,
/jim
Re: [dtrace-discuss] [perf-discuss] Who're stealing memory ?
Right. All your memory appears to be anon segments - 13.4GB worth. About 9GB of that is the shared memory segments. That leaves 4.4GB.

I see 13 Java processes listed. Assuming they have a similar memory footprint as the one pmap example, which shows about 40MB of RSS, that's (40MB x 13) about 500MB. It's of course possible that the other JVMs are using more memory than the 1 pmap example. The ps output or prstat would be helpful here.

You still need to account for about 3.9GB of anon usage...

Thanks,
/jim

Simon wrote:

Hi Michael,

Now the system has been reset, so that process has disappeared, but I found some similar processes launched by the same user kplus, e.g.:

kplus 20905 0.1 1.1 464984 180288 ? S 08:18:38 2:17 /usr/java/bin/java -Dprogram.name=run.sh -server -Xms128m -Xmx512m -Dsun.rmi.dgc.client.gcInterval=360 -Dsun.rmi.dgc.server.gcInterval=360 -Djava.awt.headless=true -Djava.endorsed.dirs=/export/home1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /export/home1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/lib/tools.jar org.jboss.Main

Thanks.
Best Regards,
Simon

On Sat, Nov 21, 2009 at 4:22 PM, Michael Schulte mschu...@sunspezialist.de wrote:

Hey Simon,

# pmap -x 28447
28447: /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootc

Can you give the full argument list of this java? Memory de-allocation in Java is fully asynchronous in garbage collection and can be tuned by command line options when starting the application. Just Google for the exact syntax.

Michael

Simon wrote:

Hi Jim,

Thanks for your reply, here's my update:

"Did you check for additional virtual space usage in /tmp?"

df -k shows only 1% used in the /tmp filesystem:

swap 10943720 968 10942752 1% /tmp
swap 10942832 80 10942752 1% /var/run

"Are you using ZFS (ARC space needed for that)?"

No zfs used, all filesystems are UFS.

"You can also try using the ::memstat mdb dcmd to break out kernel memory further."

> ::memstat
Page Summary            Pages       MB    %Tot
Kernel                 111925      874      5%
Anon                  1715077    13399     83%
Exec and libs           64697      505      3%
Page cache              71828      561      3%
Free (cachelist)        51148      399      2%
Free (freelist)         43872      342      2%
Total                 2058547    16082
Physical              2037012    15914

As above, the anonymous memory is very high. I think some user threads are using memory in an abnormal way. I checked one of the processes with pmap -x and found many stack/heap segments, e.g.:

# ps -ef | grep bea | grep -v grep
kplustp 28447 1 0 07:01:37 ? 0:26 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28447 1 0 07:01:37 ? 0:26 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28443 1 0 07:01:37 ? 2:29 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28445 1 0 07:01:37 ? 1:24 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28457 1 0 07:01:38 ? 0:50 /export/home1/bea/jdk160_05//bin/java -Xms512m -Xmx1024m -Djava.awt.headless=tr
kplustp 28453 1 0 07:01:37 ? 1:55 /export/home1/bea/jdk160_05//bin/java -Xms512m -Xmx1024m -Xbootclasspath/p:./..
kplustp 28449 1 0 07:01:37 ? 0:25 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28508 1 0 07:01:44 ? 1:15 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -classpath ./../
kplustp 28451 1 0 07:01:37 ? 1:25 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28455 1 0 07:01:37 ?
1:27 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28439 1 0 07:01:36 ? 0:28 /export/home1/bea/jdk160_05//bin/java -Djava.awt.headless=true -Xbootclasspath/
kplustp 28441 1 0 07:01:36 ? 0:26
Re: [dtrace-discuss] Who're stealing memory ?
If you're running out of memory, which it appears you are, you need to profile the memory consumers, and determine if you have either a memory leak somewhere, or an under-configured system.

Note 16GB is really tiny by today's standards, especially for an M5000-class server. It's like putting an engine from a Ford sedan into an 18-wheel truck - the capacity to do work is severely limited by a lack of towing power. Laptops ship with 8GB these days...

Back to memory consumers. We have:
- The kernel
- User processes
- The file system cache (which is technically part of the kernel, but significant enough that it should be measured separately)

Is the database on a file system, and if so, which one (UFS? ZFS? VxFS?)? How much shared memory is really being used (ipcs -a)?

If the system starts off well, and degrades over time, then you need to capture memory data over time and see which area is growing. Based on that data, we can determine if something is leaking memory, or you have an underconfigured machine. I would start with:

echo ::memstat | mdb -k
ipcs -a
ps -eo pid,vsz,rss,class,pri,fname,args
prstat -c 1 30
kstat -n system_pages

You need to collect that data at some regular interval, with timestamps. The interval depends on how long it takes the machine to degrade. If the system goes from fresh boot to degraded state in 1 hour, I'd collect the data every second. If the machine goes from fresh boot to degraded state in 1 week, I'd grab the data every 2 hours or so.

/jim

Simon wrote:

Hi Experts,

Here's a performance-related question; please help to review what I can do to get the issue fixed.

IHAC who has one M5000 with Solaris 10 10/08 (KJP: 13-01) installed and 16GB RAM configured, running Sybase ASE 12.5 and a JBoss application. Recently, they felt the OS got very slow after running for some time. Collected vmstat data points to a memory shortage:

# vmstat 5
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m4 m5   in   sy   cs us sy id
 0 0 153 6953672 254552 228 228 1843 1218 1687 0 685 3 2 0 0 2334 32431 3143 1 1 97
 0 0 153 6953672 259888 115 115 928 917 917 0 264 0 35 0 2 2208 62355 3332 7 3 90
 0 0 153 6953672 255688 145 145 1168 1625 1625 0 1482 0 6 1 0 2088 40113 3070 2 1 96
 0 0 153 6953640 256144 111 111 894 1371 1624 0 1124 0 6 0 0 2080 55278 3106 3 3 94
 0 0 153 6953640 256048 241 241 1935 2585 3035 0 1009 0 18 0 0 2392 40643 3164 2 2 96
 0 0 153 6953648 257112 236 235 1916 1710 1710 0 1223 0 7 0 0 2672 62582 3628 3 4 93

As above, the w column is very high all the time, and the sr column is also kept very high, which indicates the page scanner is activated and busy paging out, but the CPU is very idle. Checked /etc/system and found one improper entry:

set shmsys:shminfo_shmmax = 0x

So I think it's the improper shared memory setting that caused too much physical RAM to be reserved by the application, and I suggested adjusting the shared memory to 8GB (0x2), but per customer feedback it seems to have gotten worse, based on the new vmstat output:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr m0 m1 m4 m5   in   sy   cs us sy id
 0 6 762 3941344 515848 18 29 4544 0 0 0 0 4 562 0 1 2448 25687 3623 1 2 97
 0 6 762 4235016 749616 66 21 4251 2 2 0 0 0 528 0 0 2508 50540 3733 2 5 93
 0 6 762 4428080 889864 106 299 4694 0 0 0 0 1 573 0 7 2741 182274 3907 10 4 86
 0 5 762 4136400 664888 19 174 4126 0 0 0 0 6 511 0 0 2968 241186 4417 18 9 73
 0 7 762 3454280 193776 103 651 2526 3949 4860 0 121549 11 543 0 5 2808 149820 4164 10 12 78
 0 9 762 3160424 186016 61 440 1803 7362 15047 0 189720 12 567 0 5
3101 119895 4125 6 13 81
 0 6 762 3647456 403056 44 279 4260 331 331 0 243 10 540 0 3 2552 38374 3847 5 3 92

The w and sr values increased instead; why? I also attached the prstat output - it's a prstat snapshot after the shared memory adjustment; please help to have a look. What can I do next to get the issue solved? What are the possible factors causing the memory shortage again and again, even though they have 16GB RAM + 16GB swap? Is the physical RAM really short? Or is there any useful dtrace script to trace the problem?

Thanks very much!

Best Regards,
Simon
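[A minimal collection wrapper for the commands Jim lists above might look like the following sketch; the interval and log path are placeholders of my own, not from the thread:

#!/bin/sh
# Rough sketch: timestamped memory snapshots at a fixed interval.
# Adjust INTERVAL to match how quickly the system degrades.
INTERVAL=7200
LOG=/var/tmp/memstats.log
while true; do
        date >> $LOG
        echo ::memstat | mdb -k >> $LOG
        ipcs -a >> $LOG
        ps -eo pid,vsz,rss,class,pri,fname,args >> $LOG
        prstat -c 1 30 >> $LOG
        kstat -n system_pages >> $LOG
        sleep $INTERVAL
done
]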
Re: [dtrace-discuss] value of zero from count()
D'oh! Thanks

Jonathan Adams wrote:

On Tue, Oct 27, 2009 at 02:00:52PM -0400, Jim Mauro wrote:

I've run into this from time to time. Simple example:

# dtrace -n 'hotspot27563::: { @[probename]=count(); } tick-1sec { printa(@); clear(@); }'

The sample output (below) shows a couple of probe names with a value of zero. The fact that it appears in the aggregation at all implies that the minimal value should be 1 (I would think).

clear() clears all values in the aggregation to 0; to remove all elements from the aggregation, you want trunc(@a). (clear() can be useful if you want a running list of everything seen, but counted only for the most recent hits.)

Cheers,
- jonathan

(FYI - this is on OSOL 2009.06, x86 system).
Thanks, /jim

gc-begin           0
gc-end             0
mem-pool-gc-begin  0
mem-pool-gc-end    0
thread-start       3
thread-stop        3

gc-begin           0
gc-end             0
mem-pool-gc-begin  0
mem-pool-gc-end    0
thread-start       4
thread-stop        4
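[For the archives, Jonathan's fix applied to Jim's one-liner would be the following (hotspot27563 being the pid-specific hotspot provider from the example); with trunc(), empty keys drop out each interval instead of lingering at 0:

# dtrace -n 'hotspot27563::: { @[probename] = count(); } tick-1sec { printa(@); trunc(@); }'
]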
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
I'm cross-posting to zfs-discuss, as this is now more of a ZFS query than a dtrace query at this point, and I'm not sure all the ZFS experts are listening on dtrace-discuss (although they probably are... :^).

The only thing that jumps out at me is the ARC size - 53.4GB, or most of your 64GB of RAM. This in-and-of-itself is not necessarily a bad thing - if there are no other memory consumers, let ZFS cache data in the ARC. But if something is coming along to flush dirty ARC pages periodically...

I found what looked to be an applicable bug:

CR 6699438 zfs induces crosscall storm under heavy mapped sequential read workload

but the stack signature for the above bug is different than yours, and it doesn't sound like your workload is doing mmap'd sequential reads. That said, I would be curious to know if your workload uses mmap(), versus read/write?

For the ZFS folks just seeing this, here's the stack frame:

unix`xc_do_call+0x8f
unix`xc_wait_sync+0x36
unix`x86pte_invalidate_pfn+0x135
unix`hat_pte_unmap+0xa9
unix`hat_unload_callback+0x109
unix`hat_unload+0x2a
unix`segkmem_free_vn+0x82
unix`segkmem_zio_free+0x10
genunix`vmem_xfree+0xee
genunix`vmem_free+0x28
genunix`kmem_slab_destroy+0x80
genunix`kmem_slab_free+0x1be
genunix`kmem_magazine_destroy+0x54
genunix`kmem_depot_ws_reap+0x4d
genunix`taskq_thread+0xbc
unix`thread_start+0x8

Let's see what the fsstat and zpool iostat data look like when this starts happening...

Thanks,
/jim

Jim Leonard wrote:

"It would also be interesting to see some snapshots of the ZFS arc kstats: kstat -n arcstats"

Here you go, although I didn't see anything jump out (massive amounts of cache misses or something). Any immediate trouble spot?

# kstat -n arcstats
module: zfs    instance: 0
name: arcstats    class: misc
        c                          53490612870
        c_max                      67636535296
        c_min                      8454566912
        crtime                     212.955493179
        deleted                    7267003
        demand_data_hits           179708165
        demand_data_misses         189797
        demand_metadata_hits       9959277
        demand_metadata_misses     194228
        evict_skip                 1709
        hash_chain_max             9
        hash_chains                205513
        hash_collisions            9372169
        hash_elements              851634
        hash_elements_max          886509
        hdr_size                   143082240
        hits                       198822346
        l2_abort_lowmem            0
        l2_cksum_bad               0
        l2_evict_lock_retry        0
        l2_evict_reading           0
        l2_feeds                   0
        l2_free_on_write           0
        l2_hdr_size                0
        l2_hits                    0
        l2_io_error                0
        l2_misses                  0
        l2_rw_clash                0
        l2_size                    0
        l2_writes_done             0
        l2_writes_error            0
        l2_writes_hdr_miss         0
        l2_writes_sent             0
        memory_throttle_count      0
        mfu_ghost_hits             236508
        mfu_hits                   165895558
        misses                     388618
        mru_ghost_hits             70149
        mru_hits                   24777390
        mutex_miss                 6094
        p                          49175731760
        prefetch_data_hits         7993639
        prefetch_data_misses       370
        prefetch_metadata_hits     1161265
        prefetch_metadata_misses   4223
        recycle_miss               7149
        size                       53490565328
        snaptime                   5759009.53378144
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
dtrace -n ':::xcalls { @s[stack()] = count(); } tick-1sec { trunc(@s, 10); printa(@s); clear(@s); }'

That will tell us where the xcalls are coming from in the kernel, and we can go from there.

Thanks,
/jim

Jim Leonard wrote:

We have a 16-core x86 system that, at seemingly random intervals, will completely stop responding for several seconds. Running mpstat 1 showed something horrifying:

CPU minf mjf xcal    intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0    0   0 1004691  397  170   0    0    0    5   0     0   0 100  0   0
(rest of CPUs omitted)

That's over a million cross-calls a second. Seeing them on CPU 0 made me nervous that they were kernel-related, so I wrote a dtrace script to print out xcalls per second aggregated by PID to see if a specific process was the culprit. Here's the output during another random system outage:

2009 Sep 22 12:51:49, load average: 5.90, 5.35, 5.39   xcalls: 637511
PID     XCALLCOUNT
6164    15
6165    15
28339   26
0       637455

PID 0 is sched (aka the kernel). At this point I'm completely stumped as to what could be causing this. Any hints or ideas?
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
As Dan said, it looks like ZFS is busy.

How much RAM is on this system? What release of Solaris? Do you have any ZFS tweaks in /etc/system (like clamping the ARC size...)? Is the system memory constrained?

The xcalls are due to the page unmaps out of what I'm assuming is the ZFS ARC (although I'm not 100% sure about that - I need to check the source for segkmem_zio_free()).

Can you gather some ZFS IO statistics, like "fsstat zfs 1" for a minute or so?

Thanks,
/jim

Jim Leonard wrote:

Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:

unix`xc_do_call+0x8f
unix`xc_wait_sync+0x36
unix`x86pte_invalidate_pfn+0x135
unix`hat_pte_unmap+0xa9
unix`hat_unload_callback+0x109
unix`hat_unload+0x2a
unix`segkmem_free_vn+0x82
unix`segkmem_zio_free+0x10
genunix`vmem_xfree+0xee
genunix`vmem_free+0x28
genunix`kmem_slab_destroy+0x80
genunix`kmem_slab_free+0x1be
genunix`kmem_magazine_destroy+0x54
genunix`kmem_depot_ws_reap+0x4d
genunix`taskq_thread+0xbc
unix`thread_start+0x8
418560

Any ideas?
Re: [dtrace-discuss] How to drill down cause of cross-calls in the kernel? (output provided)
It would also be interesting to see some snapshots of the ZFS arc kstats:

kstat -n arcstats

Thanks

Jim Leonard wrote:

Thanks for the awesome two-liner, I'd been struggling with 1-second intervals without a full-blown script. I modified it to output walltime so that I could zoom in on the problem, and here it is:

unix`xc_do_call+0x8f
unix`xc_wait_sync+0x36
unix`x86pte_invalidate_pfn+0x135
unix`hat_pte_unmap+0xa9
unix`hat_unload_callback+0x109
unix`hat_unload+0x2a
unix`segkmem_free_vn+0x82
unix`segkmem_zio_free+0x10
genunix`vmem_xfree+0xee
genunix`vmem_free+0x28
genunix`kmem_slab_destroy+0x80
genunix`kmem_slab_free+0x1be
genunix`kmem_magazine_destroy+0x54
genunix`kmem_depot_ws_reap+0x4d
genunix`taskq_thread+0xbc
unix`thread_start+0x8
418560

Any ideas?
Re: [dtrace-discuss] Pass argument to dtrace script for use as predicate
From http://wikis.sun.com/display/DTrace/Scripting:

"If you want your D macro arguments to be interpreted as string tokens even if they match the form of an integer or identifier, prefix the macro variable or argument name with two leading dollar signs (for example, $$1) to force the D compiler to interpret the argument value as if it were a string surrounded by double quotes. All the usual D string escape sequences (see Table 2-5) are expanded inside of any string macro arguments, regardless of whether they are referenced using the $arg or $$arg form of the macro. If the defaultargs option is set, unspecified arguments that are referenced with the $$arg form have the value of the empty string ("")."

If you're passing a string, use $$1.

/jim

Stathis Kamperis wrote:

2009/9/12 Stathis Kamperis ekamp...@gmail.com:

2. If I leave $1 without double quotes and the execname doesn't have a dot in it, it fails with:

$ pfexec ./lala.d temp
dtrace: failed to compile script ./lala.d: line 11: failed to resolve temp: Unknown variable name

Ok, I got lucky with google this time:
http://solaris.reys.net/passing-command-line-parameters-to-dtrace-scripts

I have to write:

$ pfexec ./lala.d '"temp"'

for dtrace to treat it as a string argument.

3. If I surround $1 with double quotes, it fails no matter what the supplied argument is:

$ pfexec ./lala.d temp.sh
dtrace: failed to compile script ./lala.d: line 31: extraneous argument 'temp.sh' ($1 is not referenced)

$ pfexec ./lala.d temp
dtrace: failed to compile script ./lala.d: line 31: extraneous argument 'temp' ($1 is not referenced)

I still can't figure this out. Does "$1" inside a script make dtrace perceive it as a literal string '$1'?

Cheers,
Stathis
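[A minimal sketch of the $$1 form Jim recommends - my own illustration, not Stathis's script - which lets you run it as ./script.d temp.sh with no quoting games:

#!/usr/sbin/dtrace -s
/*
 * $$1 forces the first macro argument to be treated as a string,
 * so the execname comparison compiles no matter what is passed.
 */
syscall:::entry
/execname == $$1/
{
        @[probefunc] = count();
}
]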
[dtrace-discuss] LISA '09 Tutorials
In the shameless plug category, I have two tutorials scheduled for the Usenix LISA '09 conference, running in Baltimore, MD, Nov 1-6, 2009.

Sunday, Nov 1, is a full-day DTrace tutorial.
Monday, Nov 2, is a full-day Solaris/OpenSolaris Performance, Observability and Tools tutorial.

And of course, Richard Elling has a full-day ZFS tutorial on Monday, Nov 2 (unfortunately, it does overlap with the Solaris performance tutorial). Jeff Victor is doing a full day of Resource Management with Solaris Containers on Thursday, Nov 5.

Come one, come all!

Thanks,
/jim
Re: [dtrace-discuss] Check that memory is filled by zero or not after process dies
You're actually asking multiple questions here, because in order to verify if a particular operating system zero-fills memory pages when they are freed from an address space, you'd need to first know which kernel function is called to zero-fill the pages, right?

I created a simple DTrace script to trace all kernel functions executed when a process exits:

-- ex.d ---
#!/usr/sbin/dtrace -s

#pragma D option flowindent

proc::proc_exit:exit
/ pid == $target /
{
        self->trace = 1;
}

fbt:::
/ self->trace /
{
}
--

This script uses the DTrace proc provider and enables the exit probe, which fires when a process exits. The predicate following the probe, / pid == $target /, tests that the PID is one I'm interested in. In this case, I'm running the script (named ex.d) as:

# ex.d -c ./mem

The -c flag instructs DTrace to plug the PID of the command into the $target macro. Very handy. The mem program is a quick 10-line C program I wrote that malloc's 50MB of RAM, touches every page, and exits.

So when the proc::proc_exit:exit probe fires, and the predicate evaluates TRUE, meaning it's the PID I'm interested in, we set a thread-local variable called trace to 1. The second probe entry, fbt:::, basically enables every fbt provider probe. The fbt provider allows us to instrument the entry and return point of pretty much every kernel function. With the trace flag set, and the flowindent option set in the script, when the process exits I get an easy-to-read list of kernel functions called.

None of this answers the question "are memory pages zero-filled when they are freed", because I don't (or at least did not) know the name of the kernel function that zero-fills memory pages. When I went through the kernel calls, nothing really jumped out at me as a zero-fill function. It was clear I was freeing memory - I saw as_free() (free an address space), which loops through the segments in the address space, calling the segment-specific unmap code, and eventually page_free(). But nothing that looked like it was zero'ing out freed pages.

So... I know for sure that allocation of anonymous memory uses a ZFOD (zero fill on demand) mechanism, so I decided to have a look at that. DTrace makes this so easy - I have three probes called zfod available, that each reside in a different kernel function - anon_map_createpages(), anon_zero() and anon_map_getpages(). So I did some dtrace'ing on these functions to see what they called, and discovered pagezero(), which sounded a lot to me like a "zero fill a memory page" function. A quick look at the source code verified this. So I know empirically that memory pages are zero-filled when allocated (anon memory pages, that is):

# dtrace -n 'fbt::pagezero:entry / pid == $target / { @c = count(); }' -c ./mem
dtrace: description 'fbt::pagezero:entry ' matched 1 probe
dtrace: pid 5016 has exited
6642

In the dtrace above, I enabled the entry point of the kernel pagezero() function, and used the DTrace count() aggregating function to count the number of times I call it. I know my mem program allocates 50MB, which is 6400 8k memory pages. I see I called pagezero() 6642 times, which aligns pretty well with what I expected (other anon allocations happen when you start a process, beyond my 50MB heap segment).

As for process exit and freed pages, I have not found any indication that freed pages are zero'd. Mind you, this does not represent an exhaustive study (I only spent a few minutes on it).

HTH,
/jim

Dmitry Afanasyev wrote:

Hello.
I've got a script for systemtap and the Linux kernel that checks whether physical memory is filled with zeros after a process dies. The script is attached. The probe is set in the free_pages_bulk function, which frees a number of memory pages using __free_one_page. I can't set the probe in __free_one_page, because it is inline and systemtap doesn't support it. The result is the address of the first dirty (not zero-filled) page in the bulk. I need the same script for the Solaris kernel. I guess that DTrace provides the same functionality as systemtap, but I am not familiar with the Solaris kernel at all.
Re: [dtrace-discuss] CPU time and IO time
The easiest way to do this is using the sleep/wakeup probes in the sched provider. From the process/thread perspective, once they issue an IO, they sit on a sleep queue until the IO is completed, at which point they're issued a wakeup. io:::start/io:::done is useful for a system view, but that's not what you're interested in.

You could also enable probes at the system call level, and time things like syscall::read:entry to syscall::read:return, using predicates as needed to make sure you're measuring what you expect.

I find your last sentence confusing: "because the io event might take a long time to finish, before which, the program would have gone back to the cpu". The thread that issued the IO will not go back on-cpu until the IO has been completed, unless the thread is using the AIO (async IO) facility.

Thanks,
/jim

Hi, I am trying to characterize some benchmark programs using dtrace. The first step is to determine the time that a program spends on the CPU (doing computations), and the time it spends doing IO. The cpu time can be easily determined using the on-cpu and off-cpu probes. I am having trouble figuring out the io time. Using the io:::start and io:::done probes will not give me the correct numbers, because the io event might take a long time to finish, before which, the program would have gone back to the cpu... Any ideas? Thanks.
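[A sketch of the sched sleep/wakeup approach Jim suggests, modeled on the sched-provider example in the Solaris Dynamic Tracing Guide; "app" is a placeholder execname. Note sched:::wakeup fires in the context of the waking thread, so the sleeping thread is identified through its lwpsinfo address rather than a self-> variable:

#!/usr/sbin/dtrace -s
/*
 * Sum the time "app" threads spend asleep (for an IO-bound thread,
 * this approximates IO wait). Sketch only; not from the thread.
 */
sched:::sleep
/execname == "app"/
{
        bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
        @slept[args[1]->pr_fname] = sum(timestamp - bedtime[args[0]->pr_addr]);
        bedtime[args[0]->pr_addr] = 0;
}
]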
Re: [dtrace-discuss] Question about ID from trace
It's the ID of the probe, not the provider. /jim Andrea Cucciarre' wrote: I guess that the ID you see is the ID of the provider, not the PID On 07/10/09 16:01, Robert Alatalo wrote: Hello, Trying to track down what application is causing the system to reboot by turning the uadmin system call's reboot into a panic, we are running the following script:
-x-x- start of uadmin.d -x-x-
#!/usr/sbin/dtrace -s
#pragma D option destructive
syscall::uadmin:entry
/arg0 == 1 && uid == 0/
{
        panic();
}
syscall::uadmin:entry
{
        trace("I am not root");
}
-x-x- end of uadmin.d -x-x-
We are getting a string of output: dtrace: script '/usr/local/bin/uadmin.d' matched 2 probes dtrace: allowing destructive actions CPU ID FUNCTION:NAME 258 8063 uadmin:entry I am not root 258 8063 uadmin:entry I am not root 258 8063 uadmin:entry I am not root 258 8063 uadmin:entry I am not root 258 8063 uadmin:entry I am not root 261 8063 uadmin:entry I am not root 256 8063 uadmin:entry I am not root 256 8063 uadmin:entry I am not root 259 8063 uadmin:entry I am not root 257 8063 uadmin:entry I am not root 257 8063 uadmin:entry I am not root 256 8063 uadmin:entry I am not root 256 8063 uadmin:entry I am not root 256 8063 uadmin:entry I am not root 257 8063 uadmin:entry I am not root 258 8063 uadmin:entry I am not root ... ... I thought that the ID should be a PID, but # ps -ef | grep 8063 does not find it Any suggestions on how to track down this rogue entity which is futilely trying to do something with uadmin? thanks, Robert ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
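When the goal is simply to identify who is issuing the call rather than to panic the box, a nondestructive sketch like this may be all that's needed:
#!/usr/sbin/dtrace -s
/* report the process, user, and user stack for every uadmin(2) call */
syscall::uadmin:entry
{
        printf("%s pid %d uid %d cmd %d\n", execname, pid, uid, arg0);
        ustack();
}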
Re: [dtrace-discuss] dtrace performance overhead
Try this;
#!/usr/sbin/dtrace -s
#pragma D option quiet
extern int errno;
syscall::forkall:return,
syscall::vfork:return,
syscall::forksys:return,
syscall::fork1:return
/ arg0 == -1 || arg1 == -1 /
{
        printf("FORK FAILED, errno: %d, arg0: %d, arg1: %d\n", errno, arg0, arg1);
}
___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] dtrace performance overhead
"not enough space" indicates an errno 28, ENOSPC, which isn't listed in the fork man page under ERRORS. Are you sure it's fork(2) that's failing? It may be errno 12, ENOMEM. So what does a general memory health profile of the system look like? Lots of free memory? Plenty of swap space? How about the process executing the forks - is it a 32-bit or 64-bit process? If 32-bit, is it hitting the limits of its address space? /jim tester wrote: Hi Jim, The app software doesn't produce an errno in its logs (bad software, although from a leading vendor - I think they inherited it), but an error string says "not enough space". I tried grepping some of the header files but could not find a match. /var/adm/messages: that's the first thing I looked at, but no help there either. I tried to match the error string in src.opensolaris.org; some hits turned up but I wasn't confident they were a result of this. Thank you. ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
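To settle the ENOSPC-versus-ENOMEM question directly, a small sketch built on the same return probes as the script above can aggregate failures by process and errno:
#!/usr/sbin/dtrace -s
/* count fork failures by process name and errno */
syscall::forkall:return,
syscall::vfork:return,
syscall::forksys:return,
syscall::fork1:return
/ (int)arg0 == -1 /
{
        @fails[execname, errno] = count();
}
The default end-of-run output shows each (process, errno) pair with its count, which can then be matched against the error strings in the application logs.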
Re: [dtrace-discuss] dtrace performance overhead
D'oh! Disregard that last question (address space) - my brain was thinking thread create failures - it's not applicable to fork failures. My bad. The system memory and swap space health checks still apply, as well as process count - grab some sar -v 1 60 samples. /jim Jim Mauro wrote: "not enough space" indicates an errno 28, ENOSPC, which isn't listed in the fork man page under ERRORS. Are you sure it's fork(2) that's failing? It may be errno 12, ENOMEM. So what does a general memory health profile of the system look like? Lots of free memory? Plenty of swap space? How about the process executing the forks - is it a 32-bit or 64-bit process? If 32-bit, is it hitting the limits of its address space? /jim tester wrote: Hi Jim, The app software doesn't produce an errno in its logs (bad software, although from a leading vendor - I think they inherited it), but an error string says "not enough space". I tried grepping some of the header files but could not find a match. /var/adm/messages: that's the first thing I looked at, but no help there either. I tried to match the error string in src.opensolaris.org; some hits turned up but I wasn't confident they were a result of this. Thank you. ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] dtrace performance overhead
Which example are you using - specopen.d, the script that instruments every fbt probe? Please post it, or be more precise about which script you're using. If you're using specopen.d, then you're enabling on the order of 30,000 probes. That's going to add up, even at the very reasonable cost of about 1 microsecond per probe (rough approximation). In general, the cost of using DTrace in terms of probe effect is directly proportional to two things;
- the number of probes you enable.
- the activity of the code path of the enabled probe(s).
In other words, if you instrument the fork system call, and the system is not executing forks, you won't feel a thing. If, on the other hand, the system is doing 1000 forks per second, you'll feel it. The specopen.d script enables all the fbt probes. Even with the predicate ( / self->spec / ), the probe still has to fire before the predicate gets evaluated. HTH, /jim tester wrote: Hi, On a T5220, when using speculative tracing there is a significant increase in system load. I am using the examples from http://wikis.sun.com/display/DTrace/Speculative+Tracing The system call traced is fork instead of open64. Can that script cause such a load? The system itself without the script is moderately loaded (usr+sys ~30-40). Also, I have to bump up the number of speculative buffers to 10 to avoid some of the warnings. What is a good rule for setting the number of speculative buffers? In this case, is it based on the number of fork calls that could be simultaneously traced? Thanks ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
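For comparison, a cut-down sketch of the speculative-tracing pattern narrowed to fork, which enables a handful of probes instead of all of fbt (cfork as the common kernel fork routine is an assumption to verify with dtrace -l; nspec is the speculative-buffer count the poster asks about, and sizing it to the number of forks that can be in flight simultaneously is the usual rule):
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option nspec=16
syscall::fork1:entry
{
        self->spec = speculation();
}
fbt::cfork:entry,
fbt::cfork:return
/ self->spec /
{
        speculate(self->spec);
        printf("%d: %s:%s\n", timestamp, probefunc, probename);
}
syscall::fork1:return
/ self->spec && (int)arg0 == -1 /
{
        /* the fork failed - keep the speculated trace data */
        commit(self->spec);
        self->spec = 0;
}
syscall::fork1:return
/ self->spec /
{
        /* the fork succeeded - throw the trace data away */
        discard(self->spec);
        self->spec = 0;
}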
Re: [dtrace-discuss] dtrace performance overhead
I'm sorry, but I am unable to parse this. What is the question here? Thanks, /jim tester wrote: Counting system calls by process during this interval, dtrace came out on top: ioctl dtrace 10609 I am not sure if that is from the speculative dtrace script or the script used to count the system calls. Thanks ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] dtrace performance overhead
Ah, OK - I think I get it. tester wrote: Counting system calls by process during this interval, dtrace came out on top: ioctl dtrace 10609 Got it. DTrace executed 10,609 system calls during your sampling period, more than any other process. I often filter dtrace out in a predicate; / execname != "dtrace" / I am not sure if that is from the speculative dtrace script or the script used to count the system calls. Both. If you want to know precisely which one, you can use pid as an aggregation key in the count aggregation, and correlate the pid to the script you're starting. I'm not sure what you're chasing here though... /jim Thanks ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
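A sketch of the filtered, per-PID system call count Jim suggests:
#!/usr/sbin/dtrace -s
/* count system calls per process, excluding the dtrace consumers themselves */
syscall:::entry
/ execname != "dtrace" /
{
        @calls[pid, execname, probefunc] = count();
}
With pid as a key, the two running D scripts show up as separate rows, so it's immediately clear which one is generating the ioctl load.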
Re: [dtrace-discuss] DBWR write performance
I would start with lockstat to determine if there's RW lock contention (and lockstat is a DTrace consumer).
# lockstat -e4-7,34-35 sleep 60 > /var/tmp/rwlocks.out
The above will collect events on kernel reader/writer locks (run lockstat -h to get a description of each event). With that data, we can see if there are long RW lock hold times, and the VxFS kernel routine issuing the call. From that output, we should be able to write a D script that does what you're interested in doing - per-file RW lock hold time tracking. I'm thinking about this now, and will try to get this written for UFS (since I do not have VxFS available). Let us know what the lockstat output looks like, and we'll go from there. Thanks, /jim Michael Jessen wrote: I'm an Oracle DBA, a dtrace noob, and I'm attempting to resolve an Oracle write performance problem on Solaris 10, Oracle 9iR2. The Oracle DBWR processes have poor write throughput writing to VxFS files. I suspect that the problem may be with POSIX readers/writer file locking. Ideally I'd like to identify, for a given DBWR process, readers/writer file lock contention on a datafile-by-datafile basis. Any guidance would be greatly appreciated. Kind regards, Michael ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
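While waiting on the lockstat data, a sketch of the hold-time measurement using the lockstat provider directly (the rw-acquire/rw-release probe names are documented; treat the arg0-as-lock-address keying as an assumption to verify on your release):
#!/usr/sbin/dtrace -s
/* quantize kernel RW lock hold times; arg0 is the lock address */
lockstat:::rw-acquire
{
        self->held[arg0] = timestamp;
}
lockstat:::rw-release
/ self->held[arg0] /
{
        @hold["RW lock hold time (ns)"] = quantize(timestamp - self->held[arg0]);
        self->held[arg0] = 0;
}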
Re: [dtrace-discuss] fbt with predicates and [O]Sol on VBox
FYI - I just tried this in OpenSolaris 2008.11, running in a Vbox (2.2) virtual machine. It's noisy without the predicate for the fbt probes (naturally :^), but it doesn't hang. (Vbox on a Mac host, FWIW). Thanks, /jim Michael Ernest wrote: I've been playing with a follow script example and getting a system hang every time. Turns out when I try:
syscall::read:entry
/execname == "ls"/
{
        self->tr = 1;
}
fbt:::
{ }
syscall::read:return
/self->tr/
{
        self->tr = 0;
}
I get (unusable) output. But when I add /self->tr/ to my fbt::: clause, it hangs the instance every time. I've tried this on Sol u6, Sol u7, OSol b110, OSol Nov 2008, anywhere from 1-2 GB RAM. I don't have a native Solaris instance available today, although I have no reason to believe I'd see this issue on it. Any help? ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] tcp listen backlog queue
Hey Paul - Add this predicate; / arg0 != (conn_t *)0 / Talk soon... /jim Paul Mininni wrote: Hey everyone- With this dtrace script;
#!/usr/sbin/dtrace -qs
#pragma D option aggsize=512k
#pragma D option bufsize=512k
fbt::tcp_conn_request:entry
{
        this->connp = (conn_t *)arg0;
        this->tcpp = (tcp_t *)this->connp->conn_tcp;
        printf("tcp q: %d max: %d\n", this->tcpp->tcp_conn_req_cnt_q, this->tcpp->tcp_conn_req_max);
}
I get this error upon incoming tcp connection; dtrace: error on enabled probe ID 1 (ID 36240: fbt:ip:tcp_conn_request:entry): invalid address (0x0) in action #3 at DIF offset 12 Source indicates this... http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp.c 5275 tcp_conn_request(void *arg, mblk_t *mp, void *arg2) 5276 { ... 5288 conn_t *connp = (conn_t *)arg; 5289 tcp_t *tcp = connp->conn_tcp; ... 5300 if (tcp->tcp_conn_req_cnt_q >= tcp->tcp_conn_req_max) { ... Anyone have an idea why this error shows up? Thanx -Paul ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Applications I/O
You're tripping over the fact that these disk IOs are happening asynchronously to the process/thread that initiated them. The dd(1) process has long since been placed on a sleep queue by the time you're hitting the ARC code, which is why execname is sched (the execname of PID 0 - the user process is not running; it's not on-cpu when the probe fires). If you want to look at the disk IO load being generated by an application, use the syscall provider, or the fsinfo provider. You can't chase IOs and correlate them to user processes once you get below fsinfo - they're off CPU and sleeping while the kernel goes off and deals with the disk IOs. The IO size will typically change as you move down the stack, unless the application issues IOs that align with the underlying file system record size, but even then things like prefetching will result in the physical IO size not necessarily aligning with the number of bytes the application tried to read or write. Starting at the bottom - if you want to monitor physical disk IO rates, use the DTrace io provider, and/or iostat. If you want to track IO rates through a file system, use the DTrace fsinfo provider, and/or the fsstat command. If you want to track disk IOs generated by applications, start with the syscall provider. It's possible to connect the dots up and down the stack, and there should be examples of that in iotop in Brendan's DTraceToolkit. Thanks, /jim Ferrand wrote: Hi, I am still trying to access the application-side I/O size on a ZFS file system. (Example: if /var/zfs_fs is a ZFS file system: /var/zfs_fs # dd if=/dev/urandom of=file bs=4k count=100.) I try to get information as simple as execname == dd, block_size = 4096, in order to trace a specific application's I/O profile. I've tried with fbt::arc_access, fbt::arc_write, fbt::arc_buf_evict, fbt::arc_change_state, fbt::zio_create, ..., but I can't retrieve this information. Sometimes block_size doesn't match 4k (e.g. 128k, ..., or 0 with the zio_create probe). Sometimes execname is sched and hides the application name which initiated the I/O. I've read a lot of the ZFS and ARC source code, but can't find any DTrace probe that provides me this information. Please, need some help. R. ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
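A sketch of the top-of-stack view Jim recommends, using the syscall provider to capture the application's own write sizes before the file system and ARC change them (dd is the example workload from the question):
#!/usr/sbin/dtrace -s
/* distribution of write(2) sizes issued by dd; arg2 is the byte count */
syscall::write:entry
/ execname == "dd" /
{
        @sizes["dd write size (bytes)"] = quantize(arg2);
}
At this layer, a dd with bs=4k shows a clean spike in the 4096 bucket, regardless of what ZFS later turns the IOs into further down the stack.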
Re: [dtrace-discuss] fsflush writes very slow
Cross-posted to perf-discuss. You can't change the write behavior of the app without changing the app itself. The code would need to be modified to issue fsync() calls on the file(s), or open the files for synchronous writes (O_SYNC | O_DSYNC flags). fsflush will run, by default, once per second, which is the fastest rate it can be tuned for. The actual size of the IO is driven by several factors. fsflush is writing 4kB or 8kB because the base page size on x86 is 4k, so fsflush is flushing 1 or 2 pages at a time. You can use truss or dtrace to determine what size writes the application is issuing;
# dtrace -n 'syscall::write:entry { @sizes = quantize(arg2); }'
I assume the application is issuing 32k writes, which would correspond to what you observed with dtrace. If you're looking to optimize write performance, and need to ensure data is committed to disk, you could turn directio on for the file system, which will bypass the page cache and write straight to disk. This can really help, or it can really hurt, depending on to what extent hitting in the page cache is helping performance (like on the read side). Getting to your specific issue, 6-8MB/sec does not sound like much, but what if that's all the application is generating? In other words, using a different load (dd), you observed much higher bandwidth with a large, sequential write, so you know the underlying platform is capable of delivering more bandwidth. At the application level, the app is doing smaller, random writes. What does the iostat -zxnd 1 data look like? Are you getting good latency? If the app is generating 32k IOs, and you're getting 6-8MB/sec (let's call it 7), that means you're generating (7MB / 32k) about 220 IOPS. Depending on what the underlying disk storage is, this may be a very good number, but I need to know more about the disk(s) the UFS is sitting on. Thanks, /jim K Kelley wrote: I'm troubleshooting an I/O performance problem with one of our applications that does a lot of writing, generally blocks just over 32K, sequentially writing large files. It's a Solaris 10 x86 system with UFS disk. We're often only seeing disk write throughput of around 6-8MB/s, even when there is minimal read activity. Running iosnoop shows that most of the physical writes are made by the actual app and average around 32KB. About 15% of the data, however, is done by fsflush and only 4 or 8KB at a time. The write throughput for the fsflush writes is about 10% that of the app writes (using the DTIME values and aggregating the results to get totals). CPU resources are not a bottleneck. If I turn off dopageflush the overall rate jumps to 18-20MB/s. However, this would mean that the file data may not get flushed for a very long time, so it is not a suitable option for production environments. If I do a dd (to create a file of 1GB, the amount of system memory), even with a block size that matches our app, it does much larger block writes, often 1M; the overall rate is around 60MB/s, and there were very few writes by fsflush. Is there any way to cause more of the physical writes to be done by the app rather than fsflush (and for that matter, what determines when fsflush does the flushing rather than the app)? ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Introducing DTrace to a newbie audience
# dtrace -qn 'syscall:::exec-success { trace (execname); }' ^ The exec-success probe is managed by the proc provider, not the syscall provider. So the probe designation should be; proc:::exec-success (or just 'exec-success'). (for the newbies). Thanks, /jim ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] DTrace and I/O
You're looking at byte counts, not block sizes. 56kb sounds typical for UFS, which uses an 8k block size, with 1k frags (default), so you'll typically see IO sizes to/from UFS in multiples of 8k. The actual amount of IO depends, of course, on several factors. You can also just use iostat data. For reads, divide the kilobytes read per second by the reads per second to get the average physical IO size for a given row of data (same for writes). /jim Ferrand wrote: Hi everybody, I am trying to obtain the REAL size of IO in blocks or KB, in order to know how ZFS manages the block size. I've tried the io provider with this->byte_size = args[0]->b_bcount; this->kibi_byte_size = this->byte_size/1024; but my results seem strange (blocks of 56 KB with UFS ...). How can I retrieve the REAL block size for IO? Thanks R. ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
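A sketch of pulling the physical IO size distribution straight from the io provider, per device, which makes the multiples-of-8k pattern on UFS easy to see:
#!/usr/sbin/dtrace -s
/* distribution of physical IO sizes in bytes, keyed by device */
io:::start
{
        @sz[args[1]->dev_statname] = quantize(args[0]->b_bcount);
}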
[dtrace-discuss] Don't Shout at your JBODs
http://www.youtube.com/watch?v=tDacjrSCeq4 I wonder if the inverse is true. If I whisper soothing words of encouragement at my JBODs, will I get more IOPS with reduced latency? :^) ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] disk utilization is over 200%
This is all very odd. iostat is historically extremely reliable. I've never observed stats like that before - zero reads and writes with a non-zero value in the wait queue (forget utilization when it comes to disk - it's a useless metric). IO rates per process are best measured at the VOP layer. Depending on what version of Solaris you're running, you can use the fsinfo provider (fsinfo::fop_read:entry, fsinfo::fop_write:entry). If you don't have the fsinfo provider, instrument the syscall layer to track reads and writes. Can we get another sample, using iostat -zxnd 1 20? Does the application recover from the hang, or does it remain hung and require a kill/restart? Thanks, /jim Jianhua Yang wrote: Hello, I use Brendan's sysperfstat script to see the overall system performance and found that the disk utilization is over 100:
15:51:38 14.52 15.01 200.00 24.42 0.00 0.00 83.53 0.00
15:51:42 11.37 15.01 200.00 25.48 0.00 0.00 88.43 0.00
-- Utilisation -- -- Saturation --
Time %CPU %Mem %Disk %Net CPU Mem Disk Net
15:51:45 11.01 15.01 200.00 12.02 0.00 0.00 95.03 0.00
15:51:48 13.80 15.01 200.00 24.87 0.00 0.00 98.86 0.00
15:51:51 9.44 15.01 200.00 17.02 0.00 0.00 102.64 0.00
15:51:54 9.49 15.01 164.59 9.10 0.00 0.00 83.75 0.00
15:51:57 16.58 15.01 2.83 20.46 0.00 0.00 0.00 0.00
how can I fix this? Is there a new version of this script? My system is an X4600-M1 with hardware RAID: 0+1 = OS disk = 72 GB = d0; 2+3 = apps data disk = 146 GB = d2, an SVM soft partition with one UFS file system. While that file system is active, iostat showed strange output:
cpu us sy wt id
13 9 0 78
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d30
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d40
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d52
0.0 0.0 0.0 0.0 334.0 1.0 0.0 0.0 100 100 c3t2d0
cpu us sy wt id
10 5 0 85
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d30
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d40
0.0 0.0 0.0 0.0 0.0 335.0 0.0 0.0 0 100 md/d52
0.0 0.0 0.0 0.0 334.0 1.0 0.0 0.0 100 100 c3t2d0
kr/s and kw/s show 0, but wait is 334, and at this point the application always hangs.
# dtrace -n 'io:::start { @files[pid, execname, args[2]->fi_pathname] = sum(args[0]->b_bcount); } tick-5sec { exit(); }'
dtrace: description 'io:::start ' matched 7 probes
CPU ID FUNCTION:NAME
8 49675 :tick-5sec
16189 nTrade /export/data/dbxpt3/logs/ledgers/arinapt3.NTRPT3-MOCA.trans_outmsg.ledger 32768
25456 pt_chmod /export/data/dbxpt3/logs/NTRPT3-MOCA.log 32768
3 fsflush none 38912
25418 pt_chmod /export/data/dbxpt3/logs/NTRPT3-MOCA.log 49152
21372 tail /export/data/dbxpt3/logs/NTRPT3-MOCA.log 65536
16189 nTrade /export/data/dbxpt3/logs/ledgers/arinapt3.NTRPT3-MOCA.trans_exerep.ledger 81920
16189 nTrade /export/data/dbxpt3/logs/ntrade.imbalances.log 114688
25419 iostat /export/data/dbxpt3/logs/NTRPT3-MOCA.log 114688
8018 tail /export/data/dbxpt3/logs/NTRPT3-MOCA.log 131072
24915 tail /export/data/dbxpt3/logs/NTRPT3-MOCA.log 147456
16189 nTrade none 207872
20900 tail /export/data/dbxpt3/logs/NTRPT3-MOCA.log 270336
0 sched none 782336
16189 nTrade /export/data/dbxpt3/logs/NTRPT3-MOCA.log 2162688
the write is about 10MB/s. Did the above dtrace script show the real IO going on at that time? Is there a way to find how much IO is generated by each process, and how many IOs are in the IO wait queue? Is there a way to find out the disk RPM besides checking the physical drive?
Thanks, James Yang ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
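A sketch of the VOP-layer accounting Jim suggests for per-process IO rates (the fsinfo provider's read/write probes; on releases without fsinfo, the syscall provider's read/write probes are the fallback):
#!/usr/sbin/dtrace -s
/* count file system reads/writes per process at the VOP layer */
fsinfo:::read,
fsinfo:::write
{
        @ops[execname, probename] = count();
}
tick-5sec
{
        printa(@ops);
        trunc(@ops);
}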
Re: [dtrace-discuss] Is the nfs dtrace script right (from nfsv3 provider wiki)?
No bug here - we can absolutely use DTrace on MP systems, reliably and with confidence. The script output shows some nasty outliers for a small percentage of the reads and writes happening on the server. Time to take a closer look at the IO subsystem. I'd start with iostat -znx 1, and see what the queues look like, along with IO rates and service times. I'd also have a look at the network. Download nicstat and run it (go to blogs.sun.com and search for nicstat - it's easy to find). What are you using for an on-disk file system? UFS or ZFS? /jim Marcelo Leal wrote: OK, but is that a bug, or should it work like that? Can we not use dtrace on multi-processor systems? Sorry, but I don't get it... __ Leal [http://www.eall.com.br/blog] ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Is the nfs dtrace script right (from nfsv3 provider wiki)?
Hmmm... Something is certainly wrong. 11 writes at 137k - 275k seconds (which is where your 1.5M seconds sum is coming from) is bogus. What version of Solaris is this ('uname -a' and 'cat /etc/release')? You're running this on an NFS server, right (not a client)? Is this a benchmark? I ask because the file names (file1, file2, etc) seem like names used in synthetic benchmarks. I need to take a closer look at the code and figure out how to chase this. Either there's a bug somewhere, or we're missing something. Thanks, /jim Marcelo Leal wrote: Hello Jim! Actually I can repeat it... every time I run some D script to collect data I get some (how do you call it? nasty :) values. Look: Fri Dec 5 10:19:32 BRST 2008 Fri Dec 5 10:29:34 BRST 2008 NFSv3 read/write distributions (us):
read
value | count
2 | 0
4 | 1092
8 | 93773
16 | 64481
32 | 11713
64 | 7590
128 | 1156
256 | 698
512 | 1394
1024 | 1729
2048 | 805
4096 | 2732
8192 | 14893
16384 | 9351
32768 | 2988
65536 | 647
131072 | 119
262144 | 29
524288 | 30
1048576 | 28
2097152 | 0
write
value | count
64 | 0
128 | 8
256 | 2418
512 | 22679
1024 | 28442
2048 | 59887
4096 | 68852
8192 | 65152
16384 | 32224
32768 | 11554
65536 | 3162
131072 | 1100
262144 | 446
524288 | 105
1048576 | 70
2097152 | 11
4194304 | 0
8388608 | 0
16777216 | 0
33554432 | 0
67108864 | 0
134217728 | 0
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 | 0
34359738368 | 0
68719476736 | 0
137438953472 | 11
274877906944 | 0
NFSv3 read/write by host (total us):
x.16.0.x 4707246
x.16.0.x 28397213
x.16.0.x 40901275
x.16.0.x 68333664
x.16.0.x 89357734
x.16.0.x 125890329
x.16.0.x 127848295
x.16.0.x
Re: [dtrace-discuss] Is the nfs dtrace script right (from nfsv3 provider wiki)?
Also (I meant to ask) - are you having performance problems, or just monitoring with the NFS provider scripts? Thanks, /jim Marcelo Leal wrote: Hello Jim, this is not a benchmark. The filenames I changed for privacy... This is an NFS server, yes. # uname -a SunOS test 5.11 snv_89 i86pc i386 i86pc # cat /etc/release Solaris Express Community Edition snv_89 X86 Copyright 2008 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 06 May 2008 ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Is the nfs dtrace script right (from nfsv3 provider wiki)?
D'oh! Good spot Max - feeling sheepish that I missed that. Marcelo - add the predicate to the done probes as per Max's message, and let's see where that takes us. Thanks, /jim [EMAIL PROTECTED] wrote: Hi, I have looked at the script, and there is no correspondence between start and done. So, I am not sure how this script is supposed to work. I think there should be a predicate in the done probes... The way the script is written, it assumes that for any start, the done that fires after it is for the same noi_xid. Current script:
nfsv3:::op-read-start, nfsv3:::op-write-start
{
        start[args[1]->noi_xid] = timestamp;
}
nfsv3:::op-read-done, nfsv3:::op-write-done
{ ...
Changed script:
nfsv3:::op-read-start, nfsv3:::op-write-start
{
        start[args[1]->noi_xid] = timestamp;
}
nfsv3:::op-read-done, nfsv3:::op-write-done
/start[args[1]->noi_xid] != 0/
{
That way, you don't have the done probe clause executing for ids where the start has not fired first. (This still does not match start/done for a given xid.) But what do I know... max Jim Mauro wrote: Also (I meant to ask) - are you having performance problems, or just monitoring with the NFS provider scripts? Thanks, /jim Marcelo Leal wrote: Hello Jim, this is not a benchmark. The filenames I changed for privacy... This is an NFS server, yes. # uname -a SunOS test 5.11 snv_89 i86pc i386 i86pc # cat /etc/release Solaris Express Community Edition snv_89 X86 Copyright 2008 Sun Microsystems, Inc. All Rights Reserved. Use is subject to license terms. Assembled 06 May 2008 ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
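Putting Max's fix together, a complete sketch of the corrected latency measurement, which also clears each entry after use so the start[] associative array doesn't grow without bound:
#!/usr/sbin/dtrace -s
nfsv3:::op-read-start, nfsv3:::op-write-start
{
        start[args[1]->noi_xid] = timestamp;
}
nfsv3:::op-read-done, nfsv3:::op-write-done
/ start[args[1]->noi_xid] != 0 /
{
        @lat[probename] = quantize(timestamp - start[args[1]->noi_xid]);
        start[args[1]->noi_xid] = 0;
}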
Re: [dtrace-discuss] Is the nfs dtrace script right (from nfsv3 provider wiki)?
Are you referring to nfsv3rwsnoop.d? The TIME(us) value from that script is not a latency measurement, it's just a time stamp. If you're referring to a different script, let us know specifically which one. /jim Marcelo Leal wrote: Hello there, Ten minutes of trace (latency), using the nfs dtrace script from the nfsv3 provider wiki page, I got total numbers (us) like: 131175486635 ??? thanks! ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] Trying to identify writer and/or reason for iowrite.
The problem you're running into is that disk IO operations tend to occur asynchronously to the thread that initiated the IO, so when the IO provider probe fires, execname shows the process name for PID 0. This is not uncommon when chasing disk and network IOs. You need to capture the write further up the stack. The easiest method for determining which process(es) are writing is to use either the syscall provider or the fsinfo provider (depending on which release of Solaris you're running, fsinfo may not be there). Use the syscall provider and see which system calls are being used to generate disk writes - it's probably write(2), but it may be any of;
nv98 dtrace -l -P syscall | grep write
76295 syscall write entry
76296 syscall write return
76497 syscall writev entry
76498 syscall writev return
76597 syscall pwrite entry
76598 syscall pwrite return
76691 syscall pwrite64 entry
76692 syscall pwrite64 return
#!/usr/sbin/dtrace -s
syscall::write:entry,
syscall::writev:entry,
syscall::pwrite:entry,
syscall::pwrite64:entry
{
        @[pid, probefunc] = count();
}
Once you have the correct system call and process name(s), fine-tune the DTrace and grab a user stack();
#!/usr/sbin/dtrace -s
syscall::write:entry
/ pid == PID_OF_INTEREST /
{
        @[ustack()] = count();
}
The above assumes it's all write(2) system calls. You can determine which files by grabbing arg0 when the probe fires. Depending on which release of Solaris you're running, you can use the fds array to get the file path;
#!/usr/sbin/dtrace -s
syscall::write:entry
{
        @[execname, fds[arg0].fi_pathname] = count();
}
If your version of Solaris is older, and does not have fds available, just track arg0 (the file descriptor), and run pfiles on the process to map the file descriptor to the file. HTH, /jim Robert Alatalo wrote: Hello, Running the iotop dtrace script, we see many entries like
UID PID PPID CMD DEVICE MAJ MIN D BYTES
0 3 0 fsflush md3 85 3 W 512
... ...
0 0 0 sched ssd22 118 176 W 80538112
0 0 0 sched ssd18 118 144 W 80585728
0 0 0 sched ssd24 118 192 W 80730624
0 0 0 sched ssd19 118 152 W 80762880
0 0 0 sched ssd23 118 184 W 80764928
0 0 0 sched ssd25 118 200 W 80965632
0 0 0 sched ssd20 118 160 W 81029120
0 0 0 sched ssd21 118 168 W 81132032
In the iostat we see things like
device r/s w/s kr/s kw/s wait actv svc_t %w %b
... ...
ssd18 0.0 157.2 0.0 17914.5 9.2 13.3 142.9 63 75
ssd19 0.0 161.4 0.0 17887.7 9.6 13.7 144.6 65 76
ssd20 0.0 166.4 0.0 17922.0 8.3 12.7 126.1 58 74
ssd21 0.0 157.6 0.0 17880.0 9.3 13.4 144.1 64 75
ssd22 0.0 153.3 0.0 17867.7 9.2 13.4 147.3 63 75
ssd23 0.0 154.3 0.0 17905.6 8.8 13.0 141.5 61 74
ssd24 0.0 160.4 2.1 17915.4 9.2 13.4 141.3 63 75
ssd25 0.0 160.7 0.0 17934.5 9.7 13.8 145.8 66 76
Can anyone suggest a different script, or help modify iotop as included in the toolkit, so as to either print stack traces, or better yet, print stack traces when the %busy goes above some threshold, or some other way of limiting the data to go through? thanks in advance, Robert ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
Do you have directio enabled on UFS? Especially for the redo logs? With directio enabled, UFS writes to the log do not serialize on the RW lock for the log file(s). directio will also bypass the memory cache, so you need to increase the Oracle db_block_buffers when enabling UFS directio. Thanks, /jim Hans-Peter wrote: Hi Mauro, Yes I understand why sysinfo is not the best to measure IO. But I just wanted to see when in the whole trace the actual physical write was being done. So it seems to me that, because the sysinfo:::pwrite is right at the end the performance bottle neck is more because of the locking etc. The database files are on ufs filesystems. Is zfs better? Regards Hans-Peter ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
For the record, my friend Phil Harman reminded me that it's not the log files we care about for directio in terms of single-writer lock break-up. We care about directio for redo logs to avoid read-modify-write, which happens when the write is not memory-page aligned. Sorry about that. Nonetheless, you must still pursue getting UFS directio enabled and working on your system (dealing with the single-writer lock problem is still important, it's just not directly applicable to the redo logs). Thanks, /jim Hans-Peter wrote: Do you have directio enabled on UFS? Especially for the redo logs? That is the strange thing. filesystemio_options has been set to setall which is asynch and directio. But when I dtrace the fbt calls I see only directio calls for the database writer but not for the log writer. Perhaps we should mount the filesystems with forcedirectio. Would noatime also help? Regards With directio enabled, UFS writes to the log do not serialize on the RW lock for the log file(s). directio will also bypass the memory cache, so you need to increase the Oracle db_block_buffers when enabling UFS directio. Thanks, /jim Hans-Peter wrote: Hi Mauro, Yes I understand why sysinfo is not the best to measure IO. But I just wanted to see when in the whole trace the actual physical write was being done. So it seems to me that, because the sysinfo:::pwrite is right at the end the performance bottle neck is more because of the locking etc. The database files are on ufs filesystems. Is zfs better? Regards Hans-Peter ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
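One way to check from DTrace whether the log writer's writes are actually taking the directio path - a sketch, assuming ufs_directio_write is the UFS directio write routine on your release (verify with dtrace -l | grep directio):
#!/usr/sbin/dtrace -s
/* count UFS directio writes per process; the Oracle log writer should appear here if directio is in effect */
fbt::ufs_directio_write:entry
{
        @dio[execname] = count();
}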
Re: [dtrace-discuss] How to dig deeper
%busy is meaningless unless you're looking at a single disk that can only have 1 outstanding IO in its active queue, which is to say %busy is a useless metric for any disk that's been designed and built in the last decade. Ignore %busy. Focus on queue depths and queue service times, both of which are reported in iostat. And again, averages can be misleading. Use the iostat data as a starting point, but if you really think you have some disk IO issues, use DTrace and aggregations to get more precise data on disk IO times. Thanks, /jim Hans-Peter wrote: Hello Jim, The iostat averages are not too bad. But the busy% is 100% most of the time. regards Hans-Peter ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
The sysinfo provider isn't the best choice for measuring disk IO times. Run;
# dtrace -s /usr/demo/dtrace/iotime.d
/jim Hans-Peter wrote: Hello all, I added a clause to my script.
sysinfo:::
/self->traceme == 1 && pid == $1/
{
        trace(execname);
        printf("sysinfo: timestamp : %d ", timestamp);
}
A subsequent trace created a file of about 19000 lines. I loaded it in Excel to be able to subtract timestamps etc. The longest jump in timestamp is between the first pair of savectx and restorectx at line 70. I count 50 savectx/restorectx calls in my trace. But the actual physical write, as indicated by sysinfo, is almost at the end of the file, directly after the ufs_write and fop_write call end (line 18746).
24 - tsd_agent_get oracle timestamp 1795469946839100
24 - ufs_lockfs_top_vop_return oracle timestamp 1795469946841500
24 - ufs_lockfs_top_vop_return oracle timestamp 1795469946844300
24 - ufs_lockfs_end oracle timestamp 1795469946846700
24 - ufs_write oracle timestamp 1795469946849600
24 - fop_write oracle timestamp 1795469946853500 57,365,700
24 | pwrite syswrite oracle sysinfo timestamp 1795469946856800
24 | pwrite writech oracle sysinfo timestamp 1795469946860200
So it seems that the actual write is not the bottleneck, but I attached a zip file with the Excel document. Am I right in thinking that this is more an OS issue than a storage issue? Regards Hans-Peter ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
Also - since this is Oracle, are the Oracle files on a file system, or raw devices? If a file system, which one? /jim Jim Mauro wrote: The sysinfo provider isn't the best choice for measuring disk IO times. Run; # dtrace -s /usr/demo/dtrace/iotime.d /jim Hans-Peter wrote: Hello all, I added a clause to my script. [same script and trace excerpt as in the previous message] ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
Start with iostat. It's simple, and provides an average of service times for disk IOs (iostat -xnz 1; the asvc_t column is average service time in milliseconds). As Jim Litchfield pointed out in a previous thread, keep in mind it is an average, so you won't see nasty peaks, but if the average is bad, it's time to talk to the SAN/Storage folks. Use DTrace with the IO provider to get more precise measurements. You can borrow from existing scripts in /usr/demo/dtrace to add time stamps, and use aggregations to capture IO times. If you need assistance with such a script, please let us know and we'll be glad to post one (it looks like you're already pretty comfortable with DTrace...). Thanks, /jim Hans-Peter wrote: In order to get more information on IO performance problems I created the script below:
#!/usr/sbin/dtrace -s
#pragma D option flowindent
syscall::*write*:entry
/pid == $1 && guard++ == 0/
{
        self->ts = timestamp;
        self->traceme = 1;
        printf("fd: %d", arg0);
}
fbt:::
/self->traceme/
{
        /* elapsd = timestamp - self->ts; printf(" elapsed : %d ", elapsd); */
        printf(" timestamp : %d ", timestamp);
}
syscall::*write*:return
/self->traceme/
{
        self->traceme = 0;
        elapsed = timestamp - self->ts;
        printf(" timestamp : %d ", timestamp);
        printf("\telapsed : %d ", elapsed);
        exit(0);
}
It gives me the timestamp for every fbt call during a write system call. A snippet is here below
8 - schedctl_save timestamp : 1627201334052600
8 - savectx timestamp : 1627201334053000
0 - restorectx timestamp : 1627202741110300 difference = 1.407.057.300
0 - schedctl_restore timestamp : 1627202741115100
0 - schedctl_restore timestamp : 1627202741116100
Visible is that the thread is off-cpu for 1.4 s. Storage is on a SAN with fibre between the system and the storage. Is it possible to dig deeper with dtrace to see how the HBAs are performing? Other suggestions are welcome too. Regards Hans-Peter ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
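In the spirit of Jim's offer, a small sketch of an IO-provider latency script, using the classic start/done pairing keyed on the buf pointer (arg0):
#!/usr/sbin/dtrace -s
/* per-device distribution of physical IO times */
io:::start
{
        iostart[arg0] = timestamp;
}
io:::done
/ iostart[arg0] /
{
        @times[args[1]->dev_statname] = quantize(timestamp - iostart[arg0]);
        iostart[arg0] = 0;
}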
Re: [dtrace-discuss] Concurrent Disk Activity from Solaris Performance Tools Chap. 4.17.4
Hi Paul - One thing I have been puzzled with a lot this weekend is the information and plot in Figure 4.7. This section, if I understand it correctly, offers the means to track the actual times from when an IO starts in the kernel to when it completes, implying the time to either read or write from disk or memory cache. I have been using a data file for an Oracle database as the test subject for this work. I have several confusion points with this section. • The graph mentions ‘strategy’ and ‘biodone’ which seem to imply TNF based data, not output from the DTrace script above Figure 4.7. The DTrace script uses the IO provider io:::start and io:::done probes. io:genunix::start enables several probes, including one in the bdev_strategy kernel routine;
nv98 dtrace -n 'io:genunix::start'
dtrace: description 'io:genunix::start' matched 3 probes
CPU ID FUNCTION:NAME
0 22920 bdev_strategy:start
0 22920 bdev_strategy:start
0 22920 bdev_strategy:start
io:genunix::done enables a probe in biodone;
nv98 dtrace -n 'io:genunix::done'
dtrace: description 'io:genunix::done' matched 1 probe
CPU ID FUNCTION:NAME
0 22908 biodone:done
The actions in the dtrace script gather timestamps and block numbers at each firing (assuming the predicates evaluate true). • In looking at the data gotten from the DTrace script I see no way to generate the graph of Figure 4.7. Specifically the time difference between ‘0’ and the points for ‘strategy’. With the DTrace script we have the start time of the IO. I see no way to determine some amount of time between ‘start’ and something earlier. The time values on the ‘X’ axis also don’t fall out of the data generated by the DTrace script. The script in the book generates output that looks like this;
nv98 ./iot.d
122065039977,2100,
122065039980,,2100
122070310632,72,
122070310637,,72
122070310833,81,
122070310836,,81
. . .
The value on the left is the timestamp, the value on the right is the block number. The data was imported into a spreadsheet, and the math was done on the start time and stop (done) time for each block, resulting in IO times on a per-block basis. • How can it be determined if the IO completed from a memory cache or required an I/O to a physical disk? I have a lot of times less than 0.5 ms but also have a fair number that are in the range of 1 ms to 300 ms. The lookup in the file system memory cache happens above the bio layer. Put another way, if we're hitting bdev_strategy, we need to do a disk IO to get the data (we already missed in the cache). HTH, Thanks, /jim I modified the script to dump out the size of the I/O being done and it was interesting to see some unexpected lengths. Also added ‘start’ and ‘end’ to the appropriate lines as a sanity check, to make it easier to pair up the entries. Should always have one start/end pair for a block address. I have attached an Excel spreadsheet with what I was able to create based on the data collected. My thanks for any clarifications to these confusions. pdc ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
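A sketch of what an iot.d-style script could look like, emitting the same timestamp/block pairs the spreadsheet was built from (the exact published script isn't reproduced here; this is an approximation using the documented io provider arguments):
#!/usr/sbin/dtrace -s
#pragma D option quiet
/* start records: "timestamp,block," ; done records: "timestamp,,block" */
io:::start
{
        printf("%d,%d,\n", timestamp / 1000, args[0]->b_blkno);
}
io:::done
{
        printf("%d,,%d\n", timestamp / 1000, args[0]->b_blkno);
}
Subtracting the start timestamp from the done timestamp for each block number yields the per-block IO times plotted in the book's figure.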
Re: [dtrace-discuss] Concurrent Disk Activity from Solaris Performance Tools Chap. 4.17.4
Hey Paul - I should add that iostat -xnz 1 is a great method for determining how well the SAN is performing. The asvc_t times are disk IO service times in milliseconds. I usually start there to sanity-check disk IO times... Thanks, /jim Paul Clayton wrote: Hello.. Due to growing performance problems on numerous systems, I have been reading through a lot of information about DTrace and what it can find out about a system. It is a great tool, and while there is a learning curve to using it successfully, getting useful information quickly is helped by such books as the ‘Solaris Performance Tools’ book. In light of us also ramping up new SANs and ever larger SAN fabrics, I have long been wondering what the times are for getting data in/out of the SANs. If we talk with the SAN vendor, they say the SAN is not a problem and we should look elsewhere. Talk with the fabric folks and they say no problems, look elsewhere. So, it was with high interest that I have been reading Chapter 4 in the book about disk activity multiple times and trying various commands out on systems here. One thing I have been puzzled with a lot this weekend is the information and plot in Figure 4.7. This section, if I understand it correctly, offers the means to track the actual times from when an IO starts in the kernel to when it completes, implying the time to either read or write from disk or memory cache. I have been using a data file for an Oracle database as the test subject for this work. I have several confusion points with this section. • The graph mentions ‘strategy’ and ‘biodone’ which seem to imply TNF based data, not output from the DTrace script above Figure 4.7. • In looking at the data gotten from the DTrace script I see no way to generate the graph of Figure 4.7. Specifically the time difference between ‘0’ and the points for ‘strategy’. With the DTrace script we have the start time of the IO. I see no way to determine some amount of time between ‘start’ and something earlier. The time values on the ‘X’ axis also don’t fall out of the data generated by the DTrace script. • How can it be determined if the IO completed from a memory cache or required an I/O to a physical disk? I have a lot of times less than 0.5 ms but also have a fair number that are in the range of 1 ms to 300 ms. I modified the script to dump out the size of the I/O being done and it was interesting to see some unexpected lengths. Also added ‘start’ and ‘end’ to the appropriate lines as a sanity check, to make it easier to pair up the entries. Should always have one start/end pair for a block address. I have attached an Excel spreadsheet with what I was able to create based on the data collected. My thanks for any clarifications to these confusions. pdc ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] truss -fall equivalent in DTrace
What kind of system is this, and what release of Solaris? Enabling all the probes for all the function entry points in a process (pid$1:::entry) can take some time, and may make your terminal window appear hung, but it should not almost hang your system (unless you did this on a laptop or a small, single-CPU machine). If you have thread contention, use the plockstat provider. Run plockstat -V and save the output. The -V flag will generate the actual D that plockstat uses (to stderr, if I remember right). You can save that and use it as a building block for chasing which threads are hitting your contended locks. Of course, you may find the output of plockstat -A -p PID is all you need... Thanks, /jim [EMAIL PROTECTED] wrote: I tried this script (attached below) but it printed only 1 thread (shown below). When I tried :::entry, my system was almost hung. I think pstack is good enough for my purpose. :-) BTW any script to find out which two threads are causing lock contention/deadlock?
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
pid$1:::entry
/ this->thread_is_already_printed != 1 /
{
        this->thread_is_already_printed = 1;
        printf("thread %d: \n", tid);
        ustack(50);
}
$ ./pstack.d 16028
thread 11:
libc.so.1`_lwp_mutex_lock
libc.so.1`_lwp_cond_reltimedwait+0x78
libc.so.1`_lwp_cond_timedwait+0x1c
libjvm.so`__1cHMonitorEwait6Mil_i_+0x328
libjvm.so`__1cIVMThreadDrun6M_v_+0x1b4
libjvm.so`__1cG_start6Fpv_0_+0x208
libc.so.1`_lwp_start
--- On Fri, 11/14/08, Adam Leventhal [EMAIL PROTECTED] wrote: From: Adam Leventhal [EMAIL PROTECTED] Subject: Re: [dtrace-discuss] truss -fall equivalent in DTrace To: [EMAIL PROTECTED] [EMAIL PROTECTED] Cc: Mark Plotnick [EMAIL PROTECTED], dtrace-discuss@opensolaris.org Date: Friday, November 14, 2008, 7:32 PM On Fri, Nov 14, 2008 at 12:40:55AM -0800, [EMAIL PROTECTED] wrote: Can I get a pstack equivalent script using DTrace? You can use ustack() at any probe. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
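For the which-threads-are-contending question, a sketch using the plockstat provider directly (run as dtrace -s lock.d -p PID; mutex-block fires when a thread blocks on a held mutex):
#!/usr/sbin/dtrace -s
/* aggregate user stacks of threads blocking on contended mutexes in the target process */
plockstat$target:::mutex-block
{
        @blocked[ustack()] = count();
}
The hottest stacks in the output identify the waiters; correlating them with plockstat -A output shows the lock owners on the other side of the contention.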
Re: [dtrace-discuss] Dtrace command
That's not a query that can be answered in a forum like this. You need to do some reading, starting with the docs on the Wiki site (wikis.sun.com/dtrace). Go to blogs.sun.com, search for dtrace, and read. Go to http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Intro. Read through the material posted there. It's designed specifically to get folks like yourself started on DTrace. Invest some time, learn what DTrace is and what it can do. Once you've done that, if you have specific questions about the use of DTrace in a specific context, or even a general question, we are all happy to help. But we can't post an email response showing you the simple way to use DTrace. Thanks, /jim muthamilan wrote: Hi All, Thanks for your response. I'm working in an environment which has more than 50 Solaris 10 servers. I want to use this DTrace tool for troubleshooting and performance issues. So guide me on the simple way to use DTrace. Advance thanks. ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] plockstat: processing aborted: Abort due to systemic unresponsiveness
Check out; http://opensolaris.org/jive/thread.jspa?messageID=267138#267138 You may be tripping over bug 6507659 (tsc differences between CPUs give dtrace_gethrtime() serious problems). It looks like the fix went into 127112-03 (you're running -02). Best to install the latest patch, but you can work around this by enabling destructive actions. For plockstat, I think that's done as; plockstat -x destructive=1 -A -p Thanks, /jim Jianhua Yang wrote: Hello, I need help here about plockstat on the X86 platform (Sun X4600 AMD)
# plockstat -A -p 20034
plockstat: processing aborted: Abort due to systemic unresponsiveness
# plockstat -e 5 -s 10 -A -x bufsize=100k -x aggsize=20m -p 20034
plockstat: processing aborted: Abort due to systemic unresponsiveness
# ps -ef | grep 20034
algodev 20034 1 2 07:00:54 ? 86:17 /data/algodev/arina/PCM/7.6.3d/bin/galaxy/pcmsrv_wombat -INSTANCE INST_P_6 -fee
# uname -a
SunOS nygeqptsoruat1 5.10 Generic_127112-02 i86pc i386 i86pc
Thanks, James Yang ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] DTrace on Mac OS X
To All (This is mainly for the Mac DTrace 3, Adam Leventhal, Bryan Cantrill, Mike Shapiro).. We appreciate your desire to go to the source. It's a lot like posting a question on relativity, and indicating you'd really like an answer from Einstein :^) That said, there's great news - the DTrace community is growing all the time, and there are many, many DTrace experts that can help. So Brian, Mike and Adam (and Eric, and Brendan, and Jon), can spend more time catching up on People magazine, tracking alien landings in the National Enquirer, and playing FishPong :^) From what I have learned from DTrace is that you can monitor the system calls of either the OS probes or individual processes. Let's start with terminology here. System calls are not issued by the OS - application processes (threads) issue system calls to enter the OS for a privileged operation (e.g. open a file). Please clarify what you mean by OS probes - are you referring to DTrace probes, or OS X IOKit probes? Are you interested in dtrace'ing dtrace? I'm not breaking your you-know-whats. I'm sincerely interested in making sure we agree on terminology. The notion of OS probes issuing system calls is not clear to me. Also, on the Mac you are able to monitor Objective-C calls by specifying an individual Cocoa application. For my particular project I need to track 3 tasks: when any arbitrary application accesses a file, when any web page is accessed, and when any mail client accesses an email. My knowledge of DTrace tells me that I need to create probes for every application that can do these three types of tasks. Is there another approach that does not need to know every type of application or process that can do these 3 tasks? ... "any arbitrary application accesses a file" - Are you interested in tracking access to one specific file by any possible process? I assume the machine you're monitoring is running a WEB server, and an email server? There's a couple of ways to do this, depending somewhat on your configuration. For tracking access to a file, you can use the system call provider and a predicate on arg0, which is a pointer to a pathname string. For the second two items, I'd start with monitoring the httpd process and mail server process with the syscall provider to determine the syscall path to the patterns you're interested in. Once you've established that, you can whittle all this down to a relatively simple script that does what you want. Make sense? (In case it does not) - In other words, as a general methodology, I sometimes use dtrace to get a broad view of what a particular application process is doing, e.g. grab a system call profile of a httpd process, to establish the specific calls (and, in some cases, args) used when something of interest is happening (e.g. accessing a static WEB page). From there I use that information to create a dtrace script that is intended to hone in on that particular flow of activity. HTH, /jim I appreciate any feedback one could give. Thanks in advance, Blake ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
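A sketch of the file-access tracking Jim outlines, using the syscall provider on the open(2) path (the /etc/passwd path is purely an illustrative placeholder; on OS X there may be additional open variants to cover):
#!/usr/sbin/dtrace -s
/* record which processes open files; arg0 is the pathname pointer */
syscall::open:entry
{
        @opens[execname, copyinstr(arg0)] = count();
}
From there, a predicate comparing copyinstr(arg0) to the file of interest narrows the output to a single file, per Jim's suggestion.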
Re: [dtrace-discuss] DTrace on Mac OS X
and there are many, many DTrace experts that can help. So Brian, Mike and Sorry; s/Brian/Bryan ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] DTrace on Mac OS X
I'm not sure I can add anything beyond Bryan's response. I'm not an expert in WEB servers or the multimedia server stack (other software components that get integrated for building these cool sites), but I would first look at add-ons, compiler flags, options, etc, for the actual software components running the service. It's not unusual to find additional logging/tracking features that can be enabled. I see DTrace as a fit here to help you understand how to build the software you're interested in building (which is what Bryan suggested in his second reason response), but not as a component of the final application. If you have specific questions on how to do that, let's continue the dialogue. Thanks, /jim Blake Sawyer wrote: Jim, Thanks so much for your help. Also thanks for understanding my terminology mistakes, I am new to a lot of this. Let me clear some of these things up by explaining the idea of my project. I want to create an indexing application that will work with a location/context aware system. This application will automatically index media files (documents, video, audio, webpages, mail, etc.) socially and episodically. For example, in a meeting scenario, the system will track media files accessed and tag them with who attended the meeting and the time and location of the meeting. I am trying to prove that by giving a user the ability to reference files socially and episodically, they will be better at re-contextualization. You can think of it as a new personal information manager. Therefore, the purpose of using DTrace is not to run any kind of test on a web server or email server. I want this to run on an individual's computer and determine when meaningful media access occurs. It will then query the location/context aware system to tag the media appropriately. I am just unsure if DTrace is the right approach here. This is not really the main purpose of DTrace, but my professors and I agree that this may be a good starting point. As I mentioned, Mac supports AppleScript that can do this, though it is quite inefficient. It seems to me that if I want to create an application like this, I would need to monitor each application using DTrace. This approach would not be very extensible for future versions. I hope this clears up what I am trying to accomplish. Again, thanks so much for the help. -Blake On Sep 2, 2008, at 1:06 PM, Jim Mauro wrote: To All (This is mainly for the Mac DTrace 3, Adam Leventhal, Bryan Cantrill, Mike Shapiro).. We appreciate your desire to go to the source. It's a lot like posting a question on relativity, and indicating you'd really like an answer from Einstein :^) That said, there's great news - the DTrace community is growing all the time, and there are many, many DTrace experts that can help. So Brian, Mike and Adam (and Eric, and Brendan, and Jon), can spend more time catching up on People magazine, tracking alien landings in the National Enquirer, and playing FishPong :^) From what I have learned from DTrace is that you can monitor the system calls of either the OS probes or individual processes. Let's start with terminology here. System calls are not issued by the OS - application processes (threads) issue system calls to enter the OS for a privileged operation (e.g. open a file). Please clarify what you mean by OS probes - are you referring to DTrace probes, or OS X IOKit probes? Are you interested in dtrace'ing dtrace? I'm not breaking your you-know-whats. I'm sincerely interested in making sure we agree on terminology.
The notion of OS probes issuing system calls is not clear to me.

Also, on the Mac you are able to monitor Objective-C calls by specifying an individual Cocoa application. For my particular project I need to track 3 tasks: when any arbitrary application accesses a file, when any web page is accessed, and when any mail client accesses an email. My knowledge of DTrace tells me that I need to create probes for every application that can do these three types of tasks. Is there another approach that does not need to know every type of application or process that can do these 3 tasks?

... any arbitrary application accesses a file - Are you interested in tracking access to one specific file by any possible process? I assume the machine you're monitoring is running a web server and an email server? There are a couple of ways to do this, depending somewhat on your configuration. For tracking access to a file, you can use the system call provider and a predicate on arg0, which is a pointer to a pathname string. For the second two items, I'd start with monitoring the httpd process and mail server process with the syscall provider to determine the syscall path to the patterns you're interested in. Once you've established that, you can whittle all this down to a relatively small set of probes.
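For the file-tracking piece, a minimal sketch of the syscall-provider approach Jim describes (the pathname /etc/passwd is just an illustrative assumption, and only the open(2) entry points are instrumented - a complete tool would also watch stat, creat, and friends):

    # dtrace -n 'syscall::open*:entry
    /copyinstr(arg0) == "/etc/passwd"/
    {
        printf("%s (pid %d) opened %s\n", execname, pid, copyinstr(arg0));
    }'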
Re: [dtrace-discuss] Can DTrace tell me if my process is 32 or 64 bit?
You could use curpsinfo->pr_dmodel in a predicate:

probe
/ curpsinfo->pr_dmodel == 1 /
{
        /* 32-bit process */
}

probe
/ curpsinfo->pr_dmodel == 2 /
{
        /* 64-bit process */
}

/jim

Bruce Chapman wrote:

I had a simple D script that looks for use of the DP_POLL ioctl with a long timeout for any process on the system, but I could not find an easy way to determine whether to use dvpoll or dvpoll32 structures to do so. An ugly hack based upon pointer location/value works, but is clearly unacceptable (this was in a single probe, syscall::ioctl:entry /arg1 == 0xd001/):

    nfds = (dvpoll->dp_nfds > 100) ? dvpoll32->dp_nfds : dvpoll->dp_nfds;

Since I could find no way for DTrace to help me out here, I looked at the source code and found that dpioctl takes a mode argument with a bitmask that indicates the data model, so I modified the script to use that, which works and should be stable. But it seems DTrace should provide coders with its own way of determining the data model, to save people such gyrations. Also, though this script works fine on S10 FCS SPARC/x86 up to snv_86 (Indiana), it fails on snv_93 (image-updated Indiana - it works on a 32-bit kernel snv_95 machine, but I haven't found a 64-bit snv_95 to try it on) with:

    dtrace: failed to compile script ./ioctl_DP_POLL.d: line 25: operands have incompatible types: dblk_t != union

Is this a known DTrace bug in snv_93, or just more restrictive type checking (seems like a bug... if not, can someone get the script to work in snv_93)? Anyway, here's the script that works on all S10 versions through snv_86, but is more complicated than it would be if DTrace had a mechanism of letting me know if the binary being traced was 32- or 64-bit:

#!/usr/sbin/dtrace -s
#pragma D option quiet

struct dvpoll *dvpoll;
struct dvpoll32 *dvpoll32;

BEGIN
{
        printf("If timeout is -1 or big, you could get hit by bugid 6724237\n");
        prev_timeout = -2;
        prev_fd = prev_nfds = prev_pid = -1;
}

syscall::ioctl:entry
/arg1 == 0xd001/
{
        self->fd = arg0;
}

fbt:poll:dpioctl:entry
{
        self->bitsize64 = (arg3 & 0x0020);      /* model.h: DATA_MODEL_LP64 */
}

fbt:poll:dpioctl:entry
/self->bitsize64/
{
        dvpoll = copyin(arg2, sizeof (*dvpoll));
        self->nfds = dvpoll->dp_nfds;
        self->timeout = dvpoll->dp_timeout;
}

fbt:poll:dpioctl:entry
/!(self->bitsize64)/
{
        dvpoll32 = copyin(arg2, sizeof (*dvpoll32));
        self->nfds = dvpoll32->dp_nfds;         /* DTrace broken in SNV_93? */
        self->timeout = dvpoll32->dp_timeout;
}

fbt:poll:dpioctl:entry
/(self->fd != prev_fd) || (self->nfds != prev_nfds) || (self->timeout != prev_timeout) || (prev_pid != pid)/
{
        printf("%s pid=%d/%d calling ioctl(%d, DP_POLL, nfds=%d timeout=%d)\n",
            execname, pid, tid, self->fd, self->nfds, self->timeout);
        prev_pid = pid;
        prev_fd = self->fd;
        prev_nfds = self->nfds;
        prev_timeout = self->timeout;
        self->fd = self->nfds = self->timeout = self->bitsize64 = 0;
}

-- This message posted from opensolaris.org ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
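A runnable form of Jim's sketch, assuming you just want a one-shot answer for a given process (PID here is whatever process you want to check; the values 1 and 2 correspond to PR_MODEL_ILP32 and PR_MODEL_LP64):

    # dtrace -qn 'syscall:::entry
    /pid == $target/
    {
        printf("%s is a %s process\n", execname,
            curpsinfo->pr_dmodel == 2 ? "64-bit" : "32-bit");
        exit(0);
    }' -p PID

This fires on the target's next system call, prints once, and exits.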
Re: [dtrace-discuss] About memory usage by a specific process
prstat -s rss

The RSS column is the resident set size, or roughly the amount of memory being used by the process. The -s rss tells prstat to sort based on rss size.

HTH, /jim

YOUNSI RIADH wrote:

Hi, Using vmstat I noticed that free memory is getting lower during a certain period of time (2 or 3 days), then that memory is recovered. Is there a dtrace script that would help to identify the process that uses that memory? Thanks, Riadh YOUNSI

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
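For example (the process count and interval here are arbitrary choices):

    # prstat -s rss -n 10 5

shows the ten processes with the largest resident set sizes, refreshing every 5 seconds. Watching that over the 2-3 day window should show which process's RSS tracks the drop in free memory.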
Re: [dtrace-discuss] IO probes and forcedirectio
I don't understand the question. I see forcedirectio set as a mount option, so I would expect ufs:directio_start:entry to fire. /jim

Sébastien Bouchex Bellomié wrote:

Hi, The following script is working fine, as it displays the directio start message [...]

#!/usr/sbin/dtrace -s
#pragma D option quiet

fbt:ufs:directio_start:entry
{
        printf("directio start\n");
}

[...] However, looking at my mount line, I don't see anything special:

    /rdata on /dev/dsk/c2t1d1s0 read/write/setuid/devices/intr/forcedirectio/largefiles/logging/xattr/onerror=panic/dev=1d80008 on Thu Apr 24 15:41:27 2008

Thanks, Seb

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Friday, May 23, 2008 17:26 To: Sébastien Bouchex Bellomié Cc: dtrace-discuss@opensolaris.org Subject: Re: [dtrace-discuss] IO probes and forcedirectio

That's why I came to the conclusion that the io provider ignores forcedirectio-mounted partitions, but I wanted to be sure.

io:genunix::start is in bdev_strategy(), which is below the file system level. It is not concerned with any file system, or its mount options. I suspect that you may not be recognizing the probes as belonging to this file system. Also, if there is a small number of UFS file systems with this option set, you may try the directio_start:entry probe as a quick sanity check. HIH, Bhaskar

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
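One way to confirm whether the io provider probes are firing for that mount is to predicate on the fileinfo argument (a sketch; /rdata is the mount point from Seb's mount line, and fi_mount/fi_pathname are members of the io provider's fileinfo_t):

    # dtrace -n 'io:::start
    /args[2]->fi_mount == "/rdata"/
    {
        printf("%s: %s\n", execname, args[2]->fi_pathname);
    }'

If this prints, the io probes are seeing the forcedirectio traffic after all - it is just hard to recognize as such without the mount-point filter.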
Re: [dtrace-discuss] disp_getwork question
disp_getwork() is a kernel function that gets called from the idle loop (dispatcher - go get work, meaning find me a runnable thread). (usermode) simply means that the CPU(s) were running in user mode, not kernel mode, during the profile; lockstat has no visibility into what user functions were running during the sample. The best it can do is tell you the CPU(s) were not running in the kernel.

Try:

    # dtrace -n 'profile-997hz /arg0 != 0 && curthread->t_pri != -1/ { @[func(arg0)] = count(); }'

for a kernel profile that filters out the idle loop. /jim

Jianhua Yang wrote:

Hello, I collected lockstat statistics when CPU idle was less than 10; see below:

Profiling interrupt: 120600 events in 5.038 seconds (23940 events/sec)

Count indv cuml rcnt     nsec CPU+PIL        Caller
 3430   3%   3% 0.00     1225 cpu[3]+11      disp_getwork
 3378   3%   6% 0.00     1429 cpu[512]+11    disp_getwork
 3319   3%   8% 0.00     1254 cpu[515]+11    disp_getwork
 1407   1%  61% 0.00     1238 cpu[11]        (usermode)
  972   1%  62% 0.00     2664 cpu[18]        (usermode)
  953   1%  63% 0.00      833 cpu[528]       (usermode)

Questions: 1. What is disp_getwork? I guess this function comes from sched? If I trace the sched process functions for, say, 5 seconds, what is the performance overhead? 2. Caller = (usermode): what does usermode mean here? Why did it not trace out the functions?

Thanks, James Yang

--- This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
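Unlike lockstat, DTrace can also attribute the (usermode) samples. A sketch (997hz matches Jim's one-liner; in the profile probe arg1 is the user-mode program counter, so this fires only when a CPU is sampled in user mode):

    # dtrace -n 'profile-997hz /arg1 != 0/ { @[execname, ustack(5)] = count(); }'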
Re: [dtrace-discuss] dtrace: error invalid address (0x1f257c000) in action #2
Read this thread: http://www.opensolaris.org/jive/thread.jspa?messageID=10250

Thanks, /jim

Jianhua Yang wrote:

Hello, I'm new to dtrace and need help here. When I run the following dtrace, it produces errors - why does it get such errors?

# dtrace -n 'pid$target:libc:malloc:entry { @j[jstack()] = count(); }' -p 11539
dtrace: description 'pid$target:libc:malloc:entry ' matched 1 probe
dtrace: error on enabled probe ID 1 (ID 112884: pid11539:libc.so.1:malloc:entry): invalid address (0x1f257c000) in action #2
dtrace: error on enabled probe ID 1 (ID 112884: pid11539:libc.so.1:malloc:entry): invalid address (0x1f24f8000) in action #2
[... the same two errors repeat many more times ...]
^C
dtrace: error on enabled probe ID 1 (ID 112884: pid11539:libc.so.1:malloc:entry): invalid address (0x1f257c000) in action #2
dtrace: error on enabled probe ID 1 (ID 112884: pid11539:libc.so.1:malloc:entry): invalid address (0x1f24f8000) in action #2

              libc.so.1`malloc
              libESPD12d.so`__1cPCFETUByteStream2t5B6MIIHHpc_v_+0xa4
              libESPD12d.so`__1cMCFETPMessageNCreateInbound6Fpci_p0_+0x618
              libESPD12d.so`__1cMCFETISessionFRXMsg6M_pnMCFETPMessage__+0x60
              libESPD12d.so`__1cMCFETIManagerPInboundDataFlow6M_v_+0x178
              libESPD12d.so`__1cYStartInboundDataFlowFuncDrun6M_v_+0xc
              libESPD12d.so`PosixThreadFunction+0x24
              libc.so.1`_lwp_start
                1

Thanks, James Yang

--- This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
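A frequently suggested remedy for jstack() "invalid address" errors is giving jstack() more string space for the Java frames via the jstackstrsize option. A sketch of what to try (2048 is an arbitrary starting point, not a magic number):

    # dtrace -x jstackstrsize=2048 -n 'pid$target:libc:malloc:entry { @j[jstack()] = count(); }' -p 11539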
Re: [dtrace-discuss] RES: RES: Process in LCK / SLP (Please)
You may want to cross-post to a Java alias, but I've been down this road before. Java will call into malloc() for buffers for network reads and writes that are larger than 2k bytes (the 2k is from memory, and I think it was a 1.5 JVM). A large number of malloc calls, and the resulting contention on locks in the library, are due to the application doing network writes larger than 2k. Newer JVMs (1.6) may improve this, but I'm not sure. There's also an alternative set of classes and methods, NIO, which can also help (although I've heard tell that NIO brings other problems along with it; I can not speak from personal experience). At this point, I think you need to consult with Java experts to determine what options you have for buffer allocation for network IO from the Java heap, versus the current behavior of the JVM dropping back to malloc for allocating buffers for network IO. The other option, of course, is determining if the code can be changed to use buffers smaller than 2k.

Thanks, /jim

Kleyson Rios wrote:

OK Jonathan, I understand. So, looking in the right place now, I can see few locks and sometimes no locks (just Mutex Hold). But I still have many threads at 100% LCK. If I don't have a lot of locks, where is my problem? Running rickey c weisner's script I get:

(...)
25736 libc.so.1`_so_send+0x15 libjvm.so`__1cDhpiEsend6Fipcii_i_+0x67 libjvm.so`JVM_Send+0x32 libnet.so`Java_java_net_SocketOutputStream_socketWrite0+0x131 0xc3c098d3 10
25736 0xc3d2a33a 14
25736 libc.so.1`_write+0x15 libjvm.so`__1cDhpiFwrite6FipkvI_I_+0x5d libjvm.so`JVM_Write+0x30 libjava.so`0xc8f7c04b 16
25736 libc.so.1`stat64+0x15 21
25736 libc.so.1`_write+0x15 libjvm.so`__1cDhpiFwrite6FipkvI_I_+0x5d libjvm.so`JVM_Write+0x30 libjava.so`0xc8f80ce9 76

java 25736 kernel-level lock 1
java 25736 shuttle 6
java 25736 preempted 7
java 25736 user-level lock 511
java 25736 condition variable 748

Regards, -- Kleyson Rios. Technical Support Management, Support Analyst / Team Leader

-Original Message- From: Jonathan Adams [mailto:[EMAIL PROTECTED]] Sent: Friday, April 18, 2008 15:40 To: Kleyson Rios Cc: dtrace-discuss@opensolaris.org Subject: Re: [dtrace-discuss] RES: Process in LCK / SLP (Please)

On Apr 18, 2008, at 1:03 PM, Kleyson Rios wrote:

Hi przemol, Below is the output of plockstat for malloc and libumem. Both show many locks. Why, after changing to libumem, didn't I get fewer locks?

You're looking at Mutex Hold statistics, which don't mean a lot (unless contention is caused by long hold times). The important thing for multi-threaded performance is *contention* (spinning and blocking). Those are the statistics you should be looking at. Both malloc and libumem use locks to protect their state; libumem just uses many locks, in order to reduce contention.

Cheers, - jonathan

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
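To test Jim's 2k theory against this particular JVM, quantizing the requested malloc sizes is a quick check (a sketch; 25736 is the pid from the output above):

    # dtrace -n 'pid$target:libc:malloc:entry { @["malloc size (bytes)"] = quantize(arg0); }' -p 25736

A distribution with most requests landing just above 2048 would point straight at the large network writes.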
Re: [dtrace-discuss] Java Process in LCK/SLP
Try linking the JVM to libumem.so.1 (export LD_PRELOAD=/usr/lib/libumem.so.1). I have, in the past, been able to reduce lock contention in malloc using libumem's malloc. If that does not help, you need to see where the mallocs are coming from in the code, and see if there's an opportunity to change the app such that the JVM makes fewer malloc calls:

    dtrace -n 'pid$target:libc:malloc:entry { @s[jstack()] = count() }'

500 threads - does your app need that many threads? Are the locks mostly associated with condition variables? Typically, the only way to solve lock contention problems is by changing the code...

HTH, /jim

Kleyson Rios wrote:

Hi, I am a newbie with dtrace. There is a multi-threaded application running on my box, but looking at prstat -Lmp output for java I saw that almost all time is LCK and SLP. I ran plockstat and saw that most locks were in malloc, so I changed my application to use mtmalloc, but that didn't resolve the problem. After changing malloc to mtmalloc, I ran plockstat again and I'm still getting many locks. How can I make better use of dtrace to find the problem?

PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
5637 root 94 0,0 0,0 0,0 0,0 4,2 0,0 1,4 19 96 61 0 java/7
5637 root 1,3 0,2 0,0 0,0 0,0 98 0,0 0,2 2K 7 2K 0 java/5
5637 root 1,1 0,2 0,0 0,0 0,0 98 0,0 0,3 3K 16 3K 0 java/3
5637 root 1,1 0,1 0,0 0,0 0,0 98 0,1 0,4 1K 6 1K 0 java/4
5637 root 1,1 0,1 0,0 0,0 0,0 98 0,0 0,4 1K 10 1K 0 java/6
5637 root 0,7 0,1 0,0 0,0 0,0 33 65 0,8 11 17 129 0 java/347
5637 root 0,7 0,1 0,0 0,0 0,0 65 34 0,0 20 0 129 0 java/134
5637 root 0,6 0,1 0,0 0,0 0,0 93 5,7 0,2 21 1 133 0 java/84
5637 root 0,4 0,1 0,0 0,0 0,0 49 51 0,1 20 1 135 0 java/435
5637 root 0,5 0,1 0,0 0,0 0,0 14 85 0,1 16 0 87 0 java/162
5637 root 0,5 0,0 0,0 0,0 0,0 23 76 0,8 3 2 44 0 java/291
5637 root 0,4 0,1 0,0 0,0 0,0 36 63 0,6 9 5 81 0 java/412
5637 root 0,4 0,1 0,0 0,0 0,0 52 46 1,0 9 0 70 0 java/370
5637 root 0,3 0,1 0,0 0,0 0,0 81 17 1,2 9 4 75 0 java/376
5637 root 0,3 0,1 0,0 0,0 0,0 37 63 0,0 8 0 77 0 java/174
5637 root 0,3 0,1 0,0 0,0 0,0 42 57 0,1 7 1 61 0 java/145
5637 root 0,3 0,1 0,0 0,0 0,0 43 56 0,9 12 11 53 0 java/114
5637 root 0,2 0,1 0,0 0,0 0,0 2,6 96 0,8 8 0 38 0 java/269
5637 root 0,3 0,0 0,0 0,0 0,0 98 0,1 1,1 10 0 38 0 java/361
5637 root 0,2 0,1 0,0 0,0 0,0 2,4 97 0,0 6 2 37 0 java/81
5637 root 0,2 0,1 0,0 0,0 0,0 84 14 1,2 6 7 57 0 java/292
5637 root 0,2 0,1 0,0 0,0 0,0 2,5 97 0,3 5 1 33 0 java/192
5637 root 0,2 0,1 0,0 0,0 0,0 85 14 0,8 7 8 54 0 java/212
5637 root 0,2 0,1 0,0 0,0 0,0 3,2 96 0,2 6 6 36 0 java/149
5637 root 0,2 0,0 0,0 0,0 0,0 42 58 0,1 7 1 50 1 java/411
5637 root 0,2 0,1 0,0 0,0 0,0 2,3 97 0,3 8 3 38 0 java/236
5637 root 0,2 0,1 0,0 0,0 0,0 43 56 0,9 10 4 44 0 java/438
5637 root 0,2 0,0 0,0 0,0 0,0 2,5 97 0,1 5 1 34 0 java/137
5637 root 0,2 0,0 0,0 0,0 0,0 2,2 98 0,0 7 1 34 0 java/179
5637 root 0,1 0,1 0,0 0,0 0,0 2,5 97 0,6 1 4 111 0 java/20
5637 root 0,2 0,0 0,0 0,0 0,0 91 8,2 0,2 4 8 48 0 java/72
5637 root 0,2 0,0 0,0 0,0 0,0 3,1 97 0,2 6 2 35 0 java/75
5637 root 0,2 0,0 0,0 0,0 0,0 21 78 0,2 3 2 38 0 java/109
5637 root 0,2 0,0 0,0 0,0 0,0 89 9,9 0,9 3 6 36 0 java/256
5637 root 0,2 0,0 0,0 0,0 0,0 99 0,0 1,0 4 8 39 1 java/458
Total: 1 processes, 489 lwps, load averages: 1,92, 1,74, 1,65

Regards, -- Kleyson Rios. Technical Support Management, Support Analyst / Team Leader, Goiás State Government, Administration Agency, IT Directorate, +55 62 3201-6582

Do you know the difference between successful people and unsuccessful people?
Successful people are willing to do the things that unsuccessful people are not.

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
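Putting Jim's two suggestions together (the java launch command is a hypothetical stand-in for however the app is actually started, and `pgrep -n java` just grabs the newest java process; note that once libumem is preloaded, malloc resolves into libumem.so.1, so the probe moves there too):

    $ LD_PRELOAD=/usr/lib/libumem.so.1 java -jar app.jar &
    # dtrace -n 'pid$target:libumem.so.1:malloc:entry { @s[jstack()] = count() }' -p `pgrep -n java`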
Re: [dtrace-discuss] howto measure CPU load
Hi Travis - your first clue here is the backtick operator (`) used to extract hp_avenrun[0]. The backtick operator is used to read the value of kernel variables, which will be specific to the running kernel. That is, Solaris, Mac OS X (Darwin), FreeBSD and all other kernels with DTrace will not have portability when it comes to dtrace scripts that use backtick variables, since by definition they are kernel specific. hp_avenrun[] is a Solaris kernel variable:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/clock.c#97

For some kernel variables (like avenrun), you could look around for something similar in your target kernel:

kilroy> nm -x /mach_kernel | grep avenrun
004d25ac 0f 08 0002dbb9 _avenrun
kilroy>

Of course, without the source, it's not clear what the variable type is. So your example below is specific to Solaris. If you see the backtick operator, think kernel specific.

HTH, /jim

Travis Crawford wrote:

Hi dtrace-discuss, Having recently upgraded to OS X 10.5 I'm relatively new to DTrace, so apologies if this has been covered before; I did search the archives and didn't find anything though. As a DTrace starter project I'd like to understand how to extract common performance metrics before diving too far in. One of these common metrics is load average, which I thought would be pretty easy. However, I'm confused by this code snippet (from http://www.brendangregg.com/DTrace/iotop):

/* fetch 1 min load average */
this->load1a = `hp_avenrun[0] / 65536;
this->load1b = ((`hp_avenrun[0] % 65536) * 100) / 65536;

I can't figure out where hp_avenrun is defined. I searched http://wikis.sun.com/display/DTrace as well as looking at all the probes on my system ($ sudo dtrace -l | grep hp_avenrun) and came up empty. So my question is, when reading D scripts, how should I go about finding out what these mysterious variables are?

Thanks, Travis

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
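On Solaris you can see the same fixed-point arithmetic in action with a one-shot script (a sketch that simply mirrors the iotop snippet above):

    # dtrace -qn 'BEGIN
    {
        printf("1-min load average: %d.%02d\n",
            `hp_avenrun[0] / 65536,
            ((`hp_avenrun[0] % 65536) * 100) / 65536);
        exit(0);
    }'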
Re: [dtrace-discuss] Can DTrace display non-instrumented function argument counts and types?
For kernel modules, _if_ the supplied kernel code (driver, file system, whatever) was compiled with symbolic information, you can determine the argument list and types with mdb:

# mdb -k
Loading modules: [ unix genunix specfs dtrace uppc pcplusmp scsi_vhci ufs mpt ip hook neti sctp arp usba fctl nca lofs zfs random audiosup sppp ptm crypto ipc md cpc fcip fcp logindmux nsctl sdbc sv ii rdc ]
> ufs_read::nm -f ctype
C Type
int (*)(struct vnode *, struct uio *, int, struct cred *, struct caller_context *)

The example above displays the arg list for the kernel ufs_read() function. You can also determine structure members:

> ::print -t struct vnode
{
    kmutex_t v_lock {
        void *[2] _opaque
    }
    uint_t v_flag
    uint_t v_count
    void *v_data
    struct vfs *v_vfsp
    . . .

If that works (meaning the kernel module was indeed compiled with symbolic type information), you can use dtrace to instrument the entry points in the kernel functions of interest, and grab arguments using either the args[] array (typed data) or arg0, arg1, etc. (int64_t's).

HTH, /jim

Gordon Marler wrote:

Is it possible to use DTrace to display the number of arguments and their types for userland or kernel functions that are not explicitly instrumented or documented by their authors? We're talking about functions provided by Veritas VxVM/VxFS and the like. For example, we might want to try tracing I/Os from App -> VxFS -> VxVM -> ssd driver to see where I/Os get aggregated or broken up. But without being able to determine function signatures, it will be very hard to even begin such a process. It would seem that this is possible, but I haven't seen any examples of such a use. I'm presuming that this would be done via fbt. Gordon Marler [EMAIL PROTECTED] -- This message posted from opensolaris.org

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
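Once mdb confirms the signature, a typed fbt probe follows directly. A sketch using the ufs_read() example above (v_path is the Solaris vnode's cached pathname; it can be NULL, hence the predicate):

    # dtrace -n 'fbt:ufs:ufs_read:entry
    /args[0]->v_path != NULL/
    {
        printf("%s: read of %s, %d bytes requested\n",
            execname, stringof(args[0]->v_path), args[1]->uio_resid);
    }'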
Re: [dtrace-discuss] Getting total memory allocated so far
I hate to beat this to death, but it really sounds like you can make your life simple and use prstat and pmap to track the information you want. Heap memory usage tracking is tricky, since libc malloc may or may not call into the kernel (sbrk) to satisfy a memory request. Also, tracking malloc and keeping a running total in dtrace is dead easy, but tracking free and correlating to how much space was freed is tricky, since free takes an address, not a size. You could build an array of addresses in a pid<PID>:libc:malloc:return probe clause, but making use of it to track free's and sizes is tricky... It really all gets down to the level of granularity and accuracy that you require. prstat and pmap will provide the information you need with reasonable accuracy - pmap will allow you to track heap segment sizes (pmap -x PID | grep heap) - and you could easily script something with pmap and generate a summary report. Finally, as Rickey pointed out, you could also write some C code that uses procfs and grabs the desired data.

Thanks, /jim

Lally Singh wrote:

@Jeremy: yeah, I'm tracking how much memory it's actually using right now. sbrk makes it complicated, but I figure that more advanced memory management may help, once I've measured how it's getting used today - via mmap() for large transient allocations, and/or some smartness about allocation to keep the working set of pages reasonable.

@James: I saw that, but could I use it on a running program without stopping it? The dtrace scripts don't do much damage, and the program still gets all its work done at ~60 Hz for its clients. Single threaded, no less.

Some more details: I was trying to keep this all a simple question, but with memory there rarely are simple questions. My research is on the scalability of massively multiplayer online games. So, I've written a load simulator which logs in from remote computers and starts playing the game. I've got a bunch of dtrace scripts that measure the network, memory, and cpu usage as I add more simulated clients. The scripts haven't shown themselves to interfere significantly with performance (yet), but in exchange they put out data once a second, which another script reads and shoves into a db for me. I'd like to monitor memory usage without interfering with my other measurements. Hence, I'd love to know something close to a global var (which'd be great - malloc() already has to do some locking (for normal malloc; haven't seen umem), so why not?) or some way to quickly assemble a running total. Right now all I've got is a pid provider probe for malloc and free that reads the headers on the memory blocks they use.

Thanks in advance everyone!

On Jan 2, 2008 1:25 PM, James Carlson [EMAIL PROTECTED] wrote:

Lally Singh writes:

I'd like to track the total amount of memory the application's actually using right now. More specifically, I guess these rules would do it:
1. start at M=0
2. for each call to malloc(): set M += size of block returned by malloc
3. for each call to free(): set M -= size of block freed.

Using env LD_PRELOAD=libumem.so myapp with mdb's umem support should get you that (plus quite a bit more).

-- James Carlson, Solaris Networking [EMAIL PROTECTED] Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084 MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
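The address-array approach Jim describes might look like the sketch below (run as ./malloctrack.d -p PID - the script name is hypothetical). A sketch only: the global associative array is exactly the kind of dynamic-variable consumer that can see drops under heavy malloc load, global updates aren't atomic across CPUs, and realloc is deliberately ignored, so treat the totals as approximate:

#!/usr/sbin/dtrace -s
#pragma D option quiet

pid$target:libc:malloc:entry
{
        self->sz = arg0;                /* requested size */
}

pid$target:libc:malloc:return
/self->sz/
{
        total += self->sz;              /* running total of live bytes */
        sizes[arg1] = self->sz;         /* remember size by returned address */
        self->sz = 0;
}

pid$target:libc:free:entry
/sizes[arg0]/
{
        total -= sizes[arg0];
        sizes[arg0] = 0;                /* zeroing releases the dynamic variable */
}

tick-1sec
{
        printf("%d bytes live\n", total);
}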
Re: [dtrace-discuss] Solaris Dynamic Tracing Guide gets Wikified!
Great question. As a dtrace user and documentation reader, I would not want to need to flip to another chapter, or another section, to read about platform differences for a particular provider, function, etc. I'm not saying you suggested that, I'm just thinking out loud... I think a reasonable way to do this, and maintain consistency, is to add a Platform Notes heading to each section - or perhaps at a chapter granularity. Within Platform Notes, we'd have subsections for Mac OS X, FreeBSD, etc., and add the platform-specific notes there. It may also be useful to add a chapter that discusses platform differences at a high level. Just some thoughts... if the platform differences are not vast enough to warrant format changes to the document, perhaps they can be handled in footnotes (keeping in mind footnotes are considered poor writing style).

Thanks, /jim

Quinn wrote:

At 10:23 -0800 28/11/07, Adam Leventhal wrote:

On Tue, Nov 27, 2007 at 07:46:37PM +0000, Jon Haslam wrote:

To gain access to the DTrace wiki space contact Adam ([EMAIL PROTECTED]). If you are working on a bug from the bug list, make sure you update the owner column of the bug to reflect that fact. When you've finished, please update the column in the bug list to reflect that fact.

Actually, there's no need to contact me or anyone: the wiki is open for all registered users to modify. We'll tighten up security if we have any problems.

How do you want to handle platform-specific changes? As I stumble across Mac OS X specific stuff, I'd like to add it to the book. To me that makes a lot more sense than maintaining a separate DTrace on Mac OS X document. For example, the discussion of the stop action says that you can start a process using prun, but that's not available on Mac OS X. On Mac OS X you can use kill -CONT pid. I'd love to just go in there and add a note to that effect, but I figured I'd ask about formatting conventions first.

S+E

___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org
[dtrace-discuss] DTrace updates
http://blogs.sun.com/jonh/ The esteemed Jon Haslam posted some slides with examples of new features that went into DTrace since Solaris 10 was released. Good stuff, and a tip-o-the-hat to Jon. /jim ___ dtrace-discuss mailing list dtrace-discuss@opensolaris.org