You're trying to do something like these? http://dtrace.org/blogs/brendan/page/14/
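
A minimal sketch of one way to bridge the gap, assuming the systemtap output has already been reduced to (process, path, latency) records: treat each path element of the written file as a "frame", so the folded-stack text format that flamegraph.pl reads can be produced without real stack traces. The function name, the record layout, and the sample paths are all hypothetical; only the folded format itself (semicolon-separated frames, a space, then a number) comes from the FlameGraph toolkit.

```python
from collections import defaultdict


def fold_io_by_path(records):
    """Sum write latency per (process, path) and emit folded-stack lines.

    records: iterable of (process_name, file_path, latency_us) tuples.
    This layout is an assumption, not the format of any particular
    systemtap script.
    """
    totals = defaultdict(int)
    for process, path, latency_us in records:
        # The process name becomes the bottom frame; each path element
        # becomes one frame stacked on top of it.
        frames = [process] + [part for part in path.split("/") if part]
        totals[";".join(frames)] += int(latency_us)
    # One line per unique stack: "frame1;frame2;... total"
    return [f"{stack} {total}" for stack, total in sorted(totals.items())]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("rrdcached", "/var/lib/rrdcached/journal/rrd.journal", 120),
        ("rrdcached", "/var/lib/rrdcached/journal/rrd.journal", 80),
        ("rrdtool", "/var/lib/rrd/host1/load.rrd", 40),
    ]
    for line in fold_io_by_path(sample):
        print(line)
```

Piping the printed lines into flamegraph.pl should then give one tower per process, with widths proportional to the summed write latency at each directory level.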
On Tue, Mar 11, 2014 at 1:46 PM, Florian Heigl <[email protected]> wrote:
> Thanks Elijah for the explanations!
>
> rrdtool etc. won’t cut it unfortunately - at least that’s what I think.
>
> I would like to trace activity on a whole server, for about 2 hours - and for
> all/most processes it has running...
> Nagios being involved means:
> tracking what a few 100k processes did during their short lifetime.
> (i.e. afaik RRDtool will be run 1400x30 times during that period _with_
> rrdcached enabled.)
> RRDcached will update / flush its journal many many times during that time.
> Check_MK will do a lot of different things even under the same name
> (inventory updates, precompiling / linking, and be called by nagios)
>
> The end result I’m after would show (example):
> 2% of IO fell to Check_MK cleaning up the auto checks
> 56% were updates of the RRD journal
> 5% were actual RRD updates
> -> indicating misconfiguration, and that it’s better to fix RRDcached & tune
> for sequential IO than worry about the RRD IO grinder.
>
> I suppose I should split this into two problems:
> 1.
> Normalize / Better formatting of the output, so that I get a CSV like file,
> with tricks like full and split path
>
> 2.
> Hire someone who knows R to do reporting / graphs… (OK, that can be done
> using RRD but you have no flexibility querying this)
>
>
> The way I understand it, it would still be re-usable for others that way.
>
>
>
> On 11.03.2014, at 16:11, Elijah Wright <[email protected]> wrote:
>
>> Hi Florian,
>>
>> Pretty sure you're going to need more data; Brendan's scripts are
>> expecting *stack traces*, not just the latency numbers and the source
>> process - the stacks are where the 'layers' in the flame graph
>> visualizations come from.
>>
>> If you're just collecting the disk latency numbers, and not the
>> function hierarchy of the process, you might want to just use graphite
>> or rrdtool or something to plot that data - it should be pretty
>> understandable, but I don't think you'll have something as useful as
>> the flame graph output might be. You really want to know *which part*
>> of some process is jamming away at the disk - not just that it is
>> happening, the process name, and when.
>>
>> [This sort of correlation - "what program feature on my system is
>> making disk latency blow chunks" - is extremely useful and just beyond
>> the edge of what most people's monitoring tools and approaches can
>> deal with. There's a really good reason that the flame graph pages
>> are littered with DTrace code... ;-) ]
>>
>> best,
>>
>> --e
>>
>>
>>
>> On Mon, Mar 10, 2014 at 9:33 PM, Florian Heigl <[email protected]>
>> wrote:
>>> Hi,
>>>
>>> I'm trying to get some dependable data on Nagios IO.
>>> Nagios does a lot of disk IO, which is known, but there are no hard numbers to
>>> it.
>>> It gets especially for systems that _have_ best practices applied:
>>> - rrdcached is running, volatile data is written to a RAM disk, etc.
>>>
>>> My current approach is using systemtap and collecting only write accesses
>>> and their latencies.
>>>
>>> This I have, using the sys call to IO probe here:
>>> https://sourceware.org/systemtap/examples/keyword-index.html#FILE
>>> ...and grep, since I don't really understand all of it.
>>>
>>> To turn it into something more worthwhile that can be used by more people
>>> and show results easily,
>>> I want to use the flame graph thing as described at
>>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
>>>
>>> The whole toolkit seems to be able to work with systemtap.
>>>
>>> The problem:
>>> I'm apparently just too stupid. I don't know how to get started.
>>> I do not remotely grasp how to take the flamegraph git repo and the script I
>>> have and make them do "something"
>>> (something being, a sort on IO time spent per path element of the files
>>> written to)
>>>
>>> Did any of you try something similar?
>>> Did any of you work with flame graphs and can give some advice?
>>>
>>>
>>> Florian
>>>
>>> _______________________________________________
>>> Discuss mailing list
>>> [email protected]
>>> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
>>> This list provided by the League of Professional System Administrators
>>> http://lopsa.org/
>>>

--
Jesse Becker
