Re: [lopsa-discuss] Systemtap / Dtrace and those great looking graphs

Jesse Becker Tue, 11 Mar 2014 14:06:12 -0700

You're trying to do something like these?
  http://dtrace.org/blogs/brendan/page/14/
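
If so, here is a rough sketch of one way to get from the CSV-like file you describe below to something flamegraph.pl can read. As far as I know, flamegraph.pl takes "folded" lines: frames joined by semicolons, then a space and a number. Nothing requires the frames to be function names, so the process name plus each path element can stand in for a stack, with the summed write latency as the number. The column names (process, path, latency_us) and the sample rows are made up for illustration; your systemtap output will need its own parsing.

```python
import csv
import io
from collections import Counter


def fold_io_rows(rows):
    """Sum write latency per (process, path) and emit folded-stack lines."""
    totals = Counter()
    for row in rows:
        # Split the path into elements; drop empty parts from leading or doubled "/".
        parts = [p for p in row["path"].split("/") if p]
        frames = [row["process"]] + parts
        # Semicolons separate frames in the folded format, so replace any in names.
        key = ";".join(f.replace(";", "_") for f in frames)
        totals[key] += int(row["latency_us"])
    return [f"{stack} {total}" for stack, total in sorted(totals.items())]


# Made-up sample data; real rows would come from the systemtap output.
SAMPLE = """process,path,latency_us
rrdcached,/var/lib/rrdcached/journal/rrd.journal,120
rrdcached,/var/lib/rrdcached/journal/rrd.journal,80
rrdtool,/var/lib/nagios/rrd/host1/load.rrd,40
"""

folded = fold_io_rows(csv.DictReader(io.StringIO(SAMPLE)))
print("\n".join(folded))
```

Saving that output to a file and running "./flamegraph.pl folded.txt > io.svg" from the FlameGraph repo should then draw box widths proportional to write latency per path element rather than to CPU samples. It will not tell you which code path issued the writes; for that you would still need real stack traces.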





On Tue, Mar 11, 2014 at 1:46 PM, Florian Heigl <[email protected]> wrote:
> Thanks Elijah for the explanations!
>
> rrdtool etc. won’t cut it unfortunately - at least that’s what I think.
>
> I would like to trace activity on a whole server, for about 2 hours - and for 
> all/most processes it has running...
> Nagios being involved means:
> tracking what a few 100k processes did during their short lifetime.
> I.e. afaik RRDtool will be run 1400x30 times during that period, even _with_ 
> rrdcached enabled.
> RRDcached will update / flush its journal many, many times during that time.
> Check_MK will do a lot of different things even under the same name 
> (inventory updates, precompiling / linking, and being called by Nagios).
>
> The end result I’m after would show (example):
> 2% of IO fell to Check_MK cleaning up the auto checks
> 56% were updates of the RRD journal
> 5% were actual RRD updates
> -> indicating misconfiguration, and that it’s better to fix RRDcached & tune 
> for sequential IO than worry about the RRD IO grinder.
>
> I suppose I should split this into two problems:
> 1.
> Normalize / better format the output, so that I get a CSV-like file, 
> with tricks like including both the full and the split path
>
> 2.
> Hire someone who knows R to do reporting / graphs… (OK, that can be done 
> using RRD but you have no flexibility querying this)
>
>
> The way I understand it, it would still be re-usable for others that way.
>
>
>
> On 11.03.2014, at 16:11, Elijah Wright <[email protected]> wrote:
>
>> Hi Florian,
>>
>> Pretty sure you're going to need more data;  Brendan's scripts are
>> expecting *stack traces*, not just the latency numbers and the source
>> process - the stacks are where the 'layers' in the flame graph
>> visualizations come from.
>>
>> If you're just collecting the disk latency numbers, and not the
>> function hierarchy of the process, you might want to just use graphite
>> or rrdtool or something to plot that data - it should be pretty
>> understandable, but I don't think you'll have something as useful as
>> the flame graph output might be.  You really want to know *which part*
>> of some process is jamming away at the disk - not just that it is
>> happening, the processname, and when.
>>
>> [This sort of correlation - "what program feature on my system is
>> making disk latency blow chunks" - is extremely useful and just beyond
>> the edge of what most people's monitoring tools and approaches can
>> deal with.  There's a really good reason that the flame graph pages
>> are littered with DTrace code... ;-) ]
>>
>> best,
>>
>> --e
>>
>>
>>
>> On Mon, Mar 10, 2014 at 9:33 PM, Florian Heigl <[email protected]> 
>> wrote:
>>> Hi,
>>>
>>> I'm trying to get some dependable data on Nagios IO.
>>> Nagios does a lot of disk IO, which is known, but there are no hard numbers
>>> for it.
>>> That holds especially for systems that _have_ best practices applied:
>>> - rrdcached is running, volatile data is written to a RAM disk, etc.
>>>
>>> My current approach is using systemtap and collecting only write accesses
>>> and their latencies.
>>>
>>> This I have, using the sys call to IO probe here:
>>> https://sourceware.org/systemtap/examples/keyword-index.html#FILE
>>> ...and grep, since I don't really understand all of it.
>>>
>>> To turn it into something more worthwhile that can be used by more people
>>> and show results easily,
>>> I want to use the flame graph thing as described at
>>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
>>>
>>> The whole toolkit seems to be able to work with systemtap.
>>>
>>> The problem:
>>> I'm apparently just too stupid. I don't know how to get started.
>>> I do not remotely grasp how to take the flamegraph git repo and the script I
>>> have and make them do "something"
>>> (something being a sort on IO time spent per path element of the files
>>> written to)
>>>
>>> Did any of you try something similar?
>>> Did any of you work with flame graphs and can give some advice?
>>>
>>>
>>> Florian
>>>
>>> _______________________________________________
>>> Discuss mailing list
>>> [email protected]
>>> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
>>> This list provided by the League of Professional System Administrators
>>> http://lopsa.org/
>>>
>



-- 
Jesse Becker
