Re: [Openais] [RFC] simple blackbox

Steven Dake Thu, 09 Oct 2008 08:56:58 -0700

On Thu, 2008-10-09 at 08:44 +0200, Lars Marowsky-Bree wrote:
> On 2008-10-08T21:08:16, Steven Dake <[EMAIL PROTECTED]> wrote:
> 
> > Attached is my version which is as of yet incomplete.  The general
> > concept is to allow very high performance event tracing with minimal
> > formatting overhead (formatting is done in a separate program after a
> > crash or to debug current program state).  I'd also like to get rid of
> > the critical section code in the current logsys as much as possible.
> > 
> > It uses a chunked circular word (32 bit) buffer of arbitrary length to
> > store data records.  It allows a variable length list of variable length
> > arguments.  It records the subsystem name, filename, function name, line
> > number, log identification used for replay output mapping, and record
> > number as well as the arguments and their lengths.  The implementation
> > only ever makes one copy of data at any time and copies/stores most
> > things in 1 word operations.  memcpy sucks on x86 for small memory
> > segments, which is why there is a C implementation of memcpy included.
> 
> I am not entirely sure I agree with this design.
> 
> The goal was to have a blackbox which we can not just retroactively dump,
> but also easily recover from a core or kernel crashdump.
>


The array is global and can easily be extracted from a core file with a
simple script.

> I am not entirely convinced that that is the right time to start using
> pointer-heavy structures such as atomized strings, a separate memory
> allocator, etc.
> 
There are only two indexes, and the code doesn't really contain a memory
allocator, just a garbage collector.
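To make the "two indexes plus a garbage collector" idea concrete, here is a
minimal sketch, not the attached implementation.  All names (ring, head,
tail, used, reclaim, ring_put, ring_count) and the tiny buffer size are
assumptions for illustration, and the used counter is a convenience added
on top of the two indexes.  Each record starts with its length in words;
when space runs out, whole records are dropped from the oldest end.

```c
#include <assert.h>
#include <stdint.h>

#define RING_WORDS 16u  /* hypothetical size, kept tiny for illustration */

static uint32_t ring[RING_WORDS]; /* global, so it is easy to find in a core */
static uint32_t head;             /* index of the next word to write */
static uint32_t tail;             /* index of the oldest record's length word */
static uint32_t used;             /* words currently occupied (convenience) */

/* "Garbage collector": drop whole records from the tail until
 * `need` words are free. */
static void reclaim(uint32_t need)
{
    assert(need <= RING_WORDS);
    while (RING_WORDS - used < need) {
        uint32_t len = ring[tail];      /* first word of a record = its length */
        tail = (tail + len) % RING_WORDS;
        used -= len;
    }
}

/* Store one record: a length word followed by n payload words.
 * Returns 0 on success, -1 if the record can never fit. */
static int ring_put(const uint32_t *payload, uint32_t n)
{
    uint32_t len = n + 1;

    if (len > RING_WORDS)
        return -1;
    reclaim(len);
    ring[head] = len;
    head = (head + 1) % RING_WORDS;
    for (uint32_t i = 0; i < n; i++) {  /* one-word copies, no memcpy */
        ring[head] = payload[i];
        head = (head + 1) % RING_WORDS;
    }
    used += len;
    return 0;
}

/* Walk from tail to head and count the records still in the buffer. */
static uint32_t ring_count(void)
{
    uint32_t count = 0, pos = tail, left = used;

    while (left > 0) {
        uint32_t len = ring[pos];
        pos = (pos + len) % RING_WORDS;
        left -= len;
        count++;
    }
    return count;
}
```

With a 16-word buffer and 4-word records, a fifth ring_put() pushes out the
oldest record and ring_count() stays at four.  There is no allocator here:
space is only ever reclaimed by advancing tail.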

> I would maintain that I'd prefer a ring buffer of messages logged (at +n
> log levels above the actual log level configured), in their printable
> char[] form.
> 

Too much overhead; see below.  Trace data would then also be lost during
normal runtime operation.  IMO we want to maintain this trace
information at all times during every run.

> That would also ensure that we are looking at exactly the same strings
> as were eventually logged externally, which could be very much useful to
> correlate the logs from the blackbox with those in the system logs.
> 
This is indeed a problem in the current design.
> 
> Regards,
>     Lars

Let me explain the real world problem and design goals I had when I
started so you understand where the implementation comes from.

I have a real world problem in the checkpoint service: under certain
circumstances, after 10-12 hrs of test runs, checkpoint segfaults during
recovery.  I have instrumented it heavily with logsys and the error goes
away entirely (because the instrumentation dramatically changes the
timing characteristics).  So there is the problem - instrumentation
overhead causes errors that were once there to disappear.

How will this be used?  All of those printfs will be turned into log_rec
messages.  Since log_rec is super high performance, I am hopeful it
doesn't impact the reproduction of the error.  Also those log_rec
commands will be _left in the code_ permanently.  Anytime there is a
field failure, instead of providing a log file, users would provide the
flight data which could be analyzed and debugged.  I could look at the
trace data and say "wow that event looks out of place..".  Complete data
structures can be logged in binary form instead of formatting their
individual fields with sprintf.

The goals are:
1) super high performance
2) tracing left in permanently
3) uses circular buffer properly and efficiently
4) easy to recover trace data from core files
5) allows variable length parameters to be logged which are binary in
format (such as data structures)
6) records all information necessary for the person supporting the
failure to understand exactly what happened during execution.
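As a rough illustration of goals 1 and 5, the sketch below packs a variable
length list of variable length binary arguments into 32-bit words and reads
them back without any string formatting at record time.  This is not the
attached code: the record layout, rec_encode(), rec_arg() and REC_END are
hypothetical names chosen for the example, and strings such as the
subsystem name are simply passed as ordinary arguments.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_END ((const void *)0)   /* terminates the argument list */

/* Hypothetical layout: [total words][ident][line][record number], then
 * per argument: [length in bytes][payload padded to a whole word].
 * Arguments are (const void *ptr, size_t len) pairs ending with REC_END.
 * Returns the number of words written, or 0 if the record does not fit. */
static size_t rec_encode(uint32_t *out, size_t cap, uint32_t ident,
                         uint32_t line, uint32_t rec_no, ...)
{
    va_list ap;
    size_t pos = 4;                 /* words 0..3 hold the header */

    assert(out != NULL);
    if (cap < 4)
        return 0;
    va_start(ap, rec_no);
    for (;;) {
        const void *arg = va_arg(ap, const void *);
        if (arg == NULL)
            break;
        size_t len = va_arg(ap, size_t);
        size_t words = (len + 3) / 4;

        if (pos + 1 + words > cap) {
            va_end(ap);
            return 0;
        }
        out[pos++] = (uint32_t)len;
        memset(&out[pos], 0, words * 4);   /* zero the padding bytes */
        memcpy(&out[pos], arg, len);       /* the only copy of the data */
        pos += words;
    }
    va_end(ap);
    out[0] = (uint32_t)pos;
    out[1] = ident;
    out[2] = line;
    out[3] = rec_no;
    return pos;
}

/* Return a pointer to argument number idx (0-based) and store its byte
 * length in *len, or return NULL if the record has fewer arguments. */
static const void *rec_arg(const uint32_t *rec, unsigned idx, size_t *len)
{
    size_t pos = 4;

    while (pos < rec[0]) {
        size_t n = rec[pos];
        if (idx == 0) {
            *len = n;
            return &rec[pos + 1];
        }
        pos += 1 + (n + 3) / 4;
        idx--;
    }
    return NULL;
}
```

A call such as rec_encode(buf, 32, 1u, 120u, 5u, "CKPT", (size_t)5,
(const void *)state, sizeof state, REC_END) stores the subsystem string and
a whole data structure in one record.  Lengths must be passed as size_t
because the function is variadic.  A separate tool can later walk the
record with rec_arg() and do the formatting off the hot path.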

