Re: rasdaemon and abrt

Junliang Li Tue, 08 Oct 2013 20:26:57 -0700

On Wed, 2013-10-02 at 12:32 +0200, Denys Vlasenko wrote:
> On 10/01/2013 10:15 PM, Petr Holasek wrote:
> > On Tue, 01 Oct 2013, Denys Vlasenko wrote:
> >> On 09/27/2013 09:29 AM, Jiri Moskovcak wrote:
> >>> On 09/27/2013 08:46 AM, Junliang Li wrote:
> >>>> Fedora 19 has a tool named rasdaemon. From its maintainer, Mauro, I know
> >>>> that rasdaemon and ABRT will work together.  But I don't know much about
> >>>> that. Would anyone introduce something about rasdaemon and ABRT?
> >>>
> >>> Denys is responsible for rasdaemon&ABRT integration, so I'm adding him to 
> >>> the loop.
> >>
> >> IIUC rasdaemon does not send its data yet to abrt.
> >>
> >> rasdaemon developers work on the way to prevent
> >> floods of error reports: it's semi-trivial to generate
> >> a single report about an isolated ECC error on PCIe bus;
> >> but what if there are thousands of them per second?
> >>
> >> We (abrt team) provided documentation necessary
> >> to use abrt's "create problem data" API.
> >>
> >> We are ready to aid rasdaemon people if they have
> >> questions or proposals for changes in abrt.
> >> Some of them (Petr Holasek) are colocated with
> >> abrt team and can just walk over and talk with us.
> >>
> > 
> > Hello all,
> > 
> > to be honest, I still can't find time for digging into implementation of abrt
> > hook for rasdaemon as well as we still wait for Intel guys who implement code
> > for reducing floods of errors in some reasonable manner.
> 
> How about reporting first detected error to abrt right away, then,
> if more errors happen, hold on for a few seconds, then
> batch-report them as one problem ("1234 PCIe parity errors happened
> at 12:34 during 4 seconds on the device FOO" would be a nice way to report
> such a problem).
> 
> Increase cooldown period if errors keep coming, with a cap.
> We have something like this elsewhere in abrt:
> 
> unsigned cooldown_sec = 5;
> ...
>         cooldown_sec *= cooldown_sec;
>         if (cooldown_sec > 15 * 60)
>                 cooldown_sec = 15 * 60;
> 
> With formulas like above cooldown rises quickly, resulting in just
> a few problem reports even with constant flood of error events;
> yet, it does not grow to astronomical values - "collect PCIe errors
> for next 27 hours and report them as one"
> is obviously a bad idea too.
>


A cooldown period is a good idea. Letting sysadmins customize their report
threshold in rasdaemon would be OK. Maybe we just need to add a plugin to
rasdaemon that customizes the threshold and works as an abrt hook.
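
To make the quoted proposal concrete, here is a rough sketch of how the
"report the first error, then batch with a growing cooldown" idea might
look. It is only an illustration under assumptions: struct error_batch,
batch_on_error() and batch_on_tick() are hypothetical names, not part of
rasdaemon or abrt, and the caller is assumed to supply the current time in
seconds and to do the actual reporting. Only the squaring-with-cap step is
taken from the snippet quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define COOLDOWN_START_SEC 5u          /* initial hold-off, as in the quoted snippet */
#define COOLDOWN_CAP_SEC   (15u * 60u) /* cap of 15 minutes, as in the quoted snippet */

/* Hypothetical per-device state; not a rasdaemon or abrt type. */
struct error_batch {
    unsigned cooldown_sec; /* current hold-off window; 0 means idle */
    unsigned pending;      /* errors collected during the current window */
    long window_start;     /* time (seconds) at which the window opened */
};

/* Grow the cooldown as in the quoted snippet: square it, then cap it. */
unsigned next_cooldown(unsigned cooldown_sec)
{
    cooldown_sec *= cooldown_sec;
    if (cooldown_sec > COOLDOWN_CAP_SEC)
        cooldown_sec = COOLDOWN_CAP_SEC;
    return cooldown_sec;
}

/*
 * Call for every error event.
 * Returns 1 if this error should be reported right away (first error
 * after an idle period), 0 if it was only counted for a later batch.
 */
unsigned batch_on_error(struct error_batch *b, long now)
{
    assert(b != NULL);
    if (b->cooldown_sec == 0) {
        b->cooldown_sec = COOLDOWN_START_SEC;
        b->window_start = now;
        return 1;
    }
    b->pending++;
    return 0;
}

/*
 * Call periodically (e.g. from a timer).
 * Returns the number of errors to report as one batched problem,
 * or 0 if the window is still open or nothing was collected.
 */
unsigned batch_on_tick(struct error_batch *b, long now)
{
    unsigned n;

    assert(b != NULL);
    if (b->cooldown_sec == 0 || now - b->window_start < (long)b->cooldown_sec)
        return 0;

    n = b->pending;
    b->pending = 0;
    if (n == 0) {
        /* Quiet window: go back to idle so the next error is reported at once. */
        b->cooldown_sec = 0;
    } else {
        /* Errors keep coming: open a longer window. */
        b->cooldown_sec = next_cooldown(b->cooldown_sec);
        b->window_start = now;
    }
    return n;
}
```

With these starting values the window goes 5 s, 25 s, 625 s and then stays
at the 900 s cap, so a constant flood produces only a handful of reports.
A sysadmin-tunable threshold could replace the two constants with values
read from configuration; how rasdaemon would expose that is left open here.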

Regards,
Junliang Li
