On Wed, Dec 2, 2020 at 11:33 PM Joe Wulf <joe_w...@yahoo.com> wrote:

> I would like to suggest providing a mechanism where admins can query the
> status or state of backlog issues (wait time, sums, etc...). Maybe the
> intent is to expand the output of status checking of auditd.
>
> I believe further clarity is beneficial on the setting of the
> 'backlog_wait_sum' (or to whatever the name evolves to) initially.
> - How it evolves over time
> - What the conditions in the system, or auditing, would change it
> - What conditions admins should pay attention to for informational
>   understanding of status
> - What conditions admins should realize exist such that adjustments are
>   needed (and suggestions to what those adjustments should be)
> - What new guidance will admins have for building adjusting audit.rules
>   around this
>
> Consider the scenario where auditing has been 'working fine' for days.
> Little to no active admin monitoring.
> Events occur to spike the auditing such that backlogging of audit records
> dramatically increases.
> (for some reason) admins now come looking to investigate.
> Assuming they do: 'systemctl status auditd' the newly presented 'state'
> of the 'backlog_wait_sum' will show some evidence.
> Q: Is that just a moment in time?
> Q: What information here will give the perspective things are good/ok
> 'now', versus some action needs to be taken?
>
> Maybe that isn't a great scenario, or good questions----it is what occurs
> to me at the moment.
>
> Thank you.
>
> R,
> -Joe Wulf
>
>
> On Wednesday, July 1, 2020, 5:33:14 PM EDT, Max Englander <
> max.englan...@gmail.com> wrote:
>
> > In environments where the preservation of audit events and predictable
> > usage of system memory are prioritized, admins may use a combination of
> > --backlog_wait_time and -b options at the risk of degraded performance
> > resulting from backlog waiting. In some cases, this risk may be
> > preferred to lost events or unbounded memory usage. Ideally, this risk
> > can be mitigated by making adjustments when backlog waiting is
> > detected.
> >
> > However, detection can be difficult using the currently available
> > metrics. For example, an admin attempting to debug degraded performance
> > may falsely believe a full backlog indicates backlog waiting. It may
> > turn out the backlog frequently fills up but drains quickly.
> >
> > To make it easier to reliably track degraded performance to backlog
> > waiting, this patch makes the following changes:
> >
> > Add a new field backlog_wait_sum to the audit status reply. Initialize
> > this field to zero. Add to this field the total time spent by the
> > current task on scheduled timeouts while the backlog limit is exceeded.
> >
> > Tested on Ubuntu 18.04 using complementary changes to the audit
> > userspace: https://github.com/linux-audit/audit-userspace/pull/134.
> > <snip>
Hi Joe,

Not sure I can address all your points above, but the way that we monitor
Linux audit internals at my employer is to continuously monitor the audit
status response with short evaluation windows.

- We compute a rate of change on the lost field, and alert if there are
  more than N lost records per second on average
- We compute the backlog utilization as backlog/backlog_limit, and alert
  if that goes above 75% at any point in time
- If/when we run on a kernel that has backlog_wait_time_actual, we'll
  monitor that as well, setting thresholds around where we'd expect
  growth in this value to result in service degradation

If we get an alert, and it is just a blip that goes away and doesn't come
back, we probably won't spend a lot of time investigating. However, if we
see that the alert is frequently active across multiple hosts, that will
prompt us to investigate.

As far as what action we would take, it would depend on the precise values
in the audit status reply, as well as other information we had gathered
from our system. For example, if we observed elevated values for backlog
and backlog_wait_time_actual, we might first investigate other
environmental factors, such as whether the auditd daemon had crashed or was
starved for CPU time. If we saw that lost was high but backlog was low,
that might indicate to us that the rate limit is being exceeded, or that
the kernel is out of memory.

I agree with you that it would help to expand the metrics reported in
audit status. For example, reporting the number of times an audit record
was lost due to the rate limit being exceeded would help.

Not sure how responsive this is to your questions. Hope it helps some.

Thanks,
Max
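
For illustration, the first two checks described in the reply above (lost
rate and backlog utilization) could be sketched roughly as follows. This is
only a sketch under stated assumptions, not the monitoring code referred to
in the reply: it assumes two audit status samples (for example, as reported
by `auditctl -s`) have already been parsed into records with lost, backlog,
and backlog_limit fields. The AuditStatus class, the evaluate function, and
the default thresholds are hypothetical names and values chosen for the
example; the 1.0 records/second default stands in for the "N" left
unspecified above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuditStatus:
    """One sample of the audit status reply (hypothetical container)."""
    timestamp: float    # seconds, from any monotonic clock
    lost: int           # cumulative count of lost records
    backlog: int        # records currently queued
    backlog_limit: int  # configured maximum queue size


def evaluate(prev: AuditStatus, curr: AuditStatus,
             max_lost_per_sec: float = 1.0,   # assumed value for "N"
             max_backlog_util: float = 0.75) -> List[str]:
    """Return alert messages for the window between two samples."""
    alerts = []

    # Average rate of change of the cumulative lost counter.
    elapsed = curr.timestamp - prev.timestamp
    if elapsed > 0:
        lost_rate = (curr.lost - prev.lost) / elapsed
        if lost_rate > max_lost_per_sec:
            alerts.append(f"lost rate {lost_rate:.2f}/s is above "
                          f"{max_lost_per_sec:.2f}/s")

    # Point-in-time backlog utilization (backlog / backlog_limit).
    if curr.backlog_limit > 0:
        util = curr.backlog / curr.backlog_limit
        if util > max_backlog_util:
            alerts.append(f"backlog utilization {util:.0%} is above "
                          f"{max_backlog_util:.0%}")

    return alerts


if __name__ == "__main__":
    # Made-up sample values, ten seconds apart.
    before = AuditStatus(timestamp=0.0, lost=10, backlog=100,
                         backlog_limit=8192)
    after = AuditStatus(timestamp=10.0, lost=40, backlog=7000,
                        backlog_limit=8192)
    for message in evaluate(before, after):
        print(message)
```

Deciding whether an alert is a one-off blip or a recurring condition across
hosts would sit in whatever alerting system consumes these messages; the
sketch covers only the per-window evaluation.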

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit