On 2018-06-14 23:12, Preston Bennes wrote: > Greetings audit mailing list, > > I've got an AWS instance running an Amazon Linux kernel, > 4.9.77-31.58.amzn1.x86_64 > with the base OS being CentOS 6. The instance had one program with the > audit handle (Proprietary closed source software, Lacework agent). I > installed and started OSQuery which attempted to acquire the audit handle. > I'm unsure if osquery was successful or not, because I was unable to ssh in > to the server to investigate. I ended up having to restart the instance. > Almost immediately after starting osquery, sshd got stuck in D state. > syslog has a hung task warning and backtrace that provides some information:
[reformatting dump info to be more readable...] > INFO: task sshd:1840 blocked for more than 10 seconds. > Tainted: G E 4.9.77-31.58.amzn1.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > sshd D 0 1840 1839 0x00000080 > 0000000000000000 ffff8802025c6540 ffff88003684d940 ffff880205f3bb80 > ffff8802072582c0 ffffc900049bfc60 ffffffff81556e62 0000000000000001 > 004200ca00000001 ffff8802072582c0 0000000000000000 ffffffff81a65140 > Call Trace: > [<ffffffff81556e62>] ? __schedule+0x242/0x700 > [<ffffffff8155734c>] schedule+0x2c/0x80 > [<ffffffff815575ee>] schedule_preempt_disabled+0xe/0x10 > [<ffffffff81558f05>] __mutex_lock_slowpath+0x95/0x110 > [<ffffffff8147a6f8>] ? __alloc_skb+0x78/0x1e0 > [<ffffffff81558f97>] mutex_lock+0x17/0x30 > [<ffffffff811178bd>] audit_receive+0x1d/0x90 > [<ffffffff814c4976>] netlink_unicast+0x176/0x220 > [<ffffffff814c4cf6>] netlink_sendmsg+0x2d6/0x390 > [<ffffffff814719fe>] sock_sendmsg+0x3e/0x50 > [<ffffffff81471ead>] SYSC_sendto+0x11d/0x150 > [<ffffffff8111c68b>] ? __audit_syscall_entry+0xbb/0x100 > [<ffffffff81003478>] ? syscall_trace_enter+0x1c8/0x2c0 > [<ffffffff814728ee>] SyS_sendto+0xe/0x10 > [<ffffffff81003b09>] do_syscall_64+0x59/0xc0 > [<ffffffff8155bd70>] entry_SYSCALL64_slow_path+0x25/0x25 > > I've been doing some reading (ex. > https://www.redhat.com/archives/linux-audit/2016-February/msg00025.html ) > and my understanding is that osquery should have been able to acquire the > audit handle, trampling lacework's ("Last one wins"), but I don't have "Last one wins" has been fixed for a couple of years now. That behaviour orphaned a previous daemon which was a bug. For your usage, there still can "only be one". There is work to enable multiple audit daemons, but not for this sort of reason. > access to the Lacework code to know how it might handle that situation (I'm > engaging their support separately). I also noticed the patch set for 4.17 > seemed to include some changes around the code path in the backtrace. I'm > trying to understand this behavior and determine if it's a bug, if said bug > has already been fixed by a patch between 4.9.77 and 4.17, or that this > issue is a lack of my understanding of the behavior of the audit system. It > is surprising to me that an audit system related issue would result in sshd > getting stuck in D state. Several other processes on the system continued > running without incident. Processed launched out of cron also got stuck in > D state. I would be grateful for some expert insight. If this isn't a bug > and is a misunderstanding on my part, is there any way to configure the > audit system such that an issue won't result in processes getting stuck in > D state? I don't quite understand how sshd is implicated other than it sending USER class messages to audit for logging. This shouldn't conflict with any daemon contention issues. > Thanks, > Preston Bennes - RGB -- Richard Guy Briggs <[email protected]> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 -- Linux-audit mailing list [email protected] https://www.redhat.com/mailman/listinfo/linux-audit
