Last thing: moving synchronize_srcu(&fsnotify_mark_srcu) out of the for (;;) loop in fs/notify/mark.c appears to solve the stability issues for me. I don't know enough about kernel internals to tell whether this change does other bad things to my system.
Cheers,
peter

On Tue, Apr 17, 2012 at 11:24 AM, Peter Moody <[email protected]> wrote:
> and my config.gz
>
> On Tue, Apr 17, 2012 at 10:56 AM, Peter Moody <[email protected]> wrote:
>> Here's a trace with debugging turned way up plus a few extra printk's
>> added to fs/notify/mark.c. I'm looping through private_destroy_list
>> before and after the call to synchronize_srcu.
>>
>> I can reproduce this reliably with kvm with 2 virtual processors:
>> Linux desktop 3.4.0-rc3-oops1+ #1 SMP Tue Apr 17 09:59:44 PDT 2012
>> x86_64 GNU/Linux
>>
>> Cheers,
>> peter
>>
>> On Thu, Apr 5, 2012 at 2:07 PM, Eric Paris <[email protected]> wrote:
>>> please please please keep on list. Everything you say might help track
>>> it down!
>>>
>>> On Thu, 2012-04-05 at 14:03 -0700, Peter Moody wrote:
>>>> (please let me know if I should take this off-list)
>>>>
>>>> One other thing (again, maybe already known), but this seems to be
>>>> exacerbated by SMP. On my machine, I can't reproduce the crash if I
>>>> boot with maxcpus=1.
>>>>
>>>> Still hunting.
>>>>
>>>> Cheers,
>>>> peter
>>>>
>>>> On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody <[email protected]> wrote:
>>>> > This may already be known, but the issue seems to be limited to watch
>>>> > rules. With any watch rules, I can reliably crash my machine while
>>>> > freeing a watch rule after only starting/stopping auditd a few times.
>>>> > With no watch rules, I have no issues.
>>>> >
>>>> > Cheers,
>>>> > peter
>>>> >
>>>> > On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram <[email protected]> wrote:
>>>> >> Yes, i know that patch. It made it into kernel 3.2.2. I tested it
>>>> >> successfully (oops in 3.2.1, no oops in 3.2.9), but this oops i'm
>>>> >> seeing is also in 3.2.9.
>>>> >>
>>>> >> I monitored changelogs since 3.2.1 to 3.2.12 but there were no fixes
>>>> >> either in audit subsystem or in fsnotify.
>>>> >> I'll try to reproduce in latest 3.2.13 and repost the oops, but i'm
>>>> >> 99% confident it will be the same.
>>>> >>
>>>> >> Sadly nobody except you seems to pay attention to this problem, probably
>>>> >> because it requires special conditions to reproduce (really, who starts
>>>> >> and stops auditd every 5 seconds on a production server?). We only ran
>>>> >> into it because one of our servers would randomly oops and then freeze
>>>> >> about each month after stopping and then starting auditd every morning
>>>> >> (and the stop-start sequence was needed to work around a bug somewhere
>>>> >> that would hang a gzip running on a file outside a watched folder).
>>>> >>
>>>> >> Anyway, as a last note, i have a feeling that the oops is not exactly
>>>> >> random, there is a pattern, just that i haven't figured it out
>>>> >> completely yet.
>>>> >>
>>>> >> Will keep you up to date with the things i find out.
>>>> >>
>>>> >> V.
>>>> >>
>>>> >> On Mar 29, 2012 4:14 AM, "Eric Paris" <[email protected]> wrote:
>>>> >>>
>>>> >>> That patch fixes a BUG(). The report has a NULL ptr deref and some
>>>> >>> apparent list corruption.... Sadly they aren't the same....
>>>> >>>
>>>> >>> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
>>>> >>> > fyi: this patch [1] seems to fix the issue for me. The explanation in
>>>> >>> > the subject would reliably oops my machine.
>>>> >>> >
>>>> >>> > [1]
>>>> >>> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fed474857efbed79cd390d0aee224231ca718f63
>>>> >>> >
>>>> >>> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody <[email protected]> wrote:
>>>> >>> > > Are you still able to reliably reproduce this oops?
>>>> >>> > > I'm trying to track this down because this bug (or a very similar
>>>> >>> > > bug) is causing some significant headaches here at work, but I
>>>> >>> > > haven't had a lot of luck. I'm using usermode linux, though, so that
>>>> >>> > > might be interfering with things.
>>>> >>> > >
>>>> >>> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram <[email protected]> wrote:
>>>> >>> > >> Finally i found some time and spare server to retest the oops and
>>>> >>> > >> list_add corruptions i was getting with the 3.x kernels and
>>>> >>> > >> auditd 2.1.3.
>>>> >>> > >>
>>>> >>> > >> I tested now with gentoo's latest stable 3.2.1-gentoo-r2 and
>>>> >>> > >> kernel.org's 3.2.9.
>>>> >>> > >>
>>>> >>> > >> Both get the oops/BUG in the same way and after that, they keep
>>>> >>> > >> pouring list_add corruptions with audit_prune_tre(truncated?) and
>>>> >>> > >> auditctl as comms.
>>>> >>> > >>
>>>> >>> > >> Since this is not about Gentoo's kernel only, i'll post here the
>>>> >>> > >> oops in 3.2.9 and also attach some list_add corruptions.
>>>> >>> > >>
>>>> >>> > >> 3.2.9 BUG:
>>>> >>> > >>
>>>> >>> > >> kernel: [ 301.240011] BUG: unable to handle kernel NULL pointer dereference at (null)
>>>> >>> > >> kernel: [ 301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
>>>> >>> > >> kernel: [ 301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
>>>> >>> > >> kernel: [ 301.240698] Oops: 0000 [#1] SMP
>>>> >>> > >> kernel: [ 301.240910]
>>>> >>> > >> kernel: [ 301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc.
>>>> >>> > >> PowerEdge 2950/0CX396
>>>> >>> > >> kernel: [ 301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
>>>> >>> > >> kernel: [ 301.241498] EIP is at __list_del_entry+0x20/0xe0
>>>> >>> > >> kernel: [ 301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
>>>> >>> > >> kernel: [ 301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
>>>> >>> > >> kernel: [ 301.241879] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>>>> >>> > >> kernel: [ 301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
>>>> >>> > >> kernel: [ 301.242207] Stack:
>>>> >>> > >> kernel: [ 301.242327] c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
>>>> >>> > >> kernel: [ 301.242882] ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
>>>> >>> > >> kernel: [ 301.243438] f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
>>>> >>> > >> kernel: [ 301.243995] Call Trace:
>>>> >>> > >> kernel: [ 301.244119] [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
>>>> >>> > >> kernel: [ 301.244248] [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
>>>> >>> > >> kernel: [ 301.244377] [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
>>>> >>> > >> kernel: [ 301.244504] [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
>>>> >>> > >> kernel: [ 301.244631] [<c1052834>] kthread+0x74/0x80
>>>> >>> > >> kernel: [ 301.244756] [<c10527c0>] ?
>>>> >>> > >> kthread_flush_work_fn+0x10/0x10
>>>> >>> > >> kernel: [ 301.244885] [<c1582ab6>] kernel_thread_helper+0x6/0xd
>>>> >>> > >> kernel: [ 301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
>>>> >>> > >> kernel: [ 301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
>>>> >>> > >> kernel: [ 301.248414] CR2: 0000000000000000
>>>> >>> > >> kernel: [ 301.248538] ---[ end trace 15082dbfb353f84c ]---
>>>> >>> > >>
>>>> >>> > >> The kernel was compiled with the following DEBUG support (the
>>>> >>> > >> bolded ones were requested by Gentoo's Dev):
>>>> >>> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
>>>> >>> > >> CONFIG_SLUB_DEBUG=y
>>>> >>> > >> CONFIG_HAVE_DMA_API_DEBUG=y
>>>> >>> > >> CONFIG_X86_DEBUGCTLMSR=y
>>>> >>> > >> CONFIG_PNP_DEBUG_MESSAGES=y
>>>> >>> > >> CONFIG_AIC94XX_DEBUG=y
>>>> >>> > >> CONFIG_USB_DEBUG=y
>>>> >>> > >> CONFIG_DEBUG_KERNEL=y
>>>> >>> > >> CONFIG_SCHED_DEBUG=y
>>>> >>> > >> CONFIG_DEBUG_RT_MUTEXES=y
>>>> >>> > >> CONFIG_DEBUG_PI_LIST=y
>>>> >>> > >> CONFIG_DEBUG_BUGVERBOSE=y
>>>> >>> > >> CONFIG_DEBUG_INFO=y
>>>> >>> > >> CONFIG_DEBUG_MEMORY_INIT=y
>>>> >>> > >> CONFIG_DEBUG_LIST=y
>>>> >>> > >> CONFIG_DEBUG_STACKOVERFLOW=y
>>>> >>> > >> CONFIG_DEBUG_RODATA=y
>>>> >>> > >> CONFIG_DEBUG_RODATA_TEST=y
>>>> >>> > >>
>>>> >>> > >> I attached the kernel config i used for 3.2.9 to generate this
>>>> >>> > >> oops and warnings.
>>>> >>> > >>
>>>> >>> > >> From the list_add warnings that come after, out of 805 warnings i
>>>> >>> > >> processed, after masking with XXXXX the PID and next= values that
>>>> >>> > >> kept changing in every one, i got 26 types of MD5. I also attached
>>>> >>> > >> the files relevant as an archive to this email.
>>>> >>> > >>
>>>> >>> > >> The Gentoo bug i opened is sleeping, it seems nobody has the time
>>>> >>> > >> to at least test to confirm or not the problems i'm seeing (or
>>>> >>> > >> everybody's thinking that nobody would restart auditd so often, so
>>>> >>> > >> the bug it's not that serious).
>>>> >>> > >>
>>>> >>> > >> Thank you for your time.
>>>> >>> > >>
>>>> >>> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram <[email protected]> wrote:
>>>> >>> > >>
>>>> >>> > >> --
>>>> >>> > >> Linux-audit mailing list
>>>> >>> > >> [email protected]
>>>> >>> > >> https://www.redhat.com/mailman/listinfo/linux-audit
>>>> >>> > >
>>>> >>> > > --
>>>> >>> > > Peter Moody Google 1.650.253.7306
>>>> >>> > > Security Engineer pgp:0xC3410038
>>>> >>> >
>>>> >>>
>>>> >>
>>>> >
>>>> > --
>>>> > Peter Moody Google 1.650.253.7306
>>>> > Security Engineer pgp:0xC3410038
>>>>
>>>
>>
>> --
>> Peter Moody Google 1.650.253.7306
>> Security Engineer pgp:0xC3410038
>
> --
> Peter Moody Google 1.650.253.7306
> Security Engineer pgp:0xC3410038

--
Peter Moody Google 1.650.253.7306
Security Engineer pgp:0xC3410038

--
Linux-audit mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-audit
