Hello James,

James B:
> I've gotten this kernel oops after running the system continuously for about 
> 14-15 hours. 
> This only happens when the system is under load most of the time (load is 
> around 95% of CPU most of the time).
> If the system load is lower (idle being 40-50% load) this doesn't happen.

Although I can guess it is the repeated mv, ls, cat, touch, etc. under
aufs, would you describe the heavy load more specifically?


> Kernel source: https://github.com/SolidRun/linux-linaro-stable-mx6
> Kernel version: 3.10.30 (patched with aufs3.10.x)
> Kernel arch: armv7 dual processor (SMP) - cubox-i

This source tree looks different from v3.10.30; at least these are added.
- ARM architecture specific modifications (arch/, drivers/, firmware/,
  include/, sound/ and tools/).
- block device has POWER_EFFICIENT workqueue (block/).
- debugfs supports atomic_t (fs/debugfs/).
- F2FS_FS_SECURITY, F2FS_CHECK_FS for f2fs, but you don't have such
  branches (fs/f2fs/).
- jffs2 is aware of MLC NAND (fs/jffs2/).
- ZBUD and ZSWAP look interesting.
- heterogeneous multiprocessor sounds interesting too...

All these look unrelated to aufs.
As long as ZBUD and ZSWAP are working correctly (if you enable them),
the problem seems to be inside aufs, I am afraid.


> I've already compiled the kernel with DEBUG_INFO, AUFS_DEBUG, extended 
> BUG_ON, and DYNAMIC_PRINTK (but I don't see aufs debug statements in dmesg 
> although I have already echo "module aufs +p" > 
> /sys/kernel/debug/dynamic_debug/control at the beginning). 

Aufs has its own debug feature (from long before dynamic_debug was
introduced). You can enable the module parameter "debug" dynamically.
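
For example, the parameter could be switched at runtime with something
like the Python sketch below. It assumes the parameter is exposed at the
usual sysfs location for module parameters
(/sys/module/aufs/parameters/debug), that the kernel was built with
AUFS_DEBUG, and that you run it as root. The helper name set_aufs_debug
is made up for illustration.

```python
from pathlib import Path

# Assumption: the usual sysfs location for a module parameter named "debug".
AUFS_DEBUG_PARAM = Path("/sys/module/aufs/parameters/debug")


def set_aufs_debug(enabled, param=AUFS_DEBUG_PARAM):
    """Write 1 (enable) or 0 (disable) to the parameter file.

    Returns the string that was written. `param` can point at another
    file, which is handy for trying the helper without touching sysfs.
    """
    value = "1" if enabled else "0"
    Path(param).write_text(value + "\n")
    return value


# Typical use, as root, before starting the workload:
#     set_aufs_debug(True)
# and afterwards:
#     set_aufs_debug(False)
```

The aufs debug messages should then appear in dmesg while the parameter
is set to 1.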


> I have only used the first 3 aufs patches (aufs3-kbuild, aufs3-base, 
> aufs3-mmap). I didn't use aufs3-loopback (actually, in earlier builds, I used 
> that too - and got the same bug). The version of aufs used is the latest (as 
> of yesterday) commit for 3.10.x branch: 
> b72117ef0c528c4f6013845e6e704d92682fc913

And how do you mount aufs? Would you post your /proc/mounts?


> My apology for the large email size: I'm attaching the oops detail at the end 
> of this email since I'm not sure whether the list accept attachments).

No problem.


> From what I can see, it is the first kernel null pointer that is the problem. 
> The kernel doesn't immediately crash and continues to run, and all subsequent 
> access to aufs fails and results in more BUG_ON later on.

Agreed.
Since the message is produced by __do_kernel_fault() in
arch/arm/mm/fault.c, it might be better to check ZBUD and ZSWAP as well
as aufs_rename().


> [55096.152331] Unable to handle kernel NULL pointer dereference at virtual 
> address 0000003f
        :::
> [55096.189965] CPU: 0 PID: 31816 Comm: mv Not tainted 3.10.30 #10
> [55096.194501] task: 8e3f9c00 ti: a8cdc000 task.ti: a8cdc000
> [55096.198608] PC is at dput+0x30/0x214
> [55096.200889] LR is at aufs_rename+0x1ea4/0x1ee4
        :::
> [55096.237572] Process mv (pid: 31816, stack limit = 0xa8cdc238)
> [55096.242018] Stack: (0xa8cddde0 to 0xa8cde000)
        :::
> [55096.362071] [<800d6da8>] (dput) from [<8022f908>] 
> (aufs_rename+0x1ea4/0x1ee4)
> [55096.367915] [<8022f908>] (aufs_rename) from [<800d059c>] 
> (vfs_rename+0x170/0x454)
> [55096.374106] [<800d059c>] (vfs_rename) from [<800d131c>] 
> (SyS_renameat+0x244/0x280)
> [55096.380385] [<800d131c>] (SyS_renameat) from [<8000eb00>] 
> (ret_fast_syscall+0x0/0x30)
> [55096.386918] Code: f57ff04f e320f004 e3540000 0a000039 (e5943050) 
        :::

I am unfamiliar with this format for ARM. Anyway, it tells us the problem
happened in dput(), which is called by aufs_rename().
Do you know the meaning of "stack limit = 0xa8cdc238"?
It looks outside the range of the stack shown in the next line. Is it a
problem of ZBUD or ZSWAP?
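
The numbers in the oops can be cross-checked with a little arithmetic.
The Python below is only a sketch under several assumptions: that the
parenthesized word in the "Code:" line is the faulting instruction, that
it is a 32-bit ARM (not Thumb) encoding, that the kernel stack region is
8 KiB, and that it starts at the "ti:" address. The function names are
made up for illustration.

```python
# Values copied from the quoted oops.
FAULT_ADDR = 0x0000003F                 # "virtual address 0000003f"
FAULT_INSN = 0xE5943050                 # parenthesized word in the "Code:" line
THREAD_INFO = 0xA8CDC000                # "ti: a8cdc000"
STACK_LIMIT = 0xA8CDC238                # "stack limit = 0xa8cdc238"
STACK_DUMP = (0xA8CDDDE0, 0xA8CDE000)   # "Stack: (0xa8cddde0 to 0xa8cde000)"

# Assumption, not taken from the oops: 8 KiB kernel stacks.
THREAD_SIZE = 8192


def decode_ldr_str_imm(word):
    """Decode an ARM LDR/STR of the plain form [Rn, #imm], else return None."""
    if (word >> 28) == 0xF:
        return None                      # unconditional space, not handled
    if (word >> 25) & 0b111 != 0b010:
        return None                      # not a load/store with immediate
    pre_indexed = (word >> 24) & 1       # P bit
    add_offset = (word >> 23) & 1        # U bit
    byte_access = (word >> 22) & 1       # B bit
    writeback = (word >> 21) & 1         # W bit
    is_load = (word >> 20) & 1           # L bit
    if not pre_indexed or writeback:
        return None                      # keep the sketch to plain offsets
    name = ("ldr" if is_load else "str") + ("b" if byte_access else "")
    rn = (word >> 16) & 0xF              # base register
    rt = (word >> 12) & 0xF              # transfer register
    imm = word & 0xFFF
    offset = imm if add_offset else -imm
    return name, rt, rn, offset


def implied_base(fault_addr, offset):
    """Base register value for which base + offset equals the fault address."""
    return (fault_addr - offset) & 0xFFFFFFFF


def as_signed32(value):
    """Interpret a 32-bit value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


name, rt, rn, offset = decode_ldr_str_imm(FAULT_INSN)
base = implied_base(FAULT_ADDR, offset)
print(f"{name} r{rt}, [r{rn}, #{offset}]")
print(f"r{rn} would be {base:#010x} ({as_signed32(base)})")

# Compare the stack limit and the dumped range with the assumed region.
region = (THREAD_INFO, THREAD_INFO + THREAD_SIZE)
limit_in_region = region[0] <= STACK_LIMIT < region[1]
dump_in_region = region[0] <= STACK_DUMP[0] and STACK_DUMP[1] <= region[1]
print("stack limit inside region:", limit_in_region)
print("stack dump inside region:", dump_in_region)
```

If the decoding is right, the faulting instruction is a load from
r4 + 80, and r4 would hold 0xffffffef, that is -17 as a signed 32-bit
value, which looks more like an error value than a valid pointer. Under
the 8 KiB assumption, the stack limit is below the dumped range but
still inside the same region. This is only a reading of the numbers,
not a confirmed diagnosis.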

If you are using ZBUD or ZSWAP, please test them heavily without aufs if
you can. On my side, I will try to reproduce the problem on my Intel PC
once I know the details of your workload test.


J. R. Okajima
