Re: am35xx memory management issues
Hi Tony,

On 13.11.2015 15:05, Markku Ahvenjärvi wrote:
> Hi,
>
> On 12.11.2015 19:06, Tony Lindgren wrote:
>> Hi,
>>
>> * Markku Ahvenjärvi <markku.ahvenja...@nomovok.com> [151112 07:26]:
>>> Hello everyone,
>>>
>>> We have an am3517-based board and are experiencing sporadic corruption
>>> of mm structures. We've had this problem for months now and haven't
>>> really got to the bottom of it.
>>>
>>> Our board is currently using 3.18.20, but with am3517-evm we've tried
>>> pretty much everything between v3.14 and v4.2. So far we've been able
>>> to reproduce it on am3517-evm, craneboard and beagleboard (rev. C3 and
>>> C4). We have also tested am/dm37x-evm, am335x-evm and beagle bone
>>> black, with no problems seen.
>>>
>>> Usually the kernel panics in 'kernel BUG at mm/rmap.c:406!', but
>>> occasionally there are 'BUG: Bad rss-counter state' prints followed by
>>> a NULL pointer deref or another BUG statement in mm/slab.c. Sometimes
>>> a spinlock lockup or an already unlocked spinlock is reported, so it
>>> is quite random.
>>>
>>> Reproducing can take from half an hour up to a few days. We are using
>>> stress-ng with options:
>>> stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4
>>>
>>> In our tests we have noticed that the kernel configuration affects the
>>> frequency of the problem. So far we haven't seen any with
>>> omap2plus_defconfig, but with a slimmer defconfig like the one we are
>>> using for our board we can get it in a few hours. We bisected the
>>> differences between our defconfig and omap2plus_defconfig, but
>>> couldn't pinpoint any specific config that would cause these problems:
>>> it just got less frequent until it stopped occurring. To rule out any
>>> badly behaving drivers, we basically disabled everything but serial
>>> and it just kept crashing.
>>
>> Adding also LAKML to Cc. Can you check if it starts happening if you
>> leave out other omaps from .config other than CONFIG_ARCH_OMAP3?
>> That's to compile code only for ARMv7 and leave out ARMv6.
>>
>> Also please check if leaving out CONFIG_SMP_ON_UP affects things.
>
> Alright, will do.

We've been testing omap2plus defconfig without other omaps and without
CONFIG_SMP_ON_UP. So far we haven't seen any panics, but I've had only a
few units testing it.

Meanwhile we've been testing our custom board with a configuration that
is quite close to omap2plus, including other omaps and CONFIG_SMP_ON_UP.
We've had a couple of panics, so it seems that these options don't
affect the problem. We had 15 units running stress-ng and it took ~8
days until we saw the first panic, so if omap2plus is affected it is
quite rare.

Any other suggestions?

Regards,
Markku

>
>>> Someone was having quite similar problems back in 2012, but other than
>>> that we've found nothing:
>>> http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/
>>>
>>> Has anyone seen this kind of issue before? Any ideas what might cause
>>> this?
>>
>> If it starts happening after leaving out ARMv6 or SMP_ON_UP,
>> it could be a cache bug or missing errata that's needed.
>
> Right.
>
> Regards,
>
> Markku
>
>>
>> Regards,
>>
>> Tony
>>
>>
>>> [0.00] Booting Linux on physical CPU 0x0
>>> [0.00] Linux version 3.18.24 (markku@thinkpad) (gcc version 4.9.3 20141031 (prerelease) (Linaro GCC 2014.11) ) #2 PREEMPT Wed Nov 4 09:51:36 EET 2015
>>> [0.00] CPU: ARMv7 Processor [411fc087] revision 7 (ARMv7), cr=10c5387d
>>> [0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
>>> [0.00] Machine model: TI AM3517 EVM (AM3517/05 TMDSEVM3517)
>>> [0.00] cma: Reserved 8 MiB at 0x8f40
>>> [0.00] Memory policy: Data cache writeback
>>> [0.00] On node 0 totalpages: 65280
>>> [0.00] free_area_init_node: node 0, pgdat c09be980, node_mem_map cfce7000
>>> [0.00] Normal zone: 512 pages used for memmap
>>> [0.00] Normal zone: 0 pages reserved
>>> [0.00] Normal zone: 65280 pages, LIFO batch:15
>>> [0.00] HighMem zone: 1048574 pages exceeds freesize 0
>>> [0.00] CPU: All CPU(s) started in SVC mode.
>>> [0.00] AM3517 ES1.1 (l2cache sgx neon )
>>> [0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
>>> [0.00] pcpu-alloc: [0] 0
>>> [0.00] Built 1 zonelists in Zone order, mobility grouping on. Total
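
To put the "only a few units" caveat next to the "15 units, ~8 days" figure, here is a rough sketch of the arithmetic in unit-hours. It assumes panics arrive as a Poisson process with a constant rate, which the thread does not establish. The helper names and the 3-unit, 4-day quiet run are hypothetical values chosen only for illustration.

```python
import math


def unit_hours(units, days):
    """Total exposure of a test fleet in unit-hours."""
    return units * days * 24.0


def rate_upper_bound(total_unit_hours, confidence=0.95):
    """Upper bound on the panic rate (per unit-hour) after a run with
    ZERO observed panics, assuming a constant-rate Poisson process.
    At 95% confidence this is the familiar "rule of three" (about 3 / N)."""
    return -math.log(1.0 - confidence) / total_unit_hours


# From the thread: 15 units ran ~8 days before the first panic.
custom_board = unit_hours(15, 8)  # 2880.0 unit-hours

# Hypothetical quiet run (NOT from the thread): 3 units, 4 days, no panic.
quiet_run = unit_hours(3, 4)
bound = rate_upper_bound(quiet_run)

print("custom board exposure before first panic: %.0f unit-hours" % custom_board)
print("quiet run only shows MTBF > %.0f unit-hours (95%%)" % (1.0 / bound))
```

Under these assumptions, a clean run on a handful of units proves far less than the roughly 2880 unit-hours it took to see the first panic on the custom board, so a quiet omap2plus run of that size cannot yet rule the configuration out.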
Re: am35xx memory management issues
Hi,

On 12.11.2015 19:06, Tony Lindgren wrote:
> Hi,
>
> * Markku Ahvenjärvi <markku.ahvenja...@nomovok.com> [151112 07:26]:
>> Hello everyone,
>>
>> We have an am3517-based board and are experiencing sporadic corruption
>> of mm structures. We've had this problem for months now and haven't
>> really got to the bottom of it.
>>
>> Our board is currently using 3.18.20, but with am3517-evm we've tried
>> pretty much everything between v3.14 and v4.2. So far we've been able
>> to reproduce it on am3517-evm, craneboard and beagleboard (rev. C3 and
>> C4). We have also tested am/dm37x-evm, am335x-evm and beagle bone
>> black, with no problems seen.
>>
>> Usually the kernel panics in 'kernel BUG at mm/rmap.c:406!', but
>> occasionally there are 'BUG: Bad rss-counter state' prints followed by
>> a NULL pointer deref or another BUG statement in mm/slab.c. Sometimes
>> a spinlock lockup or an already unlocked spinlock is reported, so it
>> is quite random.
>>
>> Reproducing can take from half an hour up to a few days. We are using
>> stress-ng with options:
>> stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4
>>
>> In our tests we have noticed that the kernel configuration affects the
>> frequency of the problem. So far we haven't seen any with
>> omap2plus_defconfig, but with a slimmer defconfig like the one we are
>> using for our board we can get it in a few hours. We bisected the
>> differences between our defconfig and omap2plus_defconfig, but
>> couldn't pinpoint any specific config that would cause these problems:
>> it just got less frequent until it stopped occurring. To rule out any
>> badly behaving drivers, we basically disabled everything but serial
>> and it just kept crashing.
>
> Adding also LAKML to Cc. Can you check if it starts happening if you
> leave out other omaps from .config other than CONFIG_ARCH_OMAP3?
> That's to compile code only for ARMv7 and leave out ARMv6.
>
> Also please check if leaving out CONFIG_SMP_ON_UP affects things.

Alright, will do.

>> Someone was having quite similar problems back in 2012, but other than
>> that we've found nothing:
>> http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/
>>
>> Has anyone seen this kind of issue before? Any ideas what might cause
>> this?
>
> If it starts happening after leaving out ARMv6 or SMP_ON_UP,
> it could be a cache bug or missing errata that's needed.

Right.

Regards,
Markku

>
> Regards,
>
> Tony
>
>
>> [0.00] Booting Linux on physical CPU 0x0
>> [0.00] Linux version 3.18.24 (markku@thinkpad) (gcc version 4.9.3 20141031 (prerelease) (Linaro GCC 2014.11) ) #2 PREEMPT Wed Nov 4 09:51:36 EET 2015
>> [0.00] CPU: ARMv7 Processor [411fc087] revision 7 (ARMv7), cr=10c5387d
>> [0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
>> [0.00] Machine model: TI AM3517 EVM (AM3517/05 TMDSEVM3517)
>> [0.00] cma: Reserved 8 MiB at 0x8f40
>> [0.00] Memory policy: Data cache writeback
>> [0.00] On node 0 totalpages: 65280
>> [0.00] free_area_init_node: node 0, pgdat c09be980, node_mem_map cfce7000
>> [0.00] Normal zone: 512 pages used for memmap
>> [0.00] Normal zone: 0 pages reserved
>> [0.00] Normal zone: 65280 pages, LIFO batch:15
>> [0.00] HighMem zone: 1048574 pages exceeds freesize 0
>> [0.00] CPU: All CPU(s) started in SVC mode.
>> [0.00] AM3517 ES1.1 (l2cache sgx neon )
>> [0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
>> [0.00] pcpu-alloc: [0] 0
>> [0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 64768
>> [0.00] Kernel command line: console=ttyO2,115200
>> [0.00] PID hash table entries: 1024 (order: 0, 4096 bytes)
>> [0.00] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
>> [0.00] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
>> [0.00] Memory: 239940K/261120K available (4809K kernel code, 341K rwdata, 1816K rodata, 2996K init, 353K bss, 21180K reserved, 0K highmem)
>> [0.00] Virtual kernel memory layout:
>> [0.00] vector : 0x - 0x1000 ( 4 kB)
>> [0.00] fixmap : 0xffc0 - 0xffe0 (2048 kB)
>> [0.00] vmalloc : 0xd080 - 0xff00 ( 744 MB)
>> [0.00] lowmem : 0xc000 - 0xd000 ( 256 MB)
>> [0.00] pkmap : 0xbfe0 - 0xc000 ( 2 MB)
>> [0.00] modules : 0xbf00 - 0xbfe
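
Before starting a multi-day stress run it may help to confirm that the .config under test really matches the experiment Tony suggested. The following is a minimal sketch, not an established tool. The function names are made up, and the list of "other omap" symbols is an assumption about which Kconfig options are relevant, with CONFIG_ARCH_OMAP2 taken to be the one that pulls in ARMv6.

```python
import re

# Assumed set of non-OMAP3 SoC symbols; adjust to the tree being tested.
OTHER_OMAPS = re.compile(
    r"CONFIG_(ARCH_OMAP[24]|SOC_OMAP5|SOC_AM33XX|SOC_AM43XX|SOC_DRA7XX)"
)


def enabled_symbols(config_text):
    """Return the set of CONFIG_* symbols set to y or m.
    Lines such as '# CONFIG_FOO is not set' are ignored."""
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if match:
            enabled.add(match.group(1))
    return enabled


def check_config(config_text):
    """Summarise the options discussed in this thread."""
    on = enabled_symbols(config_text)
    return {
        "omap3": "CONFIG_ARCH_OMAP3" in on,
        "smp_on_up": "CONFIG_SMP_ON_UP" in on,
        "other_omaps": sorted(s for s in on if OTHER_OMAPS.fullmatch(s)),
    }
```

Typical use would be `check_config(open(".config").read())`. For the ARMv7-only experiment one would expect `omap3` to be True, `smp_on_up` to be False and `other_omaps` to be empty.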
am35xx memory management issues
Hello everyone,

We have an am3517-based board and are experiencing sporadic corruption
of mm structures. We've had this problem for months now and haven't
really got to the bottom of it.

Our board is currently using 3.18.20, but with am3517-evm we've tried
pretty much everything between v3.14 and v4.2. So far we've been able
to reproduce it on am3517-evm, craneboard and beagleboard (rev. C3 and
C4). We have also tested am/dm37x-evm, am335x-evm and beagle bone
black, with no problems seen.

Usually the kernel panics in 'kernel BUG at mm/rmap.c:406!', but
occasionally there are 'BUG: Bad rss-counter state' prints followed by
a NULL pointer deref or another BUG statement in mm/slab.c. Sometimes
a spinlock lockup or an already unlocked spinlock is reported, so it
is quite random.

Reproducing can take from half an hour up to a few days. We are using
stress-ng with options:

    stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4

In our tests we have noticed that the kernel configuration affects the
frequency of the problem. So far we haven't seen any with
omap2plus_defconfig, but with a slimmer defconfig like the one we are
using for our board we can get it in a few hours. We bisected the
differences between our defconfig and omap2plus_defconfig, but
couldn't pinpoint any specific config that would cause these problems:
it just got less frequent until it stopped occurring. To rule out any
badly behaving drivers, we basically disabled everything but serial
and it just kept crashing.

Someone was having quite similar problems back in 2012, but other than
that we've found nothing:
http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/

Has anyone seen this kind of issue before? Any ideas what might cause
this?

Thanks,
Markku

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 3.18.24 (markku@thinkpad) (gcc version 4.9.3 20141031 (prerelease) (Linaro GCC 2014.11) ) #2 PREEMPT Wed Nov 4 09:51:36 EET 2015
[0.00] CPU: ARMv7 Processor [411fc087] revision 7 (ARMv7), cr=10c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
[0.00] Machine model: TI AM3517 EVM (AM3517/05 TMDSEVM3517)
[0.00] cma: Reserved 8 MiB at 0x8f40
[0.00] Memory policy: Data cache writeback
[0.00] On node 0 totalpages: 65280
[0.00] free_area_init_node: node 0, pgdat c09be980, node_mem_map cfce7000
[0.00] Normal zone: 512 pages used for memmap
[0.00] Normal zone: 0 pages reserved
[0.00] Normal zone: 65280 pages, LIFO batch:15
[0.00] HighMem zone: 1048574 pages exceeds freesize 0
[0.00] CPU: All CPU(s) started in SVC mode.
[0.00] AM3517 ES1.1 (l2cache sgx neon )
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 64768
[0.00] Kernel command line: console=ttyO2,115200
[0.00] PID hash table entries: 1024 (order: 0, 4096 bytes)
[0.00] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
[0.00] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
[0.00] Memory: 239940K/261120K available (4809K kernel code, 341K rwdata, 1816K rodata, 2996K init, 353K bss, 21180K reserved, 0K highmem)
[0.00] Virtual kernel memory layout:
[0.00] vector : 0x - 0x1000 ( 4 kB)
[0.00] fixmap : 0xffc0 - 0xffe0 (2048 kB)
[0.00] vmalloc : 0xd080 - 0xff00 ( 744 MB)
[0.00] lowmem : 0xc000 - 0xd000 ( 256 MB)
[0.00] pkmap : 0xbfe0 - 0xc000 ( 2 MB)
[0.00] modules : 0xbf00 - 0xbfe0 ( 14 MB)
[0.00] .text : 0xc0008000 - 0xc0680984 (6627 kB)
[0.00] .init : 0xc0681000 - 0xc096e000 (2996 kB)
[0.00] .data : 0xc096e000 - 0xc09c354c ( 342 kB)
[0.00] .bss : 0xc09c354c - 0xc0a1b97c ( 354 kB)
[0.00] Preemptible hierarchical RCU implementation.
[0.00] NR_IRQS:16 nr_irqs:16 16
[0.00] IRQ: Found an INTC at 0xfa20 (revision 4.0) with 96 interrupts
[0.00] Clocking rate (Crystal/Core/MPU): 26.0/332/600 MHz
[0.00] OMAP clockevent source: timer2 at 1300 Hz
[0.23] sched_clock: 32 bits at 13MHz, resolution 76ns, wraps every 330382100403ns
[0.58] OMAP clocksource: timer1 at 1300 Hz
[0.000598] Console: colour dummy device 80x30
[0.000635] Calibrating delay loop... 589.82 BogoMIPS (lpj=294912)
[0.008980] pid_max: default: 32768 minimum: 301
[0.009168] Security Framework initialized
[0.009264] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[0.009282] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[0.010313] CPU: Testing write buffer coherency: ok
[0.010936] Setting up static identity map for 0x80496c78 - 0x80496cd0
[0.013878] devtmpfs: initialized
[0.016530] VFP
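
Since the failure shows up under several different signatures, tallying them across captured serial logs from a test fleet could show whether one signature dominates. The following is a rough sketch only. The rmap and rss-counter patterns come from the strings quoted in the report; the slab, NULL-dereference and spinlock patterns are assumptions about how those messages are worded in the kernel log. The names `SIGNATURES` and `classify` are hypothetical.

```python
import re
from collections import Counter

SIGNATURES = {
    # Quoted in the report.
    "rmap_bug": re.compile(r"kernel BUG at mm/rmap\.c:\d+!"),
    "bad_rss": re.compile(r"BUG: Bad rss-counter state"),
    # Assumed wording, not quoted in the report.
    "slab_bug": re.compile(r"kernel BUG at mm/slab\.c:\d+!"),
    "null_deref": re.compile(r"NULL pointer dereference"),
    "spinlock": re.compile(r"BUG: spinlock (lockup|already unlocked)"),
}


def classify(log_text):
    """Count how many log lines match each known failure signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines built from the two messages quoted in the report.
sample = "\n".join([
    "kernel BUG at mm/rmap.c:406!",
    "BUG: Bad rss-counter state",
    "Console: colour dummy device 80x30",
])
print(dict(classify(sample)))
```

Feeding each unit's console capture through `classify` and summing the counters gives a per-signature histogram for the whole fleet.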