Re: am35xx memory management issues

2015-11-24 Thread Markku Ahvenjärvi
Hi Tony,

On 13.11.2015 15:05, Markku Ahvenjärvi wrote:
> Hi,
> 
> On 12.11.2015 19:06, Tony Lindgren wrote:
>> Hi,
>>
>> * Markku Ahvenjärvi <markku.ahvenja...@nomovok.com> [151112 07:26]:
>>> Hello everyone,
>>>
>>> We have an am3517-based board and are experiencing sporadic corruption of
>>> mm structures. We've had this problem for months now and haven't really
>>> got to the bottom of it.
>>>
>>> Our board is currently using 3.18.20, but with am3517-evm we've tried 
>>> pretty much everything between v3.14 and v4.2. So far we've been able to 
>>> reproduce it on am3517-evm, craneboard and beagleboard (rev. C3 and C4). We 
>>> have also tested am/dm37x-evm, am335x-evm and BeagleBone Black, with no 
>>> problems seen.
>>>
>>> Usually the kernel panics with 'kernel BUG at mm/rmap.c:406!', but 
>>> occasionally there are 'BUG: Bad rss-counter state' prints followed by a 
>>> NULL pointer dereference or another BUG statement in mm/slab.c. Sometimes 
>>> a spinlock lockup or an already-unlocked spinlock is reported, so it is 
>>> quite random.
>>>
>>> Reproducing it can take from half an hour up to a few days. We are using 
>>> stress-ng with these options:
>>> stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4
>>>
>>> In our tests we have noticed that the kernel configuration affects the 
>>> frequency of the problem. So far we haven't seen any panics with 
>>> omap2plus_defconfig, but with a slimmer defconfig like the one we are 
>>> using for our board we can get it in a few hours. We bisected between our 
>>> defconfig and omap2plus_defconfig, but couldn't pinpoint any specific 
>>> config option that would cause these problems: it just got less frequent 
>>> until it stopped occurring. To rule out any badly behaving drivers, we 
>>> basically disabled everything but serial and it just kept crashing.
>>
>> Adding also LAKML to Cc. Can you check if it starts happening if you
>> leave out other omaps from .config other than CONFIG_ARCH_OMAP3?
>> That's to compile code only for ARMv7 and leave out ARMv6.
>>
>> Also please check if leaving out CONFIG_SMP_ON_UP affects things.
> 
> Alright, will do.

We've been testing omap2plus_defconfig without the other omaps and without 
CONFIG_SMP_ON_UP. So far we haven't seen any panics, but I've had only a few 
units testing it.
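
For anyone who wants to repeat that test, here is a rough sketch of the config 
change as a Kconfig fragment. The symbol names are my assumption from 
arch/arm/mach-omap2/Kconfig in the 3.18 tree and should be checked against the 
kernel actually being built. Since SMP_ON_UP depends on SMP, the sketch also 
turns off CONFIG_SMP, which should be acceptable on the single-core 
Cortex-A8. The helper name omap3_only_fragment is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a Kconfig fragment that keeps only OMAP3
# (ARMv7) support and drops SMP/SMP_ON_UP.
# The symbol list is an assumption; verify it against your kernel tree.
omap3_only_fragment() {
    for sym in ARCH_OMAP2 ARCH_OMAP4 SOC_OMAP5 SOC_AM33XX SOC_AM43XX \
               SOC_DRA7XX SMP SMP_ON_UP; do
        # Kconfig's canonical spelling for a disabled option
        printf '# CONFIG_%s is not set\n' "$sym"
    done
    printf 'CONFIG_ARCH_OMAP3=y\n'
}

# Possible usage inside a kernel tree (not run here):
#   make omap2plus_defconfig
#   omap3_only_fragment > omap3-only.cfg
#   scripts/kconfig/merge_config.sh .config omap3-only.cfg
```

It is worth comparing the resulting .config afterwards, because Kconfig may 
re-enable or drop symbols whose dependencies changed.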

Meanwhile we've been testing our custom board with a configuration that is 
quite close to omap2plus_defconfig, including the other omaps and 
CONFIG_SMP_ON_UP. We've had a couple of panics, so it seems that these options 
don't affect the problem. We had 15 units running stress-ng and it took ~8 days 
until we saw the first panic, so if omap2plus_defconfig is affected it is 
quite rare.
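
To put the numbers above in perspective, a back-of-the-envelope sketch: 15 
units for ~8 days is about 120 unit-days of exposure before the first panic. 
A smaller test group needs proportionally longer to reach the same exposure. 
The group size of 3 below is only an assumed example for "a few units", and a 
single panic is of course a very rough basis for any rate estimate.

```shell
#!/bin/sh
# unit_days UNITS DAYS -> total exposure in unit-days
unit_days() {
    echo $(( $1 * $2 ))
}

# days_needed EXPOSURE UNITS -> days UNITS boards must run to reach
# EXPOSURE unit-days (rounded up)
days_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

exposure=$(unit_days 15 8)   # 15 units, ~8 days until first panic
echo "exposure before first panic: ${exposure} unit-days"
# 3 units is a hypothetical group size, not a number from our test setup
echo "days for 3 units to match:   $(days_needed "$exposure" 3)"
```

So a clean run of a week or two on a handful of boards does not say much 
either way.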

Any other suggestions?

Regards,

Markku

> 
>>> Someone was having quite similar problems back in 2012, but other than that 
>>> we've found nothing:
>>> http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/
>>>
>>> Has anyone seen this kind of issue before? Any ideas what might cause this?
>>
>> If it starts happening after leaving out ARMv6 or SMP_ON_UP,
>> it could be a cache bug or missing errata that's needed.
> 
> Right.
> 
> Regards,
> 
> Markku
> 
>>
>> Regards,
>>
>> Tony
>>
>>

Re: am35xx memory management issues

2015-11-13 Thread Markku Ahvenjärvi
Hi,

On 12.11.2015 19:06, Tony Lindgren wrote:
> Hi,
> 
> * Markku Ahvenjärvi <markku.ahvenja...@nomovok.com> [151112 07:26]:
>> Hello everyone,
>>
>> We have an am3517-based board and are experiencing sporadic corruption of mm 
>> structures. We've had this problem for months now and haven't really got to 
>> the bottom of it.
>>
>> Our board is currently using 3.18.20, but with am3517-evm we've tried pretty 
>> much everything between v3.14 and v4.2. So far we've been able to reproduce 
>> it on am3517-evm, craneboard and beagleboard (rev. C3 and C4). We have also 
>> tested am/dm37x-evm, am335x-evm and BeagleBone Black, with no problems seen.
>>
>> Usually the kernel panics with 'kernel BUG at mm/rmap.c:406!', but 
>> occasionally there are 'BUG: Bad rss-counter state' prints followed by a 
>> NULL pointer dereference or another BUG statement in mm/slab.c. Sometimes a 
>> spinlock lockup or an already-unlocked spinlock is reported, so it is quite 
>> random.
>>
>> Reproducing it can take from half an hour up to a few days. We are using 
>> stress-ng with these options:
>> stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4
>>
>> In our tests we have noticed that the kernel configuration affects the 
>> frequency of the problem. So far we haven't seen any panics with 
>> omap2plus_defconfig, but with a slimmer defconfig like the one we are using 
>> for our board we can get it in a few hours. We bisected between our 
>> defconfig and omap2plus_defconfig, but couldn't pinpoint any specific config 
>> option that would cause these problems: it just got less frequent until it 
>> stopped occurring. To rule out any badly behaving drivers, we basically 
>> disabled everything but serial and it just kept crashing.
> 
> Adding also LAKML to Cc. Can you check if it starts happening if you
> leave out other omaps from .config other than CONFIG_ARCH_OMAP3?
> That's to compile code only for ARMv7 and leave out ARMv6.
> 
> Also please check if leaving out CONFIG_SMP_ON_UP affects things.

Alright, will do.

>> Someone was having quite similar problems back in 2012, but other than that 
>> we've found nothing:
>> http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/
>>
>> Has anyone seen this kind of issue before? Any ideas what might cause this?
> 
> If it starts happening after leaving out ARMv6 or SMP_ON_UP,
> it could be a cache bug or missing errata that's needed.

Right.

Regards,

Markku

> 
> Regards,
> 
> Tony
> 
> 

am35xx memory management issues

2015-11-12 Thread Markku Ahvenjärvi
Hello everyone,

We have an am3517-based board and are experiencing sporadic corruption of mm 
structures. We've had this problem for months now and haven't really got to 
the bottom of it.

Our board is currently using 3.18.20, but with am3517-evm we've tried pretty 
much everything between v3.14 and v4.2. So far we've been able to reproduce it 
on am3517-evm, craneboard and beagleboard (rev. C3 and C4). We have also tested 
am/dm37x-evm, am335x-evm and BeagleBone Black, with no problems seen.

Usually the kernel panics with 'kernel BUG at mm/rmap.c:406!', but occasionally 
there are 'BUG: Bad rss-counter state' prints followed by a NULL pointer 
dereference or another BUG statement in mm/slab.c. Sometimes a spinlock lockup 
or an already-unlocked spinlock is reported, so it is quite random.
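
Since the symptoms vary from run to run, it can help to tally them per board 
from the captured serial logs. Below is a minimal sketch of such a tally, not 
the tooling we actually use. The function name classify_oops and the labels 
are made up, and the patterns are assumptions based on the usual wording of 
these kernel messages, so adjust them to what your console really prints.

```shell
#!/bin/sh
# Hypothetical helper: read a console log on stdin and print
# "<label> <count>" for each failure signature seen, sorted by label.
classify_oops() {
    awk '
        /kernel BUG at mm\/rmap\.c/                        { n["rmap-bug"]++ }
        /BUG: Bad rss-counter state/                       { n["rss-counter"]++ }
        /kernel BUG at mm\/slab\.c/                        { n["slab-bug"]++ }
        /BUG: spinlock (lockup|already unlocked)/          { n["spinlock"]++ }
        /Unable to handle kernel NULL pointer dereference/ { n["null-deref"]++ }
        END { for (k in n) print k, n[k] }
    ' | sort
}

# Possible usage (console.log is a placeholder file name):
#   classify_oops < console.log
```

A log with no matching lines simply produces no output.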

Reproducing it can take from half an hour up to a few days. We are using 
stress-ng with these options:
stress-ng --cpu 1 --vm 3 --vm-bytes 64M --fork 4

In our tests we have noticed that the kernel configuration affects the 
frequency of the problem. So far we haven't seen any panics with 
omap2plus_defconfig, but with a slimmer defconfig like the one we are using for 
our board we can get it in a few hours. We bisected between our defconfig and 
omap2plus_defconfig, but couldn't pinpoint any specific config option that 
would cause these problems: it just got less frequent until it stopped 
occurring. To rule out any badly behaving drivers, we basically disabled 
everything but serial and it just kept crashing.

Someone was having quite similar problems back in 2012, but other than that 
we've found nothing:
http://thread.gmane.org/gmane.linux.ports.arm.omap/78039/

Has anyone seen this kind of issue before? Any ideas what might cause this?

Thanks,
Markku

[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 3.18.24 (markku@thinkpad) (gcc version 4.9.3 
20141031 (prerelease) (Linaro GCC 2014.11) ) #2 PREEMPT Wed Nov 4 09:51:36 EET 
2015
[0.00] CPU: ARMv7 Processor [411fc087] revision 7 (ARMv7), cr=10c5387d
[0.00] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing 
instruction cache
[0.00] Machine model: TI AM3517 EVM (AM3517/05 TMDSEVM3517)
[0.00] cma: Reserved 8 MiB at 0x8f40
[0.00] Memory policy: Data cache writeback
[0.00] On node 0 totalpages: 65280
[0.00] free_area_init_node: node 0, pgdat c09be980, node_mem_map 
cfce7000
[0.00]   Normal zone: 512 pages used for memmap
[0.00]   Normal zone: 0 pages reserved
[0.00]   Normal zone: 65280 pages, LIFO batch:15
[0.00]   HighMem zone: 1048574 pages exceeds freesize 0
[0.00] CPU: All CPU(s) started in SVC mode.
[0.00] AM3517 ES1.1 (l2cache sgx neon )
[0.00] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[0.00] pcpu-alloc: [0] 0
[0.00] Built 1 zonelists in Zone order, mobility grouping on.  Total 
pages: 64768
[0.00] Kernel command line: console=ttyO2,115200
[0.00] PID hash table entries: 1024 (order: 0, 4096 bytes)
[0.00] Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
[0.00] Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
[0.00] Memory: 239940K/261120K available (4809K kernel code, 341K 
rwdata, 1816K rodata, 2996K init, 353K bss, 21180K reserved, 0K highmem)
[0.00] Virtual kernel memory layout:
[0.00] vector  : 0x - 0x1000   (   4 kB)
[0.00] fixmap  : 0xffc0 - 0xffe0   (2048 kB)
[0.00] vmalloc : 0xd080 - 0xff00   ( 744 MB)
[0.00] lowmem  : 0xc000 - 0xd000   ( 256 MB)
[0.00] pkmap   : 0xbfe0 - 0xc000   (   2 MB)
[0.00] modules : 0xbf00 - 0xbfe0   (  14 MB)
[0.00]   .text : 0xc0008000 - 0xc0680984   (6627 kB)
[0.00]   .init : 0xc0681000 - 0xc096e000   (2996 kB)
[0.00]   .data : 0xc096e000 - 0xc09c354c   ( 342 kB)
[0.00]   .bss : 0xc09c354c - 0xc0a1b97c   ( 354 kB)
[0.00] Preemptible hierarchical RCU implementation.
[0.00] NR_IRQS:16 nr_irqs:16 16
[0.00] IRQ: Found an INTC at 0xfa20 (revision 4.0) with 96 
interrupts
[0.00] Clocking rate (Crystal/Core/MPU): 26.0/332/600 MHz
[0.00] OMAP clockevent source: timer2 at 1300 Hz
[0.23] sched_clock: 32 bits at 13MHz, resolution 76ns, wraps every 
330382100403ns
[0.58] OMAP clocksource: timer1 at 1300 Hz
[0.000598] Console: colour dummy device 80x30
[0.000635] Calibrating delay loop... 589.82 BogoMIPS (lpj=294912)
[0.008980] pid_max: default: 32768 minimum: 301
[0.009168] Security Framework initialized
[0.009264] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
[0.009282] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
[0.010313] CPU: Testing write buffer coherency: ok
[0.010936] Setting up static identity map for 0x80496c78 - 0x80496cd0
[0.013878] devtmpfs: initialized
[0.016530] VFP