Re: [linux-sunxi] Kernel crash in "cpu_freq"

Torsten Beyer Mon, 18 Jul 2022 05:13:53 -0700

Hi again,

I believe I found the place. Can you confirm that the OPP microvolt values 
in "arch/arm/boot/dts/sun7i-a20.dtsi" are the right place to raise the 
voltage for the lower frequencies?


cheers
-tb
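
As a rough aid for the OPP edit asked about above, this Python sketch works
out the numbers for the two experiments Samuel suggests in the quoted mail
below: raise the lower OPPs by 100 mV, and if that does not help, set every
OPP to 1.4 V. The table values, the function names and the use of 1.4 V as a
ceiling are assumptions made for illustration. The real frequency/voltage
pairs have to be read from the dtsi of the kernel being built.

```python
# Placeholder (frequency in kHz, voltage in microvolts) pairs.
# ASSUMPTION: these are illustrative values, not copied from the dtsi.
OPPS = [
    (960_000, 1_400_000),
    (720_000, 1_200_000),
    (528_000, 1_100_000),
    (312_000, 1_000_000),
]


def raise_low_opps(opps, delta_uv=100_000, cap_uv=1_400_000):
    """Add delta_uv to each voltage without exceeding cap_uv.

    Entries already at the cap stay unchanged, so only the lower
    OPPs are affected.
    """
    return [(khz, min(uv + delta_uv, cap_uv)) for khz, uv in opps]


def pin_all(opps, uv=1_400_000):
    """Give every OPP the same voltage (the second experiment)."""
    return [(khz, uv) for khz, _ in opps]


if __name__ == "__main__":
    for label, table in (("+100 mV", raise_low_opps(OPPS)),
                         ("all 1.4 V", pin_all(OPPS))):
        print(label)
        for khz, uv in table:
            print(f"  {khz:>7} kHz  {uv} uV")
```

The resulting microvolt numbers are what would be entered in the OPP table by
hand. The sketch does not parse or write device tree source.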

Torsten Beyer wrote on Monday, 18 July 2022 at 09:26:58 UTC+2:

> Hi Samuel,
>
> I am stuck trying to figure out how to increase the voltages. Can you 
> point me to some documentation or quickly explain how I would do that?
>
> regards
> Torsten
>
> [email protected] wrote on Saturday, 16 July 2022 at 06:16:16 UTC+2:
>
>> Hi Torsten, 
>>
>> On 7/13/22 3:18 AM, Torsten Beyer wrote: 
>> > Hi all, 
>> > 
>> > I am trying to debug a bug on an open source air navigation box for 
>> > gliders called openvario <https://www.openvario.org/doku.php>. It is 
>> > based on a cubieboard (A20) plus some additional serial connections 
>> > and an optional sensor board for various flight related pressures. 
>> > 
>> > System runs on kernel 5.18.5 generated using Yocto 4.0 kirkstone. The 
>> > system tends to run for a couple of hours and then freezes/crashes. 
>> > At the bottom of this post I have pasted a typical kernel debug 
>> > output once these freezes happen. The crash always happens in the 
>> > cpu_freq driver. If I set cpu frequency to a fixed frequency (setting 
>> > min=max frequency) those crashes disappear. This seems to be a 
>> > workaround, at the cost of a fixed CPU speed. 
>> > 
>> > So it _seems_ the crash is caused by cpu_freq trying to change the 
>> > cpu frequency (at least at some point in time). 
>> > 
>> > To be honest, I am rather clueless on how to go about finding the 
>> > root of this issue, let alone fixing it. So I thought I'd ask around 
>> > here whether this bug somehow looks familiar and may have been 
>> > tackled (or even fixed) previously (didn't find anything, though, via 
>> > the search function). In other words: I am thankful for any hint 
>> > people may be able to give me to get nearer to a fix.  
>>
>> I have not seen something like this before. It looks like hardware 
>> flakiness. Can you provide a disassembly of ccu_div_recalc_rate 
>> from the kernel this splat came from, to confirm my analysis? 
>>
>> > thanks for any pointers 
>> > Torsten 
>> > 
>> > [26996.004010] Unable to handle kernel paging request at virtual address 08d80050
>> > [26996.011337] [08d80050] *pgd=00000000 
>> > [26996.014952] Internal error: Oops: 5 [#1] SMP ARM 
>> > [26996.019590] Modules linked in: 
>> > [26996.022663] CPU: 1 PID: 95 Comm: sugov:0 Not tainted 5.18.5 #1 
>> > [26996.028509] Hardware name: Allwinner sun7i (A20) Family 
>> > [26996.033738] PC is at ccu_div_recalc_rate+0x48/0x90 
>> > [26996.038555] LR is at ccu_mux_helper_apply_prediv+0x18/0x1c 
>>
>> The crash is between the calls to ccu_mux_helper_apply_prediv and 
>> divider_recalc_rate, so we are loading arguments for the call to 
>> divider_recalc_rate. 
>>
>> > [26996.044054] pc : [] lr : [] psr: 600b0113 
>> > [26996.050326] sp : f09e5dc8 ip : 00000000 fp : c1938200 
>> > [26996.055554] r10: c1867440 r9 : 1f78a400 r8 : c1302d00 
>> > [26996.060781] r7 : 1312d000 r6 : 1f78a400 r5 : 00000002 r4 : 08d80084 
>>
>> Assuming r4 is "hw", then the faulting address is cd->div.flags. 
>> This is weird because r5 already contains cd->div.width... 
>>
>> > [26996.067311] r3 : 00000000 r2 : ffffffff r1 : 00000001 r0 : 1f78a400 
>>
>> ..and r3 already contains cd->div.table. So we were already able 
>> to access parts of the struct both before and after the faulting 
>> address. 
>>
>> > [26996.073843] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
>> > [26996.080985] Control: 10c5387d Table: 41ff006a DAC: 00000051 
>> > [26996.086733] Register r0 information: non-paged memory 
>> > [26996.091799] Register r1 information: non-paged memory 
>> > [26996.096858] Register r2 information: non-paged memory 
>> > [26996.101915] Register r3 information: NULL pointer 
>> > [26996.106627] Register r4 information: non-paged memory 
>> > [26996.111688] Register r5 information: non-paged memory 
>> > [26996.116746] Register r6 information: non-paged memory 
>> > [26996.121805] Register r7 information: non-paged memory 
>> > [26996.126863] Register r8 information: slab kmalloc-128 start c1302d00 pointer offset 0 size 128
>> > [26996.135514] Register r9 information: non-paged memory
>> > [26996.140574] Register r10 information: slab task_struct start c1867440 pointer offset 0
>> > [26996.148517] Register r11 information: slab kmalloc-128 start c1938200 pointer offset 0 size 128
>> > [26996.157244] Register r12 information: NULL pointer 
>> > [26996.162049] Process sugov:0 (pid: 95, stack limit = 0xf4bf205c) 
>> > [26996.167985] Stack: (0xf09e5dc8 to 0xf09e6000) 
>> > [26996.172361] 5dc0: c0d81584 c03db530 00000000 1f78a400 c1355700 c03d181c
>>
>> What I think is happening is that the value in r4 got corrupted from 
>> 0xc0d81584 (the saved value on the top of the stack) to 0x08d80084. 
>>
>> Can you try increasing the voltage of the lower OPPs by 100 mV? And 
>> if that doesn't work, try setting all of the OPPs to 1.4 V. That 
>> should rule out any instability due to an insufficient CPU supply 
>> voltage, and also due to any delay in slewing the regulator output. 
>>
>> Regards, 
>> Samuel 
>>
>> > [26996.180547] 5de0: c1355600 c1355700 1f78a400 c03d34ec 00000000 c1355600 1f78a400 39387000
>> > [26996.188733] 5e00: c1302d00 1f78a400 c1867440 c03d3554 00000000 c1302d00 016e3600 39387000
>> > [26996.196917] 5e20: c1302d00 1f78a400 c1867440 c03d3554 c1355600 00000000 1f78a400 c1867440
>> > [26996.205101] 5e40: c1302d00 1f78a400 c1867440 c03d39f0 1f78a400 00000000 ffffffff 1f78a400
>> > [26996.213287] 5e60: c0d81bd0 df7bf617 c193a340 1f78a400 1f78a400 c1938300 ef7dc050 1f78a400
>> > [26996.221474] 5e80: c1867440 c03d3c28 c18b3b00 c1938500 1f78a400 c1938300 ef7dc050 c06122a4
>> > [26996.229659] 5ea0: c1938300 00000001 ffffffff df7bf617 c0d81bd0 c18b3b00 ef7dc050 1f78a400
>> > [26996.237844] 5ec0: 00000007 c1867440 c1938500 c0db652c 00080e80 c0612674 00000000 c0db617c
>> > [26996.246030] 5ee0: 1f78a400 df7bf617 c1812800 c1812800 00000000 c0dfd944 000ea600 00000000
>> > [26996.254214] 5f00: 00000002 c0617054 00000001 c1867440 00000000 00000000 f09e5f5c c1812800
>> > [26996.262400] 5f20: 000ea600 00080e80 00000024 df7bf617 00000004 c184ba00 c184ba14 00000000
>> > [26996.270585] 5f40: 00080e80 c184ba2c 00000001 c0a34650 00000000 c0159c98 00000000 c184ba28
>> > [26996.278770] 5f60: c1867440 c0dea144 c184ba2c c0136954 c193a500 c1867440 c01368e0 c184ba28
>> > [26996.286955] 5f80: c13c2100 f0891c44 00000000 c0138194 c193a500 c01380c4 00000000 00000000
>> > [26996.295138] 5fa0: 00000000 00000000 00000000 c0100148 00000000 00000000 00000000 00000000
>> > [26996.303321] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> > [26996.311505] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
>> > [26996.319695] ccu_div_recalc_rate from clk_recalc+0x34/0x78 
>> > [26996.325215] clk_recalc from clk_change_rate+0xa4/0x29c 
>> > [26996.330461] clk_change_rate from clk_change_rate+0x10c/0x29c 
>> > [26996.336226] clk_change_rate from clk_change_rate+0x10c/0x29c 
>> > [26996.341991] clk_change_rate from clk_core_set_rate_nolock+0x16c/0x234
>> > [26996.348539] clk_core_set_rate_nolock from clk_set_rate+0x30/0x154
>> > [26996.354741] clk_set_rate from _set_opp+0x268/0x550
>> > [26996.359644] _set_opp from dev_pm_opp_set_rate+0xe8/0x20c
>> > [26996.365062] dev_pm_opp_set_rate from __cpufreq_driver_target+0x584/0x6e4
>> > [26996.371876] __cpufreq_driver_target from sugov_work+0x48/0x54 
>> > [26996.377741] sugov_work from kthread_worker_fn+0x74/0x1a4 
>> > [26996.383167] kthread_worker_fn from kthread+0xd0/0xec 
>> > [26996.388242] kthread from ret_from_fork+0x14/0x2c 
>> > [26996.392967] Exception stack(0xf09e5fb0 to 0xf09e5ff8) 
>> > [26996.398032] 5fa0: 00000000 00000000 00000000 00000000 
>> > [26996.406216] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> > [26996.414398] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>> > [26996.421027] Code: e0055231 e244102c e3e02000 eb0001f3 (e5143034) 
>>
>
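
The register analysis in the quoted mail can be cross-checked with a few
lines of Python. This is only a sketch: it decodes the faulting instruction
word from the "Code:" line (the value in parentheses) as an ARM load with an
immediate offset, recomputes the address it would touch from r4, and compares
r4 with the saved value at the top of the stack. The helper name is made up
for this note, and reading the instruction as a plain A32 LDR is an
assumption based on "ISA ARM" in the log.

```python
def decode_ldr_str_imm(word):
    """Decode an A32 LDR/STR with a 12-bit immediate offset (offset form)."""
    if (word >> 25) & 0b111 != 0b010:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    if not (word >> 24) & 1:
        raise ValueError("post-indexed form is not handled in this sketch")
    imm12 = word & 0xFFF
    sign = 1 if (word >> 23) & 1 else -1   # U bit: add or subtract offset
    return {
        "load": bool((word >> 20) & 1),    # L bit: 1 = LDR, 0 = STR
        "rn": (word >> 16) & 0xF,          # base register
        "rd": (word >> 12) & 0xF,          # destination register
        "offset": sign * imm12,
    }


# Values taken from the oops above.
INSN = 0xE5143034        # faulting instruction, "(e5143034)" in the Code: line
R4 = 0x08D80084          # r4 at the time of the crash
STACK_TOP = 0xC0D81584   # first word of the stack dump
FAULT_ADDR = 0x08D80050  # "virtual address 08d80050"

d = decode_ldr_str_imm(INSN)
effective = (R4 + d["offset"]) & 0xFFFFFFFF
diff = R4 ^ STACK_TOP

if __name__ == "__main__":
    print(f"ldr r{d['rd']}, [r{d['rn']}, #{d['offset']}]")
    print(f"effective address: {effective:#010x}")
    print(f"bits that differ between r4 and the stack copy: {diff:#010x}")
```

With the values from the log, the instruction decodes to a load into r3 from
r4 minus 0x34, which gives exactly the reported faulting address 08d80050.
The XOR of r4 with the stack copy is 0xc8001500, so the two values differ in
six bits.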

