> On 2022/Apr/25, at 14:09, Sergei Trofimovich <sly...@gmail.com> wrote:
> 
> On Mon, 25 Apr 2022 15:07:58 +0000
> Pedro Miguel Justo <pm...@texair.net> wrote:
> 
>>> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pm...@texair.net> wrote:
>>> 
>>> 
>>> 
>>>> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.schei...@web.de> wrote:
>>>> 
>>>> Hi guys,
>>>> 
>>>> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:  
>>>>>> From what I can understand by the information in the bugcheck, this is 
>>>>>> somewhat related to a violation
>>>>>> in parameter copy from user to kernel during some boot-time, crypto, 
>>>>>> self-test. Does that sound right?
>>>>>> If that is the case, how would this be related to FW?  
>>>>> 
>>>>> I'm not claiming that it must be related to the firmware, I'm just saying 
>>>>> that I don't see this problem
>>>>> on my RX2660 at all and I have even reinstalled it recently with one of 
>>>>> the latest firmware images
>>>>> without having to pass any parameter to the command line.  
>>>> 
>>>> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
>>>> left and Montvale right.
>>>> 
>>>> But could still be multiple other reasons we haven't looked at yet in
>>>> detail:
>>>> 
>>>> * amount of memory installed
>>>> * SMT enabled or not
>>>> * number of processor modules installed
>>>> 
>>>> It might be possible for me to check on my rx2660s (one with Montvale
>>>> and one with Montecito(s)) tomorrow. I will then also look at my other
>>>> Itanium gear and gather relevant information.
>>>> 
>>> 
>>> Yes, this sounds more likely to me too.
>>> 
>>> The crypto self-tests seem to be an innocent bystander here. I tried 
>>> booting the most recent kernel with the option “cryptomgr.notests” and it 
>>> went much farther. Alas it still failed with another buffer copy validation 
>>> for a different caller altogether:
>>> 
>>> [    3.836466]  [<a000000101353690>] usercopy_abort+0x120/0x130
>>> [    3.836466]                                 sp=e0000001000cfdf0 bsp=e0000001000c9388
>>> [    3.836466]  [<a0000001004c5660>] __check_object_size+0x3c0/0x420
>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c9350
>>> [    3.836466]  [<a000000100570030>] sys_getcwd+0x250/0x420
>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c92c8
>>> [    3.836466]  [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
>>> [    3.836466]                                 sp=e0000001000cfe30 bsp=e0000001000c92c8
>>> [    3.836466]  [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
>>> [    3.836466]                                 sp=e0000001000d0000 bsp=e0000001000c92c8
>>> 
>>> This suggests the bug might be in the logic validating these buffers 
>>> against the allocations (heap, span, etc).
>>> 
>>> I don’t know why hardened_usercopy=off is not being observed by the kernel.
>>> As a work-around I am compiling myself a new kernel with
>>> CONFIG_HARDENED_USERCOPY disabled at the source.
>>> 
>> 
>> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5
>> (2019-06-19) ia64 GNU/Linux", things are still not 100%. A few hours into
>> building the kernel, it started crashing, also on usercopy validations but,
>> this time, the other way around. Because it was the other way around, it led
>> to process termination instead of a full-blown bugcheck. This may or may not
>> be related: it could very well be a different bug that happens to manifest
>> itself around the same validation.
>> 
>>  CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822be.o
>>  LD [M]  drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>>  CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822c.o
>> Segmentation fault
>> make[5]: *** [scripts/Makefile.build:293: drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
>> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
>> make[4]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek/rtw88] Error 2
>> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] Error 2
>> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
>> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
>> make: *** [Makefile:1855: drivers] Error 2
>> pmsjt@debian:~/linux-source-5.17$ make
>> 
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1916912, size 8)!
>> 
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1818608, size 8)!
>>  HOSTCC  scripts/sign-file
>>  CALL    scripts/checksyscalls.sh
>> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>>  CALL    scripts/atomic/check-atomics.sh
>>  CHK     include/generated/compile.h
>> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o] Segmentation fault
>> 
>> Message from syslogd@debian at Apr 25 07:58:11 ...
>> kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1933296, size 8)!
>> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
>> make: *** [Makefile:1855: arch/ia64] Error 2
> 

Hi Sergei

> In my understanding hardened_usercopy=on is completely broken on ia64
> today. It can't run any userspace. Even init process would not survive
> machine boot. At least that's what I experienced on rx3600.
> 
> Thus I think if your system survives that much time I would guess
> that you have hardened_usercopy=off in full effect at least at boot.
> 

I want to make sure there is no confusion here. My system only 'survives' this
much when I am using the 4.19 kernel (even when hardened_usercopy=off is not
present). With more recent kernels, the system bugchecks very early in boot
even if hardened_usercopy=off is present.

> I would speculate it's some kind of memory corruption around
> 'bypass_usercopy_checks' key.
> 
> Worth adding a few printk()s to mm/usercopy.c into 'usercopy_abort()'
> and into 'set_hardened_usercopy()' just to make sure 'bypass_usercopy_checks'
> has expected 'true' setting at boot time and at crash time.

Right - we definitely need more context about the root cause and
characteristics of the bug. When the failure happens, is the (pointer, range)
of the copy really out of whack, or is the validation code misreading the
boundaries and failing over-eagerly?
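
To make the two possibilities concrete, here is a minimal user-space sketch of a half-open interval overlap test of the kind such a validation relies on. It is only an illustration: the helper names and every address are made up, and nothing here is taken from mm/usercopy.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): does the copy [ptr, ptr + n)
 * intersect the protected region [low, high)?
 */
bool range_overlaps(uintptr_t ptr, uintptr_t n,
                    uintptr_t low, uintptr_t high)
{
    uintptr_t end = ptr + n;    /* one past the last byte copied */

    if (end < ptr)              /* address wrapped around: reject */
        return true;
    return ptr < high && end > low;
}

/* Illustrative cases; every address below is invented. */
void range_overlaps_selfcheck(void)
{
    /* Case 1: the copy really lands inside the region, so firing is correct. */
    assert(range_overlaps(0x1800, 8, 0x1000, 0x2000));

    /* A buffer well outside the region passes when the bounds are right. */
    assert(!range_overlaps(0x5000, 8, 0x1000, 0x2000));

    /* Case 2: same buffer, but an overestimated upper bound makes the
     * check fire even though the copy itself is fine. */
    assert(range_overlaps(0x5000, 8, 0x1000, 0x6000));
}
```

Telling these two cases apart on the real machine needs the actual pointer, size and region bounds at the moment of failure, which is what the extra printk()s would give us.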

> 
> -- 
> 
>  Sergei
