Re: [osv-dev] AArch64 debug build woes

Waldek Kozaczuk Mon, 15 Feb 2021 08:33:01 -0800

On Mon, Feb 15, 2021 at 02:19 Nadav Har'El <[email protected]> wrote:


> On Mon, Feb 15, 2021 at 7:43 AM Waldek Kozaczuk <[email protected]>
> wrote:
>
>>
>>
>> On Sunday, February 14, 2021 at 2:33:16 PM UTC-5 Nadav Har'El wrote:
>>
>>>
>>> You seem to be pushing registers on the stack here. Where is this stack?
>>> In x86, we had separate stacks for exceptions, for nested exceptions, and
>>> interrupts.
>>> Is this also true in the arm version?
>>>
>>
>> Good question. In our aarch64 port, unlike in x64, there is no dedicated
>> exception or interrupt stack. There is a single stack per thread where
>> everything happens. And this might be the issue.
>>
>> Somehow it works with '-O2', but maybe some of the aarch64 bugs which
>> we do not understand do happen because of the single stack.
>>
>
> As usual this is just a wild theory, but it's possible that O2 code uses
> fewer or more registers, or uses the stack more or less or differently.
>
Right.

>
>
>>
>>> This discussion of the stack made me think of another possible reason
>>> for losing data in functions.
>>> The red zone.
>>> Do we have a "red zone" on arm64 as well?
>>> Basically, the "red zone" is 128 bytes below the stack pointer that a
>>> function can use as scratch space, and it can use it for example to store
>>> some of the parameters if it needs the registers to store something else -
>>> without wasting time on instructions to change the stack pointer. If some
>>> interrupt or exception overwrites this redzone, we lose data.
>>> To avoid this, we usually had separate stacks for interrupts and
>>> exceptions and nested exceptions, but where we didn't want to do this,
>>> e.g., in syscalls, we had to skip the redzone (see for example commit
>>> 499b9433ae748b6c04dedc2125ea17010ffbdaf1).
>>>
>>
>> The ARMv8 Procedure Call Standard
>> (https://developer.arm.com/documentation/ihi0055/latest/) does not
>> mention any "red zone". However, the Windows (
>> https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#red-zone)
>> and Apple (
>> https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms)
>> extensions both mention a red zone: 16 bytes and 128 bytes respectively.
>> I could not find anything specific for Linux, though. Linux certainly
>> requires a 128-byte red zone on x86_64.
>>
>
> I suspect that gcc does *not* use the red zone on aarch64 - it seems there
> is no "-mno-red-zone" option, and
> https://github.com/iains/gcc-darwin-arm64 suggests that unlike Apple's
> compiler, gcc doesn't use a red zone - and your experiment below also
> suggests that this is not the problem. So that's probably (hopefully) not
> the problem, so there is one less reason to use separate stacks.
>
> By the way another consequence of using the user's stacks for interrupts,
> exceptions, etc., is that it becomes more important for all stacks to be
> big enough. I wonder if the problem could be that some of our thread stacks
> are too small. Maybe you can hack sched::thread to always use a larger
> minimum stack size and see if it helps?
>

So here is what I tried at the same time:
1) Enforced eager resolving of symbols (for one type of the crash).
2) Added a space of 128 bytes (also tried 256, 512, and even 4K) on the stack
BEFORE pushing the exception frame for both interrupts and other exceptions.
3) Added a space of 256 bytes on the stack AFTER pushing the exception
frame for both interrupts and exceptions.
4) Doubled the size of the default stack from 64K to 128K.

Same crashes as before, and they still differ between O0 and O1 just as they
did before my changes. So none of the above helps or changes anything.

So my theory was that either the exception frame overwrites the portion of
the stack where registers are restored from, or something overwrites part
of the exception frame. But the experiments above suggest that this
may not be the case.
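
To make that theory concrete, here is a toy Python model of it. This is not
OSv code: the stack size, the frame size, and the function names are all made
up for illustration. It only shows why skipping some bytes below sp before
pushing the exception frame would protect data that the interrupted code had
stored below sp without adjusting sp first.

```python
# Toy model of a downward-growing stack. Not OSv code; all sizes and
# names here are hypothetical and chosen only for illustration.
STACK_SIZE = 1024
FRAME_SIZE = 272  # made-up exception frame size


def take_interrupt(stack, sp, skip):
    """Push a fake exception frame below sp after skipping `skip` bytes."""
    top = sp - skip
    base = top - FRAME_SIZE
    stack[base:top] = b"\xEE" * FRAME_SIZE
    return base  # new stack pointer while the handler runs


def scratch_survives(skip, scratch_depth=64):
    """Return True if data stored below sp survives an interrupt."""
    stack = bytearray(STACK_SIZE)
    sp = 512
    # The interrupted code stored data below sp without moving sp yet.
    lo = sp - scratch_depth
    stack[lo:sp] = b"\xAA" * scratch_depth
    take_interrupt(stack, sp, skip)
    return stack[lo:sp] == b"\xAA" * scratch_depth


print(scratch_survives(skip=0))    # frame pushed right at sp
print(scratch_survives(skip=128))  # frame pushed 128 bytes lower
```

In this model the data is clobbered with skip=0 and preserved with skip=128,
which is the effect experiment 2 above was meant to have if this kind of
overwrite were the cause.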

Then what else? Compiler bug? I will try with an older version of 9.3.

>
>
>
>>
>> I devised a simple experiment: I subtracted 128 bytes from sp at the
>> beginning of the push_frame macro and added 128 bytes back at the end of
>> the pop_frame macro. That did not help in any way. I also tried 256 and
>> 512, to the same effect.
>>
>>
>>
>>> I have another wild guess below - caller-saved registers:
>>>
>>> The "typical" problem here (I don't know if it happens in your case, but
>>> it happened in the past in various cases)
>>> is that "something" (interrupt, exception, signal, etc.) gets called *in
>>> the middle* of the user's function code. That code did not know it was
>>> going to call a function, so it didn't save these
>>> caller-saved registers. This is why all of that
>>> asynchronous code needs to save all those caller-saved registers. In
>>> x86, we had these problems with the FPU
>>> and had to save the FPU state in a bunch of places. Maybe in aarch64 we
>>> need to save additional registers in
>>> the same place we saved the FPU state for x86?
>>>
>> There is the arm64 FPU save/restore code in OSv where it saves floating
>> point registers. But are you suggesting we save/restore extra registers in
>> there? But why if push_frame/pop_frame do that for us already when the
>> interrupt is taken?
>>
>
> OSv does this FPU save/restore in *more* than just interrupts. We also
> have exceptions, signal handlers, and SYSCALL, all of which can wind up
> calling code in the middle of user code. So the first code that leaves the
> user's code needs to save these registers. It sounds like you're doing this
> correctly for interrupts, but maybe it's missing for some other things?
>
> That being said, if this were something as "obvious" as not saving the
> registers, I would suspect this would have been more obvious and more
> frequent, and not specific to O1. So maybe that's not the problem.
>
>
>>
>> Now in arm64, which is RISC, the stack is manipulated very differently
>> than in x64. Very often, storing to or reading from the stack does not
>> change the stack pointer but merely references it, and the stack pointer
>> is adjusted at some other point (possibly earlier). So I wonder: if an
>> interrupt arrives in the middle of that, before the stack pointer is
>> adjusted, we might be pushing the frame at the wrong place and
>> overwriting some registers.
>>
>
> Hmm...
>
>
>> But then my experiment with adding/subtracting 128 or 256 bytes as I
>> described above should have helped.
>>
>
> Yes, it sounds like it would.
>
>
>>
>> BTW compiling with 'O1 -fcaller-saves' makes the crash happen in another
>> place.
>>
>>>
>>>
>>>> * call a function
>>>> * Restore any x0-x18 registers if saved
>>>>
>>>
>>>>
>>>> Callee:
>>>> * push lr and any x19-x30 registers, if used, on the stack
>>>> * execute code
>>>> * pop any x19-x30 registers pushed above from the stack
>>>>
>>>> Waldek
>>>>
>>>>>
>>>>>> I saw another test crashing in a similar way: the caller (another
>>>>>> test) would pass 3 arguments to a kernel function, and 2 of those
>>>>>> (non-addresses) were passed correctly, but the 3rd one, an address,
>>>>>> was not.
>>>>>>
>>>>>>
>>>>>> Any ideas what might be going on?
>>>>>>
>>>>>>
>>>>>> Waldek
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "OSv Development" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/osv-dev/4a97809f-d207-48b9-88e7-06e218e5d829n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/osv-dev/4a97809f-d207-48b9-88e7-06e218e5d829n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/CAL9cFfMtoq3k9Hm2oK9B8Y7fQr_M%2BouMw7ZmVMQB787gocs7%2Bg%40mail.gmail.com.
