Hi Geng,

On 16/04/2020 13:07, gengdongjiu wrote:
> On 2020/4/14 20:18, James Morse wrote:
>> On 11/04/2020 13:17, Dongjiu Geng wrote:
>>> When the RAS Extension is implemented, b0b011000, 0b011100,
>>> 0b011101, 0b011110, and 0b011111, are not used and reserved
>>> to the DFSC[5:0] of ESR_ELx, but the code still checks these
>>> unused bits, so remove them.
>>
>> They aren't unused: CPUs without the RAS extensions may still generate these.
>>
>> kvm_handle_guest_abort() wants to know if this is an external abort.
>> KVM doesn't really care if the CPU has the RAS extensions or not, its the 
>> arch code's job
>> to sort all that out.
> 
> No, handle_guest_sea() ---> ghes_notify_sea  ---> apei driver

You missed the arch code's apei_claim_sea(). This is where kvm washes its hands 
of the
exception, its up to the arch code to deal with, in the same way it deals with 
errors for
user-space.


> If it is an  external abort, it will call apei driver to handle it, but it 
> should be only SEA will call the apei driver.
> other type of external abort should not call apei driver.
> I am not see arch code sort all that out.

The other kind is _asynchronous_ external abort, which doesn't get handled in 
here.

Do you mean 'Synchronous external abort on translation table walk, nth level'?
KVM shouldn't care, this is still up to the arch code to deal with.

Do you mean 'Synchronous parity or ECC error on memory access, not on 
translation table
walk'? (and all its translation table walk friends...)

These really are still external abort!
See shared/functions/aborts/IsExternalAbort() in the psuedo code. (J12-7735 of 
DDI0487F.a)

This means they are trapped by SCR_EL3.EA. On a firmware-first system if we see 
one of
these, its because firmware re-injected it. The arch code needs to get APEI to 
check for
CPER records when it sees one. (KVM ... doesn't care)

(I agree not having 'external' in the name is counter-intuitive. I think this 
is because
the component that triggers them is 'in' the CPU, (e.g. the L1 cache). This is 
how you can
know it was a parity or ECC error, instead of 'something outside' (i.e. 
external). An OS
that is deeply tied to the CPU micro-architecture could use the difference to 
read imp-def
sysregs for one, and poke around in imp-def mmio for the external case. Linux 
isn't tied
to the micro-architecture, so ignores this 'internal/external' hint.)


>>> If the handling of guest ras data error fails, it should
>>> inject data instead of SError to let the guest recover as
>>> much as possible.
> 
> In some hardware platform, it supports RAS, but the RAS error address will be 
> not recorded,

In what circumstances does this happen?

Wouldn't these errors all be uncontained? (in which case the host should 
panic()).
e.g. A write went to the wrong address, we don't know where it went...

...

Is this because your platform describes memory-errors that happen inside the 
CPU as
processor errors, instead of memory-errors?

Linux's APEI code ignores them, because it doesn't know what they are. This 
means you are
missing the whole memory_failure(), page-unmapping, user-space signalling ... 
and guest
error injection ... part of the RAS handling.


> so it is better to inject a data abort instead of SError for thtat platform.

Not at all. The SError here is KVM's response to having nothing else it can do.

Having the host handle the error is the best way of solving the problem.

> because guest will try to do recovery for the Synchronous data abort, such as 
> kill the error
> application. But for SError, guest will be panic.

I'd rather we fix this by having the host handle the error.

Botching in a synchronous exception here would mean Qemu can't properly emulate 
a machine
that has SCR_EL3.EA set ... which you need to provide firmware-first for the 
guest.

[...]

> Yes, some platform supports RAS but will not record the error address, so the 
> host has failed
> to handle the error.

I don't think its reasonable to expect KVM to do anything useful in this case.
We should fix the host error handling.

In what circumstances does this happen?


Thanks,

James
_______________________________________________
kvmarm mailing list
[email protected]
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

Reply via email to