On Tue, 30 Jul 2024 13:17:09 +0200, Igor Mammedov <imamm...@redhat.com> wrote:

> On Mon, 22 Jul 2024 08:45:56 +0200
> Mauro Carvalho Chehab <mchehab+hua...@kernel.org> wrote:
>
> that's quite a bit of code that in 99% won't ever be used
> (assuming error injection testing scenario),
> not to mention it's a hw dependent one and governed by different specs.
>
> Essentially we would need to create _whole_ lot of QAPI
> commands to cover possible errors for no benefit to QEMU.
>
> Let's take for example very simple _OST status reporting,
> QEMU of course can decode values and present it to users in
> more 'presentable' form. However instead of translating
> numbers (aka. spec language) into a made up QEMU language,
> QEMU just passes values up the stack and users can use
> well defined spec to interpret its meaning.
>
> benefits are: QEMU doesn't have to maintain translation
> code and QAPI ABI is limited to passing raw values.
>
> Can we do similar thing here as well?
> i.e. simplify error injection commands to
> a command that takes raw value and passes it
> to guest (QEMU here acts as proxy, if I'm not
> mistaken)?
>
> Preferably make it generic enough to handle
> not only ARM but other error formats HEST is
> able to handle.

An overly generic interface doesn't sound feasible to me, as the EINJ
code needs to check QEMU implementation details before injecting the
error.

See, processor is probably the simplest error injection source, as most
of the fields there aren't related to how the hardware simulation is
done. Yet, if you look at patch 7 of this series, you'll notice that some
fields should actually be filled based on the emulation. On ARM, we have
some IDs that depend on the emulation (MIDR, MPIDR, power state). Doing
that in userspace may require a QAPI to query them.

The memory layout, however, is the most complex one. Even for an ARM
processor CPER (which is the simplest scenario), the physical/virtual
addresses need to be checked against the emulation environment.
Other error sources (like memory errors, CXL, etc.) will require deep
knowledge of how QEMU mapped such devices.

So, in practice, if we move this to an EINJ script, we'll need to add a
probably more complex QAPI to allow querying the memory layout and other
device- and CPU-specific bindings.

Also, we don't know what newer versions of the ACPI spec have in store
for us. See, even the HEST table contents depend on the HEST revision
number, as made clear in the ACPI 6.5 notes:

	https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source

and at:

	https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward

So, if we're willing to add support for a more generic "raw data" QAPI,
I would still do it per-type, and only for the fields that don't require
knowledge of the device-emulation details.

Btw, my proposal on patch 7 of this series is to have raw data for:

	- the error-info field;
	- registers dump;
	- micro-architecture specific data.

I don't mind trying to have more raw data there, as I see (marginal)
benefits in allowing the generation of invalid CPER records [1], but
some of those fields need to be validated and/or filled internally by
QEMU, unless forced to a specific value by the caller.

[1] a raw data EINJ can be useful for fuzz testing, to check that badly
    formed packets won't cause a Kernel panic or open the door to an
    exploit. Yet, this is not really a concern for APEI: if the hardware
    is faulty, a Kernel panic is not off the table. Also, if the BIOS is
    already compromised and has malicious code in it, the EINJ interface
    is not the main concern.

> PS:
> For user convenience, QEMU can carry a script that
> could help generate this raw value in user friendly way
> but at the same time it won't put maintenance
> burden on QEMU itself.

The script will still require reviews, and the same code will be there.
So, in terms of maintenance burden, there won't be much difference.
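
Just to make the per-type split discussed above more concrete, here is a
rough Python sketch. It is not taken from patch 7 nor from any existing
script: the class, the function, the field names and the CPU/RAM values
are all made up for illustration. The idea is only that the raw blobs
are passed through untouched, while the emulation-dependent fields are
filled internally and validated, unless the caller forces a value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for what the emulation knows about a vCPU and
# about the guest memory map. The values are arbitrary examples.
EMULATED_CPU = {"midr": 0x410FD034, "mpidr": 0x80000000, "running": True}
GUEST_RAM = (0x40000000, 0x80000000)  # (base, size), assumed layout


@dataclass
class ArmProcessorErrorRequest:
    # Raw data: passed through as-is, the spec defines its meaning.
    error_info: bytes = b""
    registers: bytes = b""
    vendor_specific: bytes = b""
    # Emulation-dependent fields: None means "fill it internally".
    midr: Optional[int] = None
    mpidr: Optional[int] = None
    running: Optional[bool] = None
    phys_addr: Optional[int] = None
    # When True, skip validation (e.g. to build an invalid record).
    force: bool = False


def resolve(req: ArmProcessorErrorRequest) -> dict:
    """Fill emulation-dependent fields and validate the address."""
    base, size = GUEST_RAM
    if req.phys_addr is not None and not req.force:
        if not base <= req.phys_addr < base + size:
            raise ValueError(f"address {req.phys_addr:#x} is outside guest RAM")

    def pick(name, forced):
        # A value given by the caller wins over the emulated one.
        return EMULATED_CPU[name] if forced is None else forced

    return {
        "midr": pick("midr", req.midr),
        "mpidr": pick("mpidr", req.mpidr),
        "running": pick("running", req.running),
        "phys_addr": req.phys_addr,
        "error_info": req.error_info,
        "registers": req.registers,
        "vendor_specific": req.vendor_specific,
    }
```

With something like this, resolve(ArmProcessorErrorRequest(error_info=blob))
would get MIDR/MPIDR/power state from the emulation, while passing
force=True would let a test deliberately point to an address outside the
guest RAM.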

Btw, I'm actually using a script myself to test it. It currently lives
alongside rasdaemon, which is the Linux tool to detect and handle
hardware errors:

	https://github.com/mchehab/rasdaemon/blob/master/contrib/qemu_einj.py

as it helps a lot when trying to simulate more complex errors. Once QEMU
gains support for injecting processor errors, I can prepare a separate
patch to move it to QEMU.

Thanks,
Mauro