Hello Shuai,

On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
> > > APEI does not define an error type named GHES. GHES is just a kernel
> > > driver name. Many hardware error types can be handled in GHES (see
> > > ghes_do_proc), for example, AER is routed by GHES when firmware-first
> > > mode is used. As far as I know, firmware-first mode is commonly used in
> > > production. Should GHES errors be categorized into AER, memory, and CXL
> > > memory instead?
> >
> > I also considered slicing the data differently initially, but then
> > realized it would add more complexity than necessary for my needs.
> >
> > If you believe we should further subdivide the data, I’m happy to do so.
> >
> > You’re suggesting a structure like this, which would then map to the
> > corresponding CPER_SEC_ sections:
> >
> > 	enum hwerr_error_type {
> > 		HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
> > 		HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
>
> CPER_SEC_PCIE is typo?

Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors coming
from GHES.

> > 		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
> > 		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
> > 	}
> >
> > Additionally, what about events related to CPU, Firmware, or DMA
> > errors, for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
> > include those in the classification as well?
>
> I would like to split a error from ghes to its own type,
> it sounds more reasonable. I can not tell what happened from
> HWERR_RECOV_AER at all :(

Makes sense. Regarding your answer, I suppose we might want to have
something like the following:

	enum hwerr_error_type {
		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
	}

Is this what you think we should track?

Thanks
--breno
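
PS: To make the mapping above a bit more concrete, here is a rough userspace
sketch of how a GHES section could be folded into one of the proposed types.
It is only an illustration under stated assumptions: enum sec_kind,
hwerr_classify(), hwerr_log(), hwerr_count[] and HWERR_RECOV_MAX are
hypothetical names that do not exist in the kernel, and the real CPER_SEC_*
section types are GUIDs rather than enum values. Since CPER_SEC_DMAR_ is
listed under both DMA and OTHERS above, the sketch arbitrarily picks DMA.

```c
/* Proposed classification, copied from the enum in this thread. */
enum hwerr_error_type {
	HWERR_RECOV_MCE,	/* do_machine_check(); not reached via GHES */
	HWERR_RECOV_CXL,
	HWERR_RECOV_PCI,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_CPU,
	HWERR_RECOV_DMA,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,	/* hypothetical: array size, not part of the proposal */
};

/*
 * Hypothetical stand-in for the CPER section types. In the kernel these
 * are GUIDs (CPER_SEC_*); a plain enum keeps the sketch self-contained.
 */
enum sec_kind {
	SEC_PCIE,
	SEC_PCI,
	SEC_CXL,
	SEC_PLATFORM_MEM,
	SEC_PROC,
	SEC_DMAR,
	SEC_FW,
};

/* Fold one section kind into the proposed error type. */
static enum hwerr_error_type hwerr_classify(enum sec_kind kind)
{
	switch (kind) {
	case SEC_PCIE:
	case SEC_PCI:
		return HWERR_RECOV_PCI;
	case SEC_CXL:
		return HWERR_RECOV_CXL;
	case SEC_PLATFORM_MEM:
		return HWERR_RECOV_MEMORY;
	case SEC_PROC:
		return HWERR_RECOV_CPU;
	case SEC_DMAR:
		return HWERR_RECOV_DMA;	/* assumption: DMA rather than OTHERS */
	case SEC_FW:
	default:
		return HWERR_RECOV_OTHERS;
	}
}

/* One counter per proposed type. */
static unsigned int hwerr_count[HWERR_RECOV_MAX];

/* Record one recovered error for the type the section maps to. */
static void hwerr_log(enum sec_kind kind)
{
	hwerr_count[hwerr_classify(kind)]++;
}
```

With this shape, a CPER_SEC_PCIE and a CPER_SEC_PCI section would both bump
the same HWERR_RECOV_PCI counter, and anything unrecognised would land in
HWERR_RECOV_OTHERS.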