I am sponsoring this case on behalf of Gavin Maltby.  This case seeks patch
binding, with the generic MCA FMA support it delivers targeting a Solaris
update release.  The case straightforwardly extends PSARC/2006/020 and
PSARC/2006/564 and adds FMA fault and ereport events suitable for generation
and consumption on all x86 platforms.

Cindi

This information is Copyright 2007 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Generic x86 Machine Check Architecture FMA
    1.2. Name of Document Author/Supplier:
         Author:  Gavin Maltby
    1.3. Date of This Document:
         10 October, 2007
4. Technical Description

4.1. Project Summary
   4.1.1 Project Description:

        CPU, and in some cases memory, error report telemetry falls
        under the Machine Check Architecture which all recent AMD and
        Intel x86 chip offerings conform to.  Existing FMA support for x86
        chips is limited to AMD family 0xf alone - Opteron/Athlon64/Turion64
        revisions B to F that have shipped in Sun x64 systems to date.
        In expanding this support to include the upcoming AMD family 0x10
        and new Sun Intel-based x64 products we have implemented a generic
        machine check architecture FMA implementation that supports all
        chips that support the MCA, and have layered the existing AMD
        support and additional model-specific support on top of that.
        This case presents the new FMA events that are defined for the
        new generic MCA implementation, and modifies the definitions
        of existing AMD events.  The associated FMA portfolio referenced
        below has already been approved at FMA portfolio review.

    4.1.2 Details:

        The existing implementation for AMD family 0xf delivered
        a "cpu module" cpu.AuthenticAMD.15 which contains all the smarts
        for error handling, error classification, error logging etc.
        The telemetry arising from this module is consumed by the
        eversholt diagnosis engine, applying a set of AMD-specific rules.
        For any other chip type (i.e., not AMD family 0xf) a dumb
        cpu.generic module provides token support for machine check traps
        and raises no telemetry for diagnosis.

        In the new implementation most of the existing AMD code is
        refactored into an improved and fully FMA-aware cpu.generic.
        Model-specific work, such as the more detailed error classification
        possible with AMD chips (i.e., classification beyond what is
        considered "architectural" in the generic MCA), is performed
        in a relatively lightweight "model-specific cpu module" that is
        layered on top of cpu.generic.  The following model-specific
        modules will be delivered:

        Module                  Description
        ----------------------- ---------------------------------------
        cpu_ms.AuthenticAMD.15  AMD family 0xf model-specific support
        cpu_ms.AuthenticAMD     "Generic AMD" model-specific support
        cpu_ms.GenuineIntel     "Generic Intel" model-specific support
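
        As a rough illustration of how the modules in this table might
        be selected, the C sketch below tries the most specific name
        first and falls back to the vendor-wide module, leaving
        cpu.generic on its own when neither is available.  The lookup
        order is inferred from the module names and from the support
        combinations in this section.  cpu_ms_available() and
        cpu_ms_select() are hypothetical names, not the project's
        interface, and real selection would also depend on whether the
        module initializes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Model-specific modules named in the table above. */
static const char *const cpu_ms_delivered[] = {
	"cpu_ms.AuthenticAMD.15",
	"cpu_ms.AuthenticAMD",
	"cpu_ms.GenuineIntel",
};

/* Hypothetical stand-in for "this module exists and initializes". */
static int
cpu_ms_available(const char *name)
{
	size_t i;
	size_t n = sizeof (cpu_ms_delivered) / sizeof (cpu_ms_delivered[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(cpu_ms_delivered[i], name) == 0)
			return (1);
	}
	return (0);
}

/*
 * Assumed lookup order: cpu_ms.<vendor>.<family>, then cpu_ms.<vendor>.
 * The family is formatted in decimal, so family 0xf yields ".15".
 * Returns buf holding the chosen name, or NULL when cpu.generic would
 * run with no model-specific module.
 */
static const char *
cpu_ms_select(const char *vendor, unsigned int family, char *buf, size_t len)
{
	(void) snprintf(buf, len, "cpu_ms.%s.%u", vendor, family);
	if (cpu_ms_available(buf))
		return (buf);

	(void) snprintf(buf, len, "cpu_ms.%s", vendor);
	if (cpu_ms_available(buf))
		return (buf);

	return (NULL);
}
```

        Under these assumptions AMD family 0xf selects
        cpu_ms.AuthenticAMD.15, AMD family 0x10 selects
        cpu_ms.AuthenticAMD, and a vendor with no delivered module
        gets no model-specific module at all.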

        The following support combinations are possible:

        a) cpu.generic absent

                Ereport classes: None
                Fault classes:   None

                The system will have zero MCA capability and no FMA event
                will be raised for cpu or memory.

        b) cpu.generic with no model-specific module

                Ereport classes: ereport.cpu.generic-x86.*
                Fault classes:   fault.cpu.generic-x86.*

                This applies on chips that are not from AMD or Intel.  It also
                applies if no AMD or Intel model-specific support initializes.
                That can happen on an error, when the support is disabled,
                or on new chip families from AMD and Intel that not even the
                "generic AMD" and "generic Intel" model-specific modules
                are able to claim support for.

                Telemetry will be raised entirely by cpu.generic in the
                ereport.cpu.generic-x86.* ereport class, with no model-specific
                augmentation.  It will be diagnosed by a corresponding set
                of eversholt rules to produce faults in the class
                fault.cpu.generic-x86.*.

        c) cpu.generic with cpu_ms.AuthenticAMD.15

                Ereport classes: ereport.cpu.amd.*
                Fault classes:   fault.cpu.amd.*, fault.memory.*

                This applies to AMD family 0xf systems, i.e. it is the
                combination that replaces cpu.AuthenticAMD.15, which
                is the only current Solaris MCA support for x86 systems.

                To maintain compatibility, this combination will produce
                ereports in precisely the same classes as before -
                ereport.cpu.amd.*, and the existing AMD eversholt rules
                will diagnose these to fault.cpu.amd.* and fault.memory.amd.*.
                This is achieved by cpu.generic allowing model-specific
                support to classify the error and provide the ereport class
                to use.  The ereport payload *is* changed since most of it
                is now generated by cpu.generic with only a limited degree
                of augmentation from the model-specific support;  this change
                is limited to the renaming of a number of ereport payload
                members, and the addition of a number of new members.  The
                few AMD diagnosis rules that access payload members are
                changed to use the new names.  In the event definitions,
                existing AMD ereports are retained as payload version 0
                while the revised (renamed, extended) payload is introduced
                as payload version 1.

        d) cpu.generic with cpu_ms.AuthenticAMD

                Ereport classes: ereport.cpu.generic-x86.*
                Fault classes:   fault.cpu.generic-x86.*,
                                 fault.memory.generic-x86.*

                This will apply on AMD systems where no more-specific
                model-specific support exists or initializes;  this
                will include AMD family 0x10 systems.

                Most telemetry is provided entirely by cpu.generic.
                Since these AMD systems include an on-chip memory-controller
                which falls under the MCA umbrella we can also
                recognise and diagnose memory errors, and this is
                done in the model-specific module.

        e) cpu.generic with cpu_ms.GenuineIntel

                Ereport classes: ereport.cpu.intel.*
                Fault classes:   fault.cpu.intel.*,

        This case presents the FMA events for a) - d).  A separate
        case will present the events for e) along with additional
        Intel memory-controller-hub events.
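
        The class selection described in the combinations above, where
        cpu.generic lets model-specific support provide the ereport
        class to use, could be sketched in C roughly as follows.  This
        is a simplification under stated assumptions: struct cpu_ms_ops,
        its ms_ereport_prefix member and gcpu_ereport_class() are
        hypothetical names, the leaf passed in is a placeholder rather
        than a real event name, and real model-specific classification
        may choose the whole class rather than only its prefix.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical ops vector; the real cpu module interface is project-private. */
struct cpu_ms_ops {
	/* Ereport class prefix to use, or NULL to accept the generic one. */
	const char *ms_ereport_prefix;
};

/* Combination c): AMD family 0xf keeps its existing ereport classes. */
static const struct cpu_ms_ops amd_0f_ops = { "ereport.cpu.amd" };

/* Combination d): "generic AMD" leaves the ereport class to cpu.generic. */
static const struct cpu_ms_ops amd_gen_ops = { NULL };

/* Combination e): "generic Intel" supplies its own ereport classes. */
static const struct cpu_ms_ops intel_ops = { "ereport.cpu.intel" };

/*
 * Build the ereport class for an error.  ms is NULL when no
 * model-specific module is present, as in combination b).
 * Returns the snprintf() result.
 */
static int
gcpu_ereport_class(const struct cpu_ms_ops *ms, const char *leaf,
    char *buf, size_t len)
{
	const char *prefix = "ereport.cpu.generic-x86";

	if (ms != NULL && ms->ms_ereport_prefix != NULL)
		prefix = ms->ms_ereport_prefix;

	return (snprintf(buf, len, "%s.%s", prefix, leaf));
}
```

        Keeping the default prefix in cpu.generic means that a
        model-specific module which supplies nothing still yields
        ereport.cpu.generic-x86.* classes, matching combinations b)
        and d).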

        The accompanying erecheck.html details all changes to the
        SMI Events Registry.

        The associated FMA portfolio is here:

        
        http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2007.025.x86MCA

        The portfolio includes full philosophy and diagnosis documents.

    4.5. Interfaces:

        The "cpu module interface" was introduced by the AMD FMA work.
        It is a project-private interface.  The current work substantially
        revises this interface and, in particular, prepares it to work
        in the presence of a hypervisor such as Solaris xVM (changing it
        from a "cpu_t interface" to a "chip/core/strand interface").
        The interface remains project-private.
    
    4.9. I18N/L10N Impact:

        A new FMA dictionary and .po/.mo "GMCA" are delivered.
        
5. Reference Documents:

PSARC/2006/564 FMA for Athlon 64 and Opteron Rev F/G Processors
PSARC/2006/020 FMA for Athlon 64 and Opteron Processors

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: Automatic
    6.6. ARC Exposure: open

