Commit:     87f24c3ac399e82c578e71311251f00618fc5203
Parent:     b2a4ac0c2860b27670bce99e8c9c281bf431c272
Author:     Doug Thompson <[EMAIL PROTECTED]>
AuthorDate: Thu Jul 19 01:50:34 2007 -0700
Committer:  Linus Torvalds <[EMAIL PROTECTED]>
CommitDate: Thu Jul 19 10:04:57 2007 -0700

    drivers/edac: add to edac docs
    Updated the EDAC kernel documentation
    Signed-off-by:      Doug Thompson <[EMAIL PROTECTED]>
    Cc: Alan Cox <[EMAIL PROTECTED]>
    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
 Documentation/drivers/edac/edac.txt |  192 ++++++++++++++++++++++++++++++-----
 1 files changed, 165 insertions(+), 27 deletions(-)

diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt
index 3c5a9e4..a5c3684 100644
--- a/Documentation/drivers/edac/edac.txt
+++ b/Documentation/drivers/edac/edac.txt
@@ -2,22 +2,42 @@
 EDAC - Error Detection And Correction
-Written by Doug Thompson <[EMAIL PROTECTED]>
+Written by Doug Thompson <[EMAIL PROTECTED]>
 7 Dec 2005
+17 Jul 2007    Updated
-EDAC was written by:
-       Thayne Harbaugh,
-       modified by Dave Peterson, Doug Thompson, et al,
-       from the project.
+EDAC is maintained and written by:
+       Doug Thompson, Dave Jiang, Dave Peterson et al,
+       original author: Thayne Harbaugh,
+       website:
+       mailing list:   [EMAIL PROTECTED]
+"bluesmoke" was the name for this device driver when it was "out-of-tree"
+and maintained at  When it was pushed into 2.6.16 for the
+first time, it was renamed to 'EDAC'.
+The bluesmoke project at is now utilized as a 'staging area'
+for EDAC development, before it is sent upstream to
+At the bluesmoke/EDAC project site there is a series of quilt patches against
+recent kernels, stored in an SVN repository. For easier downloading, there
+is also a tarball snapshot available.
 The 'edac' kernel module goal is to detect and report errors that occur
-within the computer system. In the initial release, memory Correctable Errors
-(CE) and Uncorrectable Errors (UE) are the primary errors being harvested.
+within the computer system running under linux.
+In the initial release, memory Correctable Errors (CE) and Uncorrectable
+Errors (UE) are the primary errors being harvested. These types of errors
+are harvested by the 'edac_mc' class of device.
 Detecting CE events, then harvesting those events and reporting them,
 CAN be a predictor of future UE events.  With CE events, the system can
@@ -25,9 +45,27 @@ continue to operate, but with less safety. Preventive maintenance and
 proactive part replacement of memory DIMMs exhibiting CEs can reduce
 the likelihood of the dreaded UE events and system 'panics'.
+A new feature for EDAC, the edac_device class of device, was added in
+the 2.6.23 version of the kernel.
+This new device type allows for non-memory type of ECC hardware detectors
+to have their states harvested and presented to userspace via the sysfs
+interface.
+Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
+engines, fabric switches, main data path switches, interconnections,
+and various other hardware data paths. If the hardware reports it, then
+an edac_device device probably can be constructed to harvest and present
+that to userspace.
 In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
 in order to determine if errors are occurring on data transfers.
 The presence of PCI Parity errors must be examined with a grain of salt.
 There are several add-in adapters that do NOT follow the PCI specification
 with regards to Parity generation and reporting. The specification says
@@ -35,11 +73,17 @@ the vendor should tie the parity status bits to 0 if they do not intend
 to generate parity.  Some vendors do not do this, and thus the parity bit
 can "float" giving false positives.
-[There are patches in the kernel queue which will allow for storage of
-quirks of PCI devices reporting false parity positives. The 2.6.18
-kernel should have those patches included. When that becomes available,
-then EDAC will be patched to utilize that information to "skip" such
+In the kernel there is a pci device attribute located in sysfs that is
+checked by the EDAC PCI scanning code. If that attribute is set,
+PCI parity/error scanning is skipped for that device. The attribute
+       broken_parity_status
+is located in the /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
+PCI devices.
 EDAC will have future error detectors that will be integrated with
 EDAC or added to it, in the following list:
@@ -57,13 +101,14 @@ and the like.
-EDAC is composed of a "core" module (edac_mc.ko) and several Memory
+EDAC is composed of a "core" module (edac_core.ko) and several Memory
 Controller (MC) driver modules. On a given system, the CORE
 is loaded and one MC driver will be loaded. Both the CORE and
-the MC driver have individual versions that reflect current release
-level of their respective modules.  Thus, to "report" on what version
-a system is running, one must report both the CORE's and the
-MC driver's versions.
+the MC driver (or edac_device driver) have individual versions that reflect
+current release level of their respective modules.
+Thus, to "report" on what version a system is running, one must report both
+the CORE's and the MC driver's versions.
@@ -88,8 +133,9 @@ EDAC sysfs INTERFACE
 EDAC presents a 'sysfs' interface for control, reporting and attribute
 reporting purposes.
-EDAC lives in the /sys/devices/system/edac directory. Within this directory
-there currently reside 2 'edac' components:
+EDAC lives in the /sys/devices/system/edac directory.
+Within this directory there currently reside 2 'edac' components:
        mc      memory controller(s) system
        pci     PCI control and status system
@@ -188,7 +234,7 @@ In directory 'mc' are EDAC system overall control and attribute files:
 Panic on UE control file:
-       'panic_on_ue'
+       'edac_mc_panic_on_ue'
        An uncorrectable error will cause a machine panic.  This is usually
        desirable.  It is a bad idea to continue when an uncorrectable error
@@ -199,12 +245,12 @@ Panic on UE control file:
        LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
-       RUN TIME:  echo "1" >/sys/devices/system/edac/mc/panic_on_ue
+       RUN TIME:  echo "1" >/sys/devices/system/edac/mc/edac_mc_panic_on_ue
 Log UE control file:
-       'log_ue'
+       'edac_mc_log_ue'
        Generate kernel messages describing uncorrectable errors.  These errors
        are reported through the system message log system.  UE statistics
@@ -212,12 +258,12 @@ Log UE control file:
        LOAD TIME: module/kernel parameter: log_ue=[0|1]
-       RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue
+       RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ue
 Log CE control file:
-       'log_ce'
+       'edac_mc_log_ce'
        Generate kernel messages describing correctable errors.  These
        errors are reported through the system message log system.
@@ -225,12 +271,12 @@ Log CE control file:
        LOAD TIME: module/kernel parameter: log_ce=[0|1]
-       RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce
+       RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ce
 Polling period control file:
-       'poll_msec'
+       'edac_mc_poll_msec'
        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources.  Too large a value might delay
@@ -241,7 +287,7 @@ Polling period control file:
        LOAD TIME: module/kernel parameter: poll_msec=[0|1]
-       RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
+       RUN TIME: echo "1000" >/sys/devices/system/edac/mc/edac_mc_poll_msec
@@ -587,3 +633,95 @@ Parity Count:
+EDAC_DEVICE type of device
+In the header file, edac_core.h, there is a series of edac_device structures
+and APIs for the EDAC_DEVICE.
+User space access to an edac_device is through the sysfs interface.
+At the location /sys/devices/system/edac (sysfs) new edac_device devices will
+appear.
+There is a three level tree beneath the above 'edac' directory. For example,
+the 'test_device_edac' device (found at the website)
+installs itself as:
+       /sys/devices/system/edac/test-instance
+In this directory are various controls, a symlink and one or more 'instance'
+directories.
+The standard default controls are:
+       log_ce          boolean to log CE events
+       log_ue          boolean to log UE events
+       panic_on_ue     boolean to 'panic' the system if an UE is encountered
+                       (default off, can be set true via startup script)
+       poll_msec       time period between POLL cycles for events
+The test_device_edac device adds at least one custom control of its own:
+       test_bits       which in the current test driver does nothing but
+                       show how it is installed. A ported driver can
+                       add one or more such controls and/or attributes
+                       for specific uses.
+                       One out-of-tree driver uses controls here to allow
+                       for ERROR INJECTION operations to hardware
+                       injection registers
+The symlink points to the 'struct dev' that is registered for this edac_device.
+One or more instance directories are present. For the 'test_device_edac' case:
+       test-instance0
+In this directory there are two default counter attributes, which are totals of
+the counters in deeper subdirectories.
+       ce_count        total of CE events of subdirectories
+       ue_count        total of UE events of subdirectories
+At the lowest directory level is the 'block' directory. There can be 0, 1
+or more blocks specified in each instance.
+       test-block0
+In this directory the default attributes are:
+       ce_count        which is a counter of CE events for this 'block'
+                       of hardware being monitored
+       ue_count        which is a counter of UE events for this 'block'
+                       of hardware being monitored
+The 'test_device_edac' device adds 4 attributes and 1 control:
+       test-block-bits-0       for every POLL cycle this counter
+                               is incremented
+       test-block-bits-1       every 10 cycles, this counter is bumped once,
+                               and test-block-bits-0 is set to 0
+       test-block-bits-2       every 100 cycles, this counter is bumped once,
+                               and test-block-bits-1 is set to 0
+       test-block-bits-3       every 1000 cycles, this counter is bumped once,
+                               and test-block-bits-2 is set to 0
+       reset-counters          writing ANY thing to this control will
+                               reset all the above counters.
+Use of the 'test_device_edac' driver should help others create their own
+unique drivers for their hardware systems.
+The 'test_device_edac' sample driver is located at the project site for EDAC.
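
The three-level edac_device layout described in the patch (device
directory, 'instance' directories, 'block' directories) can be walked from
user space. The following is only a minimal Python sketch, not part of the
patch. It assumes that every real subdirectory of a device directory is an
instance, that every real subdirectory of an instance is a block, and that
the symlink to the registered device should be skipped. The helper names
read_count, collect_block_counts and build_mock_tree are hypothetical, and
the mock tree exists only so the sketch can be run on a machine without
EDAC hardware.

```python
import os
import tempfile


def read_count(path):
    """Return the integer stored in a counter file, or 0 if it is unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def collect_block_counts(device_dir):
    """Map (instance, block) -> (ce_count, ue_count) for one edac_device directory.

    Assumption: controls such as log_ce are regular files, and the link to the
    registered device is a symlink, so only real directories are descended into.
    """
    counts = {}
    for instance in sorted(os.listdir(device_dir)):
        instance_dir = os.path.join(device_dir, instance)
        if os.path.islink(instance_dir) or not os.path.isdir(instance_dir):
            continue
        for block in sorted(os.listdir(instance_dir)):
            block_dir = os.path.join(instance_dir, block)
            if os.path.islink(block_dir) or not os.path.isdir(block_dir):
                continue
            counts[(instance, block)] = (
                read_count(os.path.join(block_dir, "ce_count")),
                read_count(os.path.join(block_dir, "ue_count")),
            )
    return counts


def build_mock_tree(root):
    """Create a fake 'test-instance' device tree under root (hypothetical data)."""
    device_dir = os.path.join(root, "test-instance")
    instance_dir = os.path.join(device_dir, "test-instance0")
    block_dir = os.path.join(instance_dir, "test-block0")
    os.makedirs(block_dir)
    files = {
        os.path.join(device_dir, "log_ce"): "1",
        os.path.join(device_dir, "poll_msec"): "1000",
        os.path.join(instance_dir, "ce_count"): "3",
        os.path.join(instance_dir, "ue_count"): "1",
        os.path.join(block_dir, "ce_count"): "3",
        os.path.join(block_dir, "ue_count"): "1",
    }
    for path, value in files.items():
        with open(path, "w") as f:
            f.write(value + "\n")
    return device_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        device_dir = build_mock_tree(root)
        for (instance, block), (ce, ue) in collect_block_counts(device_dir).items():
            print(f"{instance}/{block}: ce_count={ce} ue_count={ue}")
```

On a real system the device directory would be one of the entries under
/sys/devices/system/edac. The sketch reads counters only and never writes
to controls such as reset-counters.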