On 02/17/09 11:01, Lisa Curhan wrote:
> Hi Tarik-
>
>  Do you agree with Arrow's assessment that this fault is due to a Solaris
> FMA code problem rather than a hardware problem?

It's hard to say without more information.
There could be both a hardware issue and a software issue.

First of all, we know that we're getting an "ereport.cpu.intel.unknown".
I'm guessing that wouldn't happen unless there was some sort of CPU
error to start with. I'm not sure where that ereport gets generated or
why it's "unknown". Gavin Maltby and Adrian Frost are your best bet,
since they worked on the Intel CPU/memory FMA code.
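For what it's worth, the raw IA32_MCi_STATUS value in the ereport can be decoded by hand. Below is a minimal Python sketch that assumes two things from the Intel SDM (Vol. 3B, machine-check architecture): the architectural bit layout, and the usual four MSRs per bank starting at IA32_MC0_CTL = 0x400. The helper names are made up for illustration and are not part of the FMA code.

Applied to 0xcc0001800001009f, the sketch reproduces the overflow, error_uncorrected, error_enabled, processor_context_corrupt, error_code and model_specific_error_code values that fmd printed. Bank 8 also maps to the reported bank_msr_offset of 0x420.

The error-code classification handles only the memory-controller compound form, and its table is an assumption based on the SDM. Treat the result as a hint, not a diagnosis.

```python
# Sketch only: decodes IA32_MCi_STATUS using the architectural layout from the
# Intel SDM (Vol. 3B, Machine-Check Architecture). Bits 56:32 ("other
# information", partly model-specific) are not decoded.

IA32_MC0_CTL = 0x400  # address of the first bank's control MSR
MSRS_PER_BANK = 4     # CTL, STATUS, ADDR, MISC

# MMM field of a memory-controller compound error code (assumed from the SDM).
MEMORY_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def bank_ctl_msr(bank: int) -> int:
    """Return the IA32_MCi_CTL MSR address for machine-check bank `bank`."""
    return IA32_MC0_CTL + MSRS_PER_BANK * bank


def decode_mci_status(status: int) -> dict[str, int]:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    def bit(n: int) -> int:
        return (status >> n) & 1

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "error_uncorrected": bit(61),
        "error_enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "processor_context_corrupt": bit(57),
        "model_specific_error_code": (status >> 16) & 0xFFFF,
        "error_code": status & 0xFFFF,
    }


def describe_error_code(code: int) -> str:
    """Classify a compound MCA error code (memory-controller form only)."""
    # Memory-controller errors: 000F 0000 1MMM CCCC (F = filtering bit, ignored).
    if (code & 0xEF80) != 0x0080:
        return "not a memory-controller error code (not decoded here)"
    transaction = MEMORY_TRANSACTIONS.get((code >> 4) & 0x7, "reserved transaction type")
    channel = code & 0xF
    where = "channel not specified" if channel == 0xF else f"channel {channel}"
    return f"memory controller: {transaction}, {where}"


if __name__ == "__main__":
    fields = decode_mci_status(0xCC0001800001009F)  # IA32_MCi_STATUS from the ereport
    for name, value in fields.items():
        print(f"{name:28} {value:#x}")
    print("bank 8 control MSR:", hex(bank_ctl_msr(8)))
    print(describe_error_code(fields["error_code"]))
```

Keeping the decoding table-driven and separate from the bank arithmetic makes it easy to check each field against the payload fmd already printed.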

It looks to me like these ereports should be discarded in the rules.
prop upset.disc...@chip/cpu (0)->
ereport.cpu.intel.exter...@chip/cpu,
ereport.cpu.intel.unkn...@chip/cpu;

However, you got an "undiagnosable" defect with the reason
"ereport zero not found in instance tree". This is most likely
due to a problem with one of the following:
- the DE rules
- the topology
- the ereport content (detector fmri)
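To illustrate the topology and detector cases: the detector FMRI in an ereport has to correspond to a node in the instance tree. The sketch below is a deliberately simplified model, not eft's actual lookup. It strips the authority from hc-scheme FMRIs and checks whether the detector's path appears in a list of FMRIs. The topology list is hypothetical; on a real system it would come from `fmtopo` output.

```python
# Simplified illustration only: this is NOT eft's instance-tree lookup. It
# compares the hc path of an ereport detector against a list of hc FMRIs.


def hc_path(fmri: str) -> tuple[tuple[str, str], ...]:
    """Return the (name, instance) pairs of an hc-scheme FMRI, ignoring the authority."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError(f"not an hc FMRI: {fmri!r}")
    # Everything before the first '/' after "hc://" is the (possibly empty) authority.
    _authority, _, path = fmri[len(prefix):].partition("/")
    pairs = []
    for component in path.split("/"):
        if not component:
            continue
        name, sep, instance = component.partition("=")
        if not sep:
            raise ValueError(f"malformed hc component: {component!r}")
        pairs.append((name, instance))
    return tuple(pairs)


def detector_in_topology(detector: str, topology: list[str]) -> bool:
    """True if the detector's hc path matches one of the topology FMRIs."""
    known = {hc_path(fmri) for fmri in topology}
    return hc_path(detector) in known


if __name__ == "__main__":
    # Hypothetical topology; a real list would be taken from fmtopo output.
    topology = [
        "hc://:product-id=EXAMPLE/motherboard=0",
        "hc://:product-id=EXAMPLE/motherboard=0/chip=0",
        "hc://:product-id=EXAMPLE/motherboard=0/chip=0/cpu=0",
    ]
    print(detector_in_topology("hc:///motherboard=0/chip=0/cpu=3", topology))
```

A mismatch in this toy check corresponds loosely to the "not found in instance tree" reason: either the topology lacks the node, or the detector is named differently from what the topology enumerates.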

Can we get the `fmdump -eV` and `fmtopo -pV` output from the system?
This would help us figure out why the diagnosis failed.
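When the `fmdump -eV` output comes back, the fields of interest (class, detector, __tod) are plain `name = value` lines, the same layout as the raw event data quoted further down. Below is a rough Python sketch for pulling them out. It makes two assumptions:

- There is one pair per line. Nesting is flattened, so later duplicate names overwrite earlier ones.
- `__tod` holds seconds and nanoseconds in hex.

With the ereport's `__tod` from this thread (0x4999d1f3 0x26be18e7), the sketch yields 2009-02-16 20:52:03 UTC plus 649992423 ns. That matches the reported error-time of 04:52:03.649992423 in a UTC+8 local time zone.

```python
from datetime import datetime, timezone


def parse_nvlist_text(text: str) -> dict[str, str]:
    """Collect 'name = value' lines from a verbose nvlist dump.

    Nesting is flattened, so a name that appears twice keeps its last value.
    """
    fields: dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(" = ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def decode_tod(tod: str) -> tuple[datetime, int]:
    """Decode a '__tod' value assumed to be '<seconds-hex> <nanoseconds-hex>'."""
    sec_hex, nsec_hex = tod.split()
    seconds = int(sec_hex, 16)      # int() accepts the 0x prefix with base 16
    nanoseconds = int(nsec_hex, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds


if __name__ == "__main__":
    # Excerpt of the ereport payload from the fmcheck log in this thread.
    sample = """\
        class = ereport.cpu.intel.unknown
        __tod = 0x4999d1f3 0x26be18e7
        detector = hc:///motherboard=0/chip=0/cpu=3
"""
    fields = parse_nvlist_text(sample)
    when, nsec = decode_tod(fields["__tod"])
    print(fields["class"], fields["detector"])
    print(when.isoformat(), f"+{nsec} ns")
```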

> The ereport points to
> an Intel CPU, but the SunVTS tests running when this fault occurred did
> not fail.
>   

SunVTS will not necessarily fail unless it's checking for ereports or faults.

-tarik
>  If you or one of the other Solaris FMA experts could get back to us
> today with some advice, it would be much appreciated.  This failure mode
> is occurring both in system test (SAT) and during the RQT testing and
> Vayu is at a critical point in time where we need to make progress as
> quickly as possible.  I believe we are using X64 S10u6 in this testing,
> but I'm not sure which patches we have.
>
> Lisa
> ---------------------------------------------------
> On 02/17/09 01:45, Arrow.luo wrote:
>   
>> Hi Dauker,
>>
>> This failure should not be related to the device reported by FMD; it is
>> a minor FMA failure that may be caused by incorrect FMA software. The
>> attached files give a detailed explanation of this failure.
>>
>> *Suggested Action for System Administrator*
>>
>>     Run pkgchk -n SUNWfmd to ensure that fault management software is
>>     installed properly. Contact Sun for support. 
>>
>> *Details*
>>
>>     The Message ID:   *SUNOS-8000-1L* indicates errors were detected but
>>     the module responsible for diagnosing those errors was unable to
>>     arrive at a diagnosis. This may be caused more commonly by downrev
>>     or incompatible revisions of software, or less frequently by bugs in
>>     the diagnosis module or subsystem.
>>
>>     For SPARC based systems running Solaris 10, the kernel patch 127111
>>     and Solaris FMA patch 119578 are particularly critical and should be
>>     examined. Ensure that any dependencies documented in the patch
>>     README files are noted and addressed.
>>
>>     For x64 based systems running Solaris 10, the kernel patch 127112
>>     and Solaris FMA patches 118344 and 125370 are particularly critical
>>     and should be examined. Ensure that any dependencies documented in
>>     the patch README files are noted and addressed.
>>
>> Please check the exact SUNWfmd package revision and confirm whether
>> the latest Solaris FMA patches have been installed.
>>
>> Kind regards,
>> ArrowL
>>
>> dauker....@mic.com.tw wrote:
>>     
>>> *Dear Lisa*
>>>
>>> *The message below shows an FMA check failure in the VTS test.*
>>>
>>> */0215PEV002-1/ 
>>> <http://192.168.100.12:8080/nexus/view?domain_name=ALL&batch_name=0215PEV002-1&unit_name=0215PEV002-1>*
>>>
>>> *Test suite: SAT*
>>>
>>> *Test set: VTS_EXCLUSIVE*
>>>
>>> *Test case: fma_filter_faultx*
>>>
>>>
>>> ##############################################################################
>>> fmcheck version 1.3 (r10262) starting Tue Feb 17 06:40:44 CST 2009
>>> ------------------------------------------------------------------------------
>>> fmcheck: current configuration:
>>>         fmcheck_code_map         = /home/domain/nexus/units/0215PEV002-1/lib/fmcheck/FMA_NC_CODES
>>>         fmcheck_dump_raw         = 1
>>>         fmcheck_error_log        = 
>>>         fmcheck_fault_log        = 
>>>         fmcheck_filter_expr      = 
>>>         fmcheck_filter_file      = /home/domain/nexus/units/0215PEV002-1/cfg/vayu/fma_filter_vayu.pl
>>>         fmcheck_match_invert     = 1
>>>         fmcheck_msg_format       = Fault %{suspect.class} on FRU %{suspect.fru}
>>>         fmcheck_output_path      = 
>>>         fmcheck_show_all         = 1
>>>         fmcheck_source           = faultx
>>>         fmcheck_source_component = Ops::FMD::Log::Indirect
>>>         fmcheck_use_dict         = 
>>>         fmd_root_path            = 
>>>         fmdump_binary            = 
>>>         fmdump_xml_filename      = 
>>>         no_config                = 
>>>         no_special               = 
>>> ------------------------------------------------------------------------------
>>> Found 1 matching faultx event
>>> ------------------------------------------------------------------------------
>>> Fault Information: 
>>>        UUID: 36f47185-58a4-435e-faf5-b1f36c848e13
>>>        Time: Tue Feb 17 04:52:03.667716 2009
>>>     Article: http://www.sun.com/msg/SUNOS-8000-1L
>>>    Suspects: 
>>>              [0]    Class: defect.sunos.eft.undiagnosable_problem
>>>                 Certainty: 100%
>>>  
>>>    Error(s): 
>>>              [0]     Time: Tue Feb 17 04:52:03.649992423 2009
>>>                     Class: ereport.cpu.intel.unknown
>>>                  Detector: hc:///motherboard=0/chip=0/cpu=3
>>>  
>>> ------------------------------------------------------------------------------
>>> ==============================================================================
>>> Raw FMD Event Data
>>> ==============================================================================
>>> header = (embedded nvlist)
>>> nvlist version: 0
>>>         creator = fmd
>>>         hostname = test12-027
>>>         label = faultx
>>>         osrel = 5.10
>>>         osver = Generic_137138-08
>>>         plat = i86pc
>>>         uuid = 08f78d8d-430b-67e2-c835-9986a6ef3caa
>>>         version = 1.2
>>> (end header)
>>>  
>>> event-list = (embedded nvlist)
>>> (start event-list[0])
>>> nvlist version: 0
>>>         version = 0x0
>>>         class = list.suspect
>>>         uuid = 36f47185-58a4-435e-faf5-b1f36c848e13
>>>         diag-time = Tue Feb 17 04:52:03.667716 2009
>>>         __ttl = 0x1
>>>         __tod = 0x4999d1f3 0x2afe2250
>>>         code = SUNOS-8000-1L
>>>         de = fmd://product-id=ARRAY(0x8565d9c),chassis-id=ARRAY(0x8565dc0),server-id=test12-027/mod-name=eft/:mod_version=1.16
>>>         error-list = (array of embedded nvlists)
>>>         (start error-list[0])
>>>         nvlist version: 0
>>>                 class = ereport.cpu.intel.unknown
>>>                 euid = c9b9-538ab98-45a6d7f-4ba275e-06b4c2b
>>>                 ena = 0x17b839277b106001
>>>                 error-time = Tue Feb 17 04:52:03.649992423 2009
>>>                 IA32_MCG_STATUS = 0x0
>>>                 IA32_MCi_ADDR = 0x181d5de40
>>>                 IA32_MCi_MISC = 0xc944312000085f47
>>>                 IA32_MCi_STATUS = 0xcc0001800001009f
>>>                 __tod = 0x4999d1f3 0x26be18e7
>>>                 __ttl = 0x1
>>>                 bank_msr_offset = 0x420
>>>                 bank_number = 0x8
>>>                 detector = hc:///motherboard=0/chip=0/cpu=3
>>>                 error_uncorrected = 0
>>>                 error_code = 0x9f
>>>                 error_enabled = 0
>>>                 machine_check_in_progress = 0
>>>                 model_specific_error_code = 0x1
>>>                 overflow = 1
>>>                 processor_context_corrupt = 0
>>>                 threshold_based_error_status = No tracking
>>>         (end error-list[0])
>>>  
>>>         error-list-sz = 0x1
>>>         fault-list = (array of embedded nvlists)
>>>         (start fault-list[0])
>>>         nvlist version: 0
>>>                 version = 0x0
>>>                 class = defect.sunos.eft.undiagnosable_problem
>>>                 certainty = 100
>>>                 reason = ereport zero not found in instance tree
>>>         (end fault-list[0])
>>>  
>>>         fault-status = 0x1
>>>         fault-list-sz = 0x1
>>> (end event-list[0])
>>>  
>>> (end event-list)
>>>  
>>> ==============================================================================
>>>
>>> SCRIPT=fmcheck EXITSTATUS=31 SYMPTOM=261001 TTF=0 TIME=20090217064044 
>>> EVENT_CODE=SUNOS-8000-1L SUSPECT_ERROR_CLASS=ereport.cpu.intel.unknown 
>>> SUSPECT_ERROR_DETECTOR=hc:///motherboard=0/chip=0/cpu=3 
>>> SUSPECT_FAULT_CLASS=defect.sunos.eft.undiagnosable_problem MSG="Fault 
>>> defect.sunos.eft.undiagnosable_problem on FRU unknown"
>>>
>>>  
>>>
>>>  
>>>
>>> Best Regards
>>>
>>> Dauker,
>>>
>>> //RDD5 Software Design Dept.//
>>>
>>> //Enterprise System Business Unit//
>>>
>>> //MiTAC International Corp.//
>>>
>>> //TEL: +886 (3) 3289000 Ext 5211//
>>>
>>> //FAX: 886 (3) 3963477//
>>>
>>>  
>>>
>>>       
>> -- 
>> Regards.
>>
>> Arrow.luo
>> Operation Engineer
>> Asian Operation, WWOPS.
>> Sun Microsystems of California Limited
>> Phone : x58972/+86(0)20 85109972
>> Mobile: +86-13929171806
>> Email : arrow....@sun.com
>>
>>
>> ------------------------------------------------------------------------
>>
>>
>>             Article for Message ID:   *SUNOS-8000-1L*
>>
>>             
>> ------------------------------------------------------------------------
>>
>>
>>             *Eft cannot produce diagnosis*
>>
>>             *Type*
>>
>>                 Defect 
>>
>>             *Severity*
>>
>>                 Minor 
>>
>>             *Description*
>>
>>                 The EFT Diagnosis Engine encountered telemetry for which
>>                 it is unable to produce a diagnosis. 
>>
>>             *Automated Response*
>>
>>                 Error reports from the component will be logged for
>>                 examination by Sun. 
>>
>>             *Impact*
>>
>>                 Automated diagnosis and response for these events will
>>                 not occur. 
>>
>>             *Suggested Action for System Administrator*
>>
>>                 Run pkgchk -n SUNWfmd to ensure that fault management
>>                 software is installed properly. Contact Sun for support. 
>>
>>             *Details*
>>
>>                 The Message ID:   *SUNOS-8000-1L* indicates errors were
>>                 detected but the module responsible for diagnosing those
>>                 errors was unable to arrive at a diagnosis. This may be
>>                 caused more commonly by downrev or incompatible
>>                 revisions of software, or less frequently by bugs in the
>>                 diagnosis module or subsystem.
>>
>>                 For SPARC based systems running Solaris 10, the kernel
>>                 patch 127111 and Solaris FMA patch 119578 are
>>                 particularly critical and should be examined. Ensure
>>                 that any dependencies documented in the patch README
>>                 files are noted and addressed.
>>
>>                 For x64 based systems running Solaris 10, the kernel
>>                 patch 127112 and Solaris FMA patches 118344 and 125370
>>                 are particularly critical and should be examined. Ensure
>>                 that any dependencies documented in the patch README
>>                 files are noted and addressed.
>>
>>                 If all patches appear to be current and all dependencies
>>                 are properly addressed, error telemetry will need to be
>>                 examined by Service personnel to make a further
>>                 determination of the problem. Please see the information
>>                 below for certain platform specific patch requirements.
>>
>>                 For T1000, T2000, Netra T2000, T5120, and T5220 systems,
>>                 please make sure that the software and firmware patch
>>                 levels are as follows:
>>
>>                 1.) Solaris 5.10: kernel patch 127111-08 or higher
>>
>>                 2.) Solaris 5.10: FMA Patch 119578-30 or higher
>>
>>                 3.) System firmware:
>>
>>                 T5120 and T5220 firmware patch 127580-05 or higher
>>
>>                 T2000 firmware patch 136927-01 or higher
>>
>>                 Netra T2000 firmware patch 136929-01 or higher
>>
>>                 T1000 firmware patch 127577-03 or higher
>>
>>                 *Please Note -* On some system platforms there could be
>>                 additional information that may disclose a hardware
>>                 fault, see the following:
>>
>>                 On Sun SPARC Enterprise Mx000 systems, please check the
>>                 output of "showstatus" and "fmdump" from the xscfu for
>>                 possible hardware faults.
>>
>>                 On Sun Fire 12K, 15K, 20K and E25K systems, please check
>>                 the output of "showlogs -E -p e list 3" from the main
>>                 system controller for information about possible
>>                 hardware faults.
>>
>>                 On Sun Fire 3800, 48x0, 6800, v1280, E2900, E4900, E6900
>>                 and Netra 1280/1290 systems, please check the output of
>>                 "showlogs -v" from the main system controller for
>>                 information about possible hardware faults.
>>
>>             For more information, contact Sun Support
>>             <http://www.sun.com/service/online/>.
>>
>>
>> ------------------------------------------------------------------------
>>
>> Event Registry on events.central:
>> */events-gate*
>>
>>
>>     defect.sunos.eft.undiagnosable_problem
>>
>>
>> *Stability*
>>
>> Private
>>
>> *Description*
>>
>> EFT unable to produce a diagnosis defect
>>
>> *Sets containing this event*
>>
>> NotArcd <http://events.central/event/../set/FMA/NotArcd>
>>
>>
>> *Event Payload version 0*
>>
>>
>>      
>> *Event Payload inherited from defect 
>> <http://events/central/event/FMA/defect>*: 
>>        *Name         Type      Description*
>>        certainty     uint8_t   Certainty [0..100] for this entry in list
>>        class         string    The event class
>>        module        fmri      The defective software module
>>        package       fmri      The package containing the defective software
>>        version       uint8_t   The major version of this event class
>>      
>> *Event Payload inherited from defect.sunos 
>> <http://events/central/event/FMA/defect/sunos>*: NONE
>>      
>> *Event Payload inherited from defect.sunos.eft 
>> <http://events/central/event/FMA/defect/sunos/eft>*: NONE
>>
>> *Event Payload:*
>>        *Name         Type      Description*
>>        reason        string    text description of why eft unable to produce a diagnosis
>>
>>     
>
>   

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org
