I suspect memory errors on my Sol 10 u8 system, but "fmdump -eV" reports no
memory errors. All the errors and events it shows are ZFS related.

The initial symptom is that a scrub started on a freshly booted system
completes properly, but the same operation after the system has been up for a
few days causes a kernel panic. Immediately after a reboot, a scrub again
completes normally. This behavior suggests bit fade to me.

This has been very consistent for the last few months. The system is an HP
Z400 which is 10 years old and has generally run 24x7. It was certified by Sun
for Solaris 10, which is why I bought it, and it uses unbuffered, unregistered
ECC DDR3 DIMMs. Since my initial purchase I have bought three more Z400s.

Recently the system became unstable to the point that I have not been able to
complete a "zfs send -R" to a 12 TB WD USB drive. My last attempt, using a
Hipster LiveImage, died after ~25 hours.

My Hipster 2017.10 system shows some events which appear to be ECC related, but
I'm not able to interpret them. I've attached a file with the last such event.
Not sure that will work, but worth trying. This is from my regular internet
access host, so it is up 24x7 with few exceptions.

Except for the CPU and memory, the machines are almost identical. The Hipster
machine is an older 4 DIMM slot machine with the same 3-way mirror on s0 and
3-disk RAIDZ1 on s1. The Sol 10 system is a 6 DIMM slot model and has a 3 TB
mirrored scratch pool in addition to the s0 & s1 root and export pools.

It seems unlikely that I could simply swap the disks between the two, but I can
install Hipster on a single drive for rpool, attempt to copy the scratch pool
(spool) with that, and simply run it for a while as a test.

I've read everything I can find about the Fault Manager, but that has produced
more questions than answers.

This is for Hipster 2017.10:
sun_x86%rhb {82} fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
disk-lights              1.0     active  Disk Lights Agent
disk-transport           1.0     active  Disk Transport Agent
eft                      1.16    active  eft diagnosis engine
ext-event-transport      0.2     active  External FM event transport
fabric-xlate             1.0     active  Fabric Ereport Translater
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                2.0     active  I/O Retire Agent
sensor-transport         1.1     active  Sensor Transport Agent
ses-log-transport        1.0     active  SES Log Transport Agent
software-diagnosis       0.1     active  Software Diagnosis engine
software-response        0.1     active  Software Response Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.1     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent

The module list is a little longer than on Sol 10 u8, but cpumem-retire
version 1.1 appears on both.

Suggestions?
Thanks,
Reg

Feb 26 2018 07:45:04.212790281 ereport.cpu.intel.quickpath.mem_ce
nvlist version: 0
    class = ereport.cpu.intel.quickpath.mem_ce
    ena = 0x98a2158a97802001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = hc
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = motherboard
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = chip
            hc-id = 0
        (end hc-list[1])
        (start hc-list[2])
        nvlist version: 0
            hc-name = memory-controller
            hc-id = 0
        (end hc-list[2])
    (end detector)
    compound_errorname = MC_CH_RD_ERR
    IA32_MCG_STATUS = 0x0
    machine_check_in_progress = 0
    bank_number = 0x8
    bank_msr_offset = 0x420
    IA32_MCi_STATUS = 0x8c0000400001009f
    overflow = 0
    error_uncorrected = 0
    error_enabled = 0
    processor_context_corrupt = 0
    error_code = 0x9f
    model_specific_error_code = 0x1
    threshold_based_error_status = No tracking
    IA32_MCi_ADDR = 0xc28d2b40
    IA32_MCi_MISC = 0xe6323d8000010885
    ECC-syndrome = 0xe6323d80
    physaddr = 0xc28d2b40
    resource = (array of embedded nvlists)
    (start resource[0])
    nvlist version: 0
        version = 0x0
        scheme = hc
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = motherboard
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = chip
            hc-id = 0
        (end hc-list[1])
        (start hc-list[2])
        nvlist version: 0
            hc-name = memory-controller
            hc-id = 0
        (end hc-list[2])
        (start hc-list[3])
        nvlist version: 0
            hc-name = dram-channel
            hc-id = 0
        (end hc-list[3])
        (start hc-list[4])
        nvlist version: 0
            hc-name = dimm
            hc-id = 1
        (end hc-list[4])
        hc-specific = (embedded nvlist)
        nvlist version: 0
            offset = 0xffffffffffffffff
        (end hc-specific)
    (end resource[0])
    mem_cor_ecc_counter = 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
    mem_cor_ecc_counter_last = 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
    __ttl = 0x1
    __tod = 0x5a940f60 0xcaeec09
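
For anyone trying to read the event above, here is a minimal sketch of how the
IA32_MCi_STATUS word can be unpacked. It assumes the architectural machine-check
bit layout described in the Intel SDM (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits
63-57, corrected error count in bits 52-38 where the CPU supports it, MCA error
code in bits 15-0). The function name, the dictionary keys and the
MEM_TRANSACTIONS table are made up for illustration and are not part of fmd or
fmdump.

```python
# Hypothetical helper: decode the architectural fields of IA32_MCi_STATUS.
# Bit positions are assumed from the Intel SDM machine-check chapter.

# MMM field of a memory-controller compound error code (0000 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0b000: "generic",
    0b001: "read",
    0b010: "write",
    0b011: "address/command",
    0b100: "scrub",
}


def decode_mci_status(status):
    """Return a dict of fields extracted from a 64-bit status value."""
    def bit(n):
        return (status >> n) & 1

    mca_code = status & 0xFFFF
    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "error_uncorrected": bit(61),
        "error_enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "processor_context_corrupt": bit(57),
        # Only meaningful if the CPU implements corrected-error counting.
        "corrected_error_count": (status >> 38) & 0x7FFF,
        "model_specific_error_code": (status >> 16) & 0xFFFF,
        "error_code": mca_code,
    }
    # Memory-controller errors match the pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        fields["memory_transaction"] = MEM_TRANSACTIONS.get(mmm, "reserved")
        channel = mca_code & 0xF
        # CCCC == 1111 means the channel is not specified in the code.
        fields["channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x8C0000400001009F).items():
        print(f"{name} = {value}")
```

Under those assumptions the decoded flags agree with what the ereport already
lists (error_uncorrected = 0, processor_context_corrupt = 0, error_code = 0x9f),
and the error code falls in the memory-controller read class, which is
consistent with compound_errorname = MC_CH_RD_ERR and the mem_ce class name.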
_______________________________________________
openindiana-discuss mailing list
https://openindiana.org/mailman/listinfo/openindiana-discuss