P.S.: Fun fact: The old Policeman server's NVME SSDs were far past
their rated lifetime, so that was the main reason to replace it (the
failing network adapter just happened first). Lucene did a good job of
burning through the SSDs. It is still interesting that Lucene/Solr's
tests write far more than they read (!):
# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 45 °C (318 K)
available_spare : 100%
available_spare_threshold : 10%
*percentage_used : 211%*
endurance group critical warning summary: 0
*Data Units Read : 317332814 (162.47 TB)*
*Data Units Written : 2383037910 (1.22 PB)*
host_read_commands : 10452268853
host_write_commands : 57744004908
controller_busy_time : 73212
power_cycles : 9
power_on_hours : 45321
unsafe_shutdowns : 4
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 45 °C (318 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning : 0
temperature : 42 °C (315 K)
available_spare : 100%
available_spare_threshold : 10%
*percentage_used : 217%*
endurance group critical warning summary: 0
*Data Units Read : 152984082 (78.33 TB)*
*Data Units Written : 2385237910 (1.22 PB)*
host_read_commands : 1870329041
host_write_commands : 57743490085
controller_busy_time : 62644
power_cycles : 9
power_on_hours : 45321
unsafe_shutdowns : 4
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 42 °C (315 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
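A rough cross-check of the numbers above, as a minimal sketch: the NVMe SMART log counts one data unit as 1000 blocks of 512 bytes, so the raw counters convert to the TB/PB figures printed by nvme-cli and give the write-to-read ratio directly. The helper names are made up for illustration; the counters are copied from the two logs.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes (SMART / Health log definition).
BYTES_PER_DATA_UNIT = 1000 * 512

# Counters copied from the two smart-log outputs: (read, written) data units.
DRIVES = {
    "nvme0": (317332814, 2383037910),
    "nvme1": (152984082, 2385237910),
}
POWER_ON_HOURS = 45321  # identical for both drives in the logs


def units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def write_read_ratio(read_units: int, written_units: int) -> float:
    """Bytes written for every byte read."""
    return written_units / read_units


def avg_write_mb_per_s(written_units: int, hours: int) -> float:
    """Average write rate in MB/s over the whole power-on time."""
    return written_units * BYTES_PER_DATA_UNIT / (hours * 3600) / 1e6


if __name__ == "__main__":
    for name, (read_units, written_units) in DRIVES.items():
        print(
            f"{name}: read {units_to_tb(read_units):.2f} TB, "
            f"written {units_to_tb(written_units):.2f} TB, "
            f"write/read ratio {write_read_ratio(read_units, written_units):.1f}, "
            f"avg write {avg_write_mb_per_s(written_units, POWER_ON_HOURS):.1f} MB/s"
        )
```

With these counters the write/read ratio comes out at roughly 7.5 for nvme0 and 15.6 for nvme1.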
Uwe
On 18.03.2025 at 17:52, Uwe Schindler wrote:
Moin moin,
Policeman Jenkins got new hardware yesterday - no functional changes.
Background: The old server had some strange problems with the network
adapter (Intel's "igb" kernel driver), which reported "Detected Tx
Unit Hang". This caused some short downtimes, and the monitoring
complained all the time about lost pings, which drove me crazy over the
weekend. It worked better after a restart and also after a kernel
downgrade, but as I was about to replace the machine with a newer one
anyway, I ordered a replacement with the newer hardware generation
(previously it was a Hetzner AX51-NVME; now it is a Hetzner AX52).
The migration started yesterday at lunchtime European time (12:00
CET) by booting both servers from the network into the datacenter's
recovery environment (the new one with a temporary IP), then mounting
both root disks and doing a large rsync (with checksums, extended
attributes, numeric uid/gid and the delete option). Luckily this worked
with the old server (the Intel adapter did not break). The whole
downtime should have taken only 1 to 1.5 hours (the time needed for the
copy over 1 Gbit/s plus the reconfiguration), but unfortunately PCI
Express on the new server complained about (recoverable) errors in the
NVME communication. After some support round trips (they first
replaced only the failing NVME controller, which did not help), they
replaced the whole server.
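The rsync options listed above map onto standard rsync flags. The following is only a sketch of how such a call could be assembled, assuming the attributes mentioned are extended attributes (rsync's --xattrs); the mount points and the temporary address are hypothetical, and the exact command used for the migration is not given in this mail.

```python
import shlex


def build_rsync_command(src: str, dest: str) -> list[str]:
    """Assemble an rsync call for copying a mounted root filesystem."""
    return [
        "rsync",
        "--archive",      # recurse and keep permissions, times, owners, symlinks, devices
        "--checksum",     # compare file contents by checksum instead of size and mtime
        "--xattrs",       # copy extended attributes
        "--numeric-ids",  # keep raw uid/gid numbers instead of mapping by name
        "--delete",       # remove destination files that no longer exist on the source
        src,
        dest,
    ]


if __name__ == "__main__":
    # Hypothetical paths and address (documentation IP range), for illustration only.
    cmd = build_rsync_command("/mnt/oldroot/", "root@192.0.2.10:/mnt/newroot/")
    # Only print the command; running it is left to the operator.
    print(shlex.join(cmd))
```

Numeric ids matter here because the recovery environment has its own passwd/group files, so mapping owners by name would assign wrong uids and gids on the target.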
At 18:30 CET, I started the copy to the new server again and all went
well; dmesg showed no PCI Express checksum errors. Finally, after fixing
the boot setup (the old server used MBR, the new one EFI), the server
was mounted at the original location by the team and all IPv4 addresses
and the IPv6 network were available. Since then (approx. 20:30 CET),
Policeman Jenkins has been back up and running.
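For the MBR versus EFI difference, a quick way to see how a running Linux system was booted is to look for the /sys/firmware/efi directory, which the kernel only exposes after an EFI boot. A minimal sketch (the function name is made up for illustration):

```python
from pathlib import Path


def booted_via_efi(efi_dir: str = "/sys/firmware/efi") -> bool:
    """Return True if the given directory exists, i.e. the kernel was started via EFI."""
    return Path(efi_dir).is_dir()


if __name__ == "__main__":
    mode = "EFI" if booted_via_efi() else "legacy BIOS (MBR-style boot)"
    print(f"This system was booted via {mode}")
```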
The TODOs for the future:
* Replace the macOS VM and update it to a new version (it's
complicated, as it is a "Hackintosh", so it shouldn't be there
according to Apple)
* Possibly migrate away from VirtualBox to KVM, but it's unclear if
Hackintoshes work there.
Have fun with the new hardware, the builds on Lucene main branch are
now 1.5 times faster (10 instead of 15 minutes).
The new hardware is described here:
https://www.hetzner.com/dedicated-rootserver/ax52/; it has AVX-512, so
let's see what comes out. No test failures yet.
vendor_id : AuthenticAMD
cpu family : 25
model : 97
model name : AMD Ryzen 7 7700 8-Core Processor
stepping : 2
microcode : 0xa601209
cpu MHz : 5114.082
cache size : 1024 KB
physical id : 0
siblings : 16
core id : 7
cpu cores : 8
apicid : 15
initial apicid : 15
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl
xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq
monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c
rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba
perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2
smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap
avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt
xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru
wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic
v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
bugs : sysret_ss_attrs spectre_v1 spectre_v2
spec_store_bypass srso
bogomips : 7585.28
TLB size : 3584 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
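To pick the AVX-512 extensions out of a flags line like the one above, a small filter is enough. This is only a sketch: the sample string is a shortened copy of the flags from this dump, and the function names are made up for illustration.

```python
def avx512_flags(flags_line: str) -> list[str]:
    """Return the sorted AVX-512 feature flags found in a /proc/cpuinfo flags line."""
    # Drop the "flags :" label if present, then split the value on whitespace.
    value = flags_line.split(":", 1)[-1]
    return sorted(flag for flag in value.split() if flag.startswith("avx512"))


def read_flags_line(path: str = "/proc/cpuinfo") -> str:
    """Return the first flags line of the given cpuinfo file (Linux only)."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("flags"):
                return line
    return ""


# Shortened sample: a few baseline flags plus the AVX-512 flags from the dump.
SAMPLE = (
    "flags : fpu sse sse2 avx avx2 avx512f avx512dq avx512ifma avx512cd "
    "avx512bw avx512vl avx512_bf16 avx512vbmi avx512_vbmi2 avx512_vnni "
    "avx512_bitalg avx512_vpopcntdq"
)

if __name__ == "__main__":
    for flag in avx512_flags(SAMPLE):
        print(flag)
```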
# lspci | fgrep -i volati
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
NVMe SSD Controller PM9A1/PM9A3/980PRO
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400
NVMe SSD [Hendrix]
I have no idea why the replacement server has two different NVME SSDs,
but you never know in advance what you will get! From the SMART info I
know that both SSDs were fresh (only 6 hours of total uptime).
Uwe
--
Uwe Schindler Achterdiek 19, D-28357 Bremen https://www.thetaphi.de
eMail: u...@thetaphi.de