Re: Policeman Jenkins => new hardware

Uwe Schindler Tue, 18 Mar 2025 10:04:10 -0700

P.S.: Fun fact: The old policeman server's NVME SSDs were long past their rated lifetime, so that was the main reason to replace it (the failing network adapter just came first). Lucene did a good job of burning through the SSDs. It is still interesting that Lucene/Solr's tests write more than they read(!?!???!):


# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 45 °C (318 K)
available_spare                         : 100%
available_spare_threshold               : 10%
*percentage_used : 211%*
endurance group critical warning summary: 0

*Data Units Read : 317332814 (162.47 TB)*
*Data Units Written : 2383037910 (1.22 PB)*

host_read_commands                      : 10452268853
host_write_commands                     : 57744004908
controller_busy_time                    : 73212
power_cycles                            : 9
power_on_hours                          : 45321
unsafe_shutdowns                        : 4
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1                    : 45 °C (318 K)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0


# nvme smart-log /dev/nvme1
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 42 °C (315 K)
available_spare                         : 100%
available_spare_threshold               : 10%
*percentage_used : 217%*
endurance group critical warning summary: 0

*Data Units Read : 152984082 (78.33 TB)*
*Data Units Written : 2385237910 (1.22 PB)*

host_read_commands                      : 1870329041
host_write_commands                     : 57743490085
controller_busy_time                    : 62644
power_cycles                            : 9
power_on_hours                          : 45321
unsafe_shutdowns                        : 4
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1                    : 42 °C (315 K)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
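
As a rough cross-check of the numbers in the two SMART logs above, here is a minimal Python sketch. It assumes the NVMe convention that one "data unit" is 1,000 blocks of 512 bytes (512,000 bytes); the function and variable names are made up for illustration and are not part of nvme-cli.

```python
# Assumption: one NVMe SMART "data unit" = 1000 * 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


# Values copied from the two smart-log outputs above.
drives = {
    "nvme0": {"read": 317332814, "written": 2383037910},
    "nvme1": {"read": 152984082, "written": 2385237910},
}


def summarize(stats: dict) -> dict:
    """Return read/written volume in TB and the write-to-read ratio per drive."""
    summary = {}
    for name, counters in stats.items():
        summary[name] = {
            "read_tb": data_units_to_tb(counters["read"]),
            "written_tb": data_units_to_tb(counters["written"]),
            "write_read_ratio": counters["written"] / counters["read"],
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(drives).items():
        print(
            f"{name}: read {row['read_tb']:.2f} TB, "
            f"written {row['written_tb']:.2f} TB, "
            f"ratio {row['write_read_ratio']:.1f}x"
        )
```

Under that assumption the converted values agree with the TB/PB figures nvme-cli printed, and both drives wrote several times more data than they read. A percentage_used value above 100% means the drive has exceeded the vendor's estimated endurance; it does not by itself mean the drive has failed.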

Uwe

On 18.03.2025 at 17:52, Uwe Schindler wrote:

Moin moin,

Policeman Jenkins got new hardware yesterday - no functional changes.
Background: The old server had some strange problems with the network adapter (Intel's "igb" kernel driver) reporting "Detected Tx Unit Hang". This caused some short downtimes, and the monitoring complained all the time about lost pings, which drove me crazy over the weekend. It worked better after a restart and also with a kernel downgrade, but as I was about to replace the machine with a newer one anyway, I ordered a replacement with the new hardware version (previously it was Hetzner AX51-NVME; now it is Hetzner AX52).
The migration started yesterday at lunchtime European time (12:00 CET) by booting both servers from the network into the datacenter's recovery environment (with a temporary IP), then mounting both root disks and doing a large rsync (with checksums, extended attributes, numeric uid/gid and the delete option). Luckily this worked with the old server (the Intel adapter did not break). The whole downtime should have taken only 1 to 1.5 hours (the time needed for the copy at 1 GBit/s plus reconfiguration), but unfortunately PCI Express on the new server complained about (recoverable) errors in the NVME communication. After some support round trips (they first replaced only the failing NVME controller, which did not help), they replaced the whole server.
At 18:30 CET, I started the copy to the new server again and all went well; dmesg showed no PCI Express checksum errors. Finally, after fixing the boot setup (the old server used MBR, the new one EFI), the server was mounted at the original location by the team and all IPv4 addresses and the IPv6 network were available. Since then (approx. 20:30 CET), Policeman Jenkins is back and running.
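
For readers who want to reproduce the rsync step described above, a minimal Python sketch of how such a command line could be assembled follows. The exact invocation used for the migration is not given in this mail, so this is only a guess at it: the mount points are hypothetical, and `--archive` is an assumption added on top of the options that were mentioned (checksums, extended attributes, numeric uid/gid, delete).

```python
import shlex
from typing import List


def build_rsync_command(src: str, dst: str, dry_run: bool = False) -> List[str]:
    """Build an rsync argument list for copying a mounted root filesystem."""
    cmd = [
        "rsync",
        "--archive",      # assumption: preserve permissions, times, symlinks, devices
        "--checksum",     # compare files by checksum instead of size and mtime
        "--xattrs",       # preserve extended attributes
        "--numeric-ids",  # keep numeric uid/gid, do not map by user or group name
        "--delete",       # remove files on the target that are gone on the source
    ]
    if dry_run:
        cmd.append("--dry-run")  # only report what would be transferred
    cmd += [src, dst]
    return cmd


if __name__ == "__main__":
    # Hypothetical mount points inside the recovery environment.
    command = build_rsync_command("/mnt/old/", "/mnt/new/", dry_run=True)
    print(shlex.join(command))
```

The sketch only prints the command instead of running it. Starting with `--dry-run` is a cheap way to check the trailing slashes on the paths before `--delete` gets a chance to remove anything.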
The TODOs for the future:

  * Replace the macOS VM and update it to a new version (it's
    complicated, as it is a "Hackintosh", so it shouldn't be there
    according to Apple)
  * Possibly migrate away from VirtualBox to KVM, but it's unclear if
    Hackintoshes work there.
Have fun with the new hardware; the builds on the Lucene main branch are now 1.5 times faster (10 instead of 15 minutes).
The new hardware is described here: https://www.hetzner.com/dedicated-rootserver/ax52/ It has AVX-512... let's see what comes out. No test failures yet.
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 97
model name      : AMD Ryzen 7 7700 8-Core Processor
stepping        : 2
microcode       : 0xa601209
cpu MHz         : 5114.082
cache size      : 1024 KB
physical id     : 0
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
bogomips        : 7585.28
TLB size        : 3584 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
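
To pick the AVX-512 entries out of a flags line like the one above, a small Python sketch could look like this. It assumes the usual Linux /proc/cpuinfo layout of "key : value" lines; the helper names and the shortened sample text are hypothetical.

```python
from typing import List, Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def avx512_flags(cpuinfo_text: str) -> List[str]:
    """Return all flags whose name starts with 'avx512', sorted alphabetically."""
    return sorted(f for f in cpu_flags(cpuinfo_text) if f.startswith("avx512"))


# Shortened sample in the same format as the output above (not the full list).
SAMPLE = """\
model name      : AMD Ryzen 7 7700 8-Core Processor
flags           : fpu vme avx avx2 avx512f avx512dq avx512_vnni
bogomips        : 7585.28
"""

if __name__ == "__main__":
    print(avx512_flags(SAMPLE))
```

On a Linux machine the same functions can be fed the contents of /proc/cpuinfo instead of the sample string.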

# lspci | fgrep -i volati
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
I have no idea why the replacement server has two different NVME SSDs, but you never know beforehand what you get! From the SMART info I know that both SSDs were fresh (only 6 hours of total uptime).
Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: [email protected]


