Thanks for all the work here, Uwe! I see that the OSX builds are a part of your TODOs, but with the new hardware, do you expect the OSX VM to be faster, or is the VM not living on the same hardware? We see a ton of OSX build failures because the "eventual consistency" in the tests doesn't account for hardware as slow as the OSX VM...
- Houston

On Tue, Mar 18, 2025 at 12:03 PM Uwe Schindler <u...@thetaphi.de> wrote:

> P.S.: Fun fact: The old Policeman server's NVMe SSDs were long past
> their rated lifetime - that was the main reason to replace it (the
> failing network adaptor just moved it up). Lucene did a good job of
> burning out the SSDs. It is still interesting that Lucene/Solr's tests
> write more than they read(!?!???!):
>
> # nvme smart-log /dev/nvme0
> Smart Log for NVME device:nvme0 namespace-id:ffffffff
> critical_warning : 0
> temperature : 45 °C (318 K)
> available_spare : 100%
> available_spare_threshold : 10%
> *percentage_used : 211%*
> endurance group critical warning summary: 0
> *Data Units Read : 317332814 (162.47 TB)
> Data Units Written : 2383037910 (1.22 PB)*
> host_read_commands : 10452268853
> host_write_commands : 57744004908
> controller_busy_time : 73212
> power_cycles : 9
> power_on_hours : 45321
> unsafe_shutdowns : 4
> media_errors : 0
> num_err_log_entries : 0
> Warning Temperature Time : 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1 : 45 °C (318 K)
> Thermal Management T1 Trans Count : 0
> Thermal Management T2 Trans Count : 0
> Thermal Management T1 Total Time : 0
> Thermal Management T2 Total Time : 0
>
> # nvme smart-log /dev/nvme1
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning : 0
> temperature : 42 °C (315 K)
> available_spare : 100%
> available_spare_threshold : 10%
> *percentage_used : 217%*
> endurance group critical warning summary: 0
> *Data Units Read : 152984082 (78.33 TB)
> Data Units Written : 2385237910 (1.22 PB)*
> host_read_commands : 1870329041
> host_write_commands : 57743490085
> controller_busy_time : 62644
> power_cycles : 9
> power_on_hours : 45321
> unsafe_shutdowns : 4
> media_errors : 0
> num_err_log_entries : 0
> Warning Temperature Time : 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1 : 42 °C (315 K)
> Thermal Management T1 Trans Count : 0
> Thermal Management T2 Trans Count : 0
> Thermal Management T1 Total Time : 0
> Thermal Management T2 Total Time : 0
>
> Uwe
>
> On 18.03.2025 at 17:52, Uwe Schindler wrote:
> >
> > Moin moin,
> >
> > Policeman Jenkins got new hardware yesterday - no functional changes.
> >
> > Background: The old server had some strange problems with the
> > networking adaptor (Intel's "igb" kernel driver) reporting "Detected
> > Tx Unit Hang". This caused some short downtimes, and the monitoring
> > complained all the time about lost pings, which drove me crazy over
> > the weekend. It worked better after a restart and after downgrading
> > the kernel, but since I was about to replace the machine with a newer
> > one anyway, I ordered a replacement on the new hardware generation
> > (previously it was a Hetzner AX51-NVME; now it is a Hetzner AX52).
> >
> > The migration started yesterday around lunch time in Europe (12:00
> > CET): both servers were booted into the datacenter's network-booted
> > recovery environment with temporary IPs, then both root disks were
> > mounted and copied with one large rsync (with checksums, extended
> > attributes, numeric uid/gid and the delete option). Luckily this
> > worked with the old server (the Intel adapter did not break). The
> > whole downtime should have been only 1 to 1.5 hours (the time the
> > copy at 1 GBit/s plus reconfiguration needs), but unfortunately PCI
> > Express on the new server complained about (recoverable) errors on
> > the NVMe communication.
> > After some support roundtrips (they first replaced only the failing
> > NVMe controller, which did not help), they replaced the whole server.
> >
> > At 18:30 CET, I started the copy to the new server again and all went
> > well; dmesg showed no PCI Express checksum errors. Finally, after
> > fixing the boot setup (the old server used MBR, the new one EFI), the
> > server was mounted at its original location by the datacenter team,
> > and all IPv4 addresses and the IPv6 network were available. Since
> > then (approx. 20:30 CET), Policeman Jenkins has been back up and
> > running.
> >
> > The TODOs for the future:
> >
> > * Replace the MacOS VM and update it to a new version (it's
> >   complicated, as it is a "Hackintosh", so according to Apple it
> >   shouldn't exist at all)
> > * Possibly migrate away from VirtualBox to KVM, but it's unclear
> >   whether Hackintoshes work there.
> >
> > Have fun with the new hardware; the builds on the Lucene main branch
> > are now 1.5 times faster (10 instead of 15 minutes).
> >
> > The new hardware is described here:
> > https://www.hetzner.com/dedicated-rootserver/ax52/; it has AVX-512...
> > let's see what comes out. No test failures yet.
> >
> > vendor_id : AuthenticAMD
> > cpu family : 25
> > model : 97
> > model name : AMD Ryzen 7 7700 8-Core Processor
> > stepping : 2
> > microcode : 0xa601209
> > cpu MHz : 5114.082
> > cache size : 1024 KB
> > physical id : 0
> > siblings : 16
> > core id : 7
> > cpu cores : 8
> > apicid : 15
> > initial apicid : 15
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 16
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> > pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> > pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology
> > nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor
> > ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand
> > lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
> > 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
> > bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba
> > perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2
> > smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap
> > avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt
> > xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
> > cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru
> > wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
> > flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic
> > v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
> > vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
> > overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
> > bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
> > bogomips : 7585.28
> > TLB size : 3584 4K pages
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
> >
> > # lspci | fgrep -i volati
> > 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
> > NVMe SSD Controller PM9A1/PM9A3/980PRO
> > 02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400
> > NVMe SSD [Hendrix]
> >
> > I have no idea why the replacement server has two different NVMe
> > SSDs, but you never know in advance what you get! From the SMART info
> > I know that both SSDs were fresh (only 6 hours of total uptime).
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
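The recovery-environment copy described in the quoted migration report boils down to a single rsync between the two mounted root filesystems. A minimal sketch follows; the mountpoints, the temporary target IP, and the -a/-H/-A flags are illustrative assumptions, while only the checksum, extended-attribute, numeric uid/gid and delete options come from the mail itself:

  # Run from the recovery system on the old server, with the old root
  # filesystem mounted at /mnt/old-root (hypothetical path).
  # -a archive mode, -H hard links, -A ACLs, -X extended attributes,
  # -c compare files by checksum, --numeric-ids keep raw uid/gid values,
  # --delete remove files on the target that no longer exist on the source.
  rsync -aHAXc --numeric-ids --delete --info=progress2 \
      /mnt/old-root/ root@203.0.113.10:/mnt/new-root/

As a side note on the SMART output quoted above: NVMe "data units" are counted in blocks of 1000 × 512 bytes, so 2383037910 written units is roughly 1.22 PB, matching the figure in the log and dwarfing the roughly 162 TB read on the same device.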