Thanks for all the work here, Uwe! I see that the OSX builds are a part of your TODOs, but with the new hardware, do you expect the OSX VM to be faster, or is the VM not living on the same hardware? We see a ton of OSX build failures because the "eventual consistency" in the tests doesn't account for hardware as slow as the OSX VM...
- Houston

On Tue, Mar 18, 2025 at 12:03 PM Uwe Schindler <u...@thetaphi.de> wrote:

> P.S.: Fun fact: The old Policeman server's NVMe SSDs were long past
> their rated lifetime - that was the main reason to replace it (the
> failing network adaptor just moved it up). Lucene did a good job of
> burning out the SSDs. It is still interesting that Lucene/Solr's tests
> write more than they read(!?!???!):
>
> # nvme smart-log /dev/nvme0
> Smart Log for NVME device:nvme0 namespace-id:ffffffff
> critical_warning : 0
> temperature : 45 °C (318 K)
> available_spare : 100%
> available_spare_threshold : 10%
> *percentage_used : 211%*
> endurance group critical warning summary: 0
> *Data Units Read : 317332814 (162.47 TB)
> Data Units Written : 2383037910 (1.22 PB)*
> host_read_commands : 10452268853
> host_write_commands : 57744004908
> controller_busy_time : 73212
> power_cycles : 9
> power_on_hours : 45321
> unsafe_shutdowns : 4
> media_errors : 0
> num_err_log_entries : 0
> Warning Temperature Time : 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1 : 45 °C (318 K)
> Thermal Management T1 Trans Count : 0
> Thermal Management T2 Trans Count : 0
> Thermal Management T1 Total Time : 0
> Thermal Management T2 Total Time : 0
>
> # nvme smart-log /dev/nvme1
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning : 0
> temperature : 42 °C (315 K)
> available_spare : 100%
> available_spare_threshold : 10%
> *percentage_used : 217%*
> endurance group critical warning summary: 0
> *Data Units Read : 152984082 (78.33 TB)
> Data Units Written : 2385237910 (1.22 PB)*
> host_read_commands : 1870329041
> host_write_commands : 57743490085
> controller_busy_time : 62644
> power_cycles : 9
> power_on_hours : 45321
> unsafe_shutdowns : 4
> media_errors : 0
> num_err_log_entries : 0
> Warning Temperature Time : 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1 : 42 °C (315 K)
> Thermal Management T1 Trans Count : 0
> Thermal Management T2 Trans Count : 0
> Thermal Management T1 Total Time : 0
> Thermal Management T2 Total Time : 0
>
> Uwe
>
> On 18.03.2025 at 17:52, Uwe Schindler wrote:
> >
> > Moin moin,
> >
> > Policeman Jenkins got new hardware yesterday - no functional changes.
> >
> > Background: The old server had some strange problems with the
> > networking adaptor (Intel's "igb" kernel driver) reporting "Detected
> > Tx Unit Hang". This caused some short downtimes, and the monitoring
> > complained all the time about lost pings, which drove me crazy over
> > the weekend. It worked better after a restart and after downgrading
> > the kernel, but since I was about to replace the machine with a newer
> > one anyway, I ordered a replacement on the new hardware generation
> > (previously it was a Hetzner AX51-NVME; now it is a Hetzner AX52).
> >
> > The migration started yesterday around lunch time in Europe (12:00
> > CET): both servers were booted into the datacenter's network-booted
> > recovery environment with temporary IPs, then both root disks were
> > mounted and copied with one large rsync (with checksums, extended
> > attributes, numeric uid/gid and the delete option). Luckily this
> > worked with the old server (the Intel adapter did not break). The
> > whole downtime should have been only 1 to 1.5 hours (the time the
> > copy at 1 GBit/s plus reconfiguration needs), but unfortunately PCI
> > Express on the new server complained about (recoverable) errors on
> > the NVMe communication.
> > After some support roundtrips (they first replaced only the failing
> > NVMe controller, which did not help), they replaced the whole server.
> >
> > At 18:30 CET, I started the copy to the new server again and all went
> > well; dmesg showed no PCI Express checksum errors. Finally, after
> > fixing the boot setup (the old server used MBR, the new one EFI), the
> > server was mounted at its original location by the datacenter team,
> > and all IPv4 addresses and the IPv6 network were available. Since
> > then (approx. 20:30 CET), Policeman Jenkins has been back up and
> > running.
> >
> > The TODOs for the future:
> >
> > * Replace the MacOS VM and update it to a new version (it's
> >   complicated, as it is a "Hackintosh", so according to Apple it
> >   shouldn't exist at all)
> > * Possibly migrate away from VirtualBox to KVM, but it's unclear
> >   whether Hackintoshes work there.
> >
> > Have fun with the new hardware; the builds on the Lucene main branch
> > are now 1.5 times faster (10 instead of 15 minutes).
> >
> > The new hardware is described here:
> > https://www.hetzner.com/dedicated-rootserver/ax52/; it has AVX-512...
> > let's see what comes out. No test failures yet.
> >
> > vendor_id : AuthenticAMD
> > cpu family : 25
> > model : 97
> > model name : AMD Ryzen 7 7700 8-Core Processor
> > stepping : 2
> > microcode : 0xa601209
> > cpu MHz : 5114.082
> > cache size : 1024 KB
> > physical id : 0
> > siblings : 16
> > core id : 7
> > cpu cores : 8
> > apicid : 15
> > initial apicid : 15
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 16
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> > pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> > pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology
> > nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor
> > ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand
> > lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
> > 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
> > bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba
> > perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2
> > smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap
> > avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt
> > xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
> > cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru
> > wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
> > flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic
> > v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
> > vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
> > overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
> > bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso
> > bogomips : 7585.28
> > TLB size : 3584 4K pages
> > clflush size : 64
> > cache_alignment : 64
> > address sizes : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
> >
> > # lspci | fgrep -i volati
> > 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
> > NVMe SSD Controller PM9A1/PM9A3/980PRO
> > 02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400
> > NVMe SSD [Hendrix]
> >
> > I have no idea why the replacement server has two different NVMe
> > SSDs, but you never know in advance what you get! From the SMART info
> > I know that both SSDs were fresh (only 6 hours of total uptime).
> >
> > Uwe
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
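The recovery-environment copy described in the quoted migration report boils down to a single rsync between the two mounted root filesystems. A minimal sketch follows; the mountpoints, the temporary target IP, and the -a/-H/-A flags are illustrative assumptions, while only the checksum, extended-attribute, numeric uid/gid and delete options come from the mail itself:

  # Run from the recovery system on the old server, with the old root
  # filesystem mounted at /mnt/old-root (hypothetical path).
  # -a archive mode, -H hard links, -A ACLs, -X extended attributes,
  # -c compare files by checksum, --numeric-ids keep raw uid/gid values,
  # --delete remove files on the target that no longer exist on the source.
  rsync -aHAXc --numeric-ids --delete --info=progress2 \
      /mnt/old-root/ root@203.0.113.10:/mnt/new-root/

As a side note on the SMART output quoted above: NVMe "data units" are counted in blocks of 1000 × 512 bytes, so 2383037910 written units is roughly 1.22 PB, matching the figure in the log and dwarfing the roughly 162 TB read on the same device.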