Re: [BUG] amdgpu GPU fault detected in VM context
On Tue, 30 Jan 2024 15:28:00 -0400 Jose Maldonado wrote:
>
> Hello everyone!
>
> I have been detecting a bug in the new DRM code corresponding to
> amdgpu using an RX580 (Polaris10). I am currently running -current
>
> **OpenBSD 7.4 GENERIC.MP#1637 amd64**
>
> And I'm seeing these errors that appear shortly after starting
> Xenocara.
>
> drm:pid13609:gmc_v8_0_process_interrupt *ERROR* GPU fault detected:
> 146 0x0420920c for process Xorg pid 63811 thread Xorg pid 349016
> drm:pid13609:gmc_v8_0_process_interrupt *ERROR*
> VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010C284
> drm:pid13609:gmc_v8_0_process_interrupt *ERROR*
> VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0809200C
>

Adding more information: the error was just triggered again and I was able to capture different output in the dmesg. This line in particular caught my attention:

Jan 30 18:22:29 volfread /bsd: WARNING acrtc_attach->pflip_status != AMDGPU_FLIP_NONE failed at /usr/src/sys/dev/pci/drm/amd/display/amdgpu_dm/amdgpu_dm.c:8293

Reviewing the code in question, I see that it is related to the management of page flipping and its interaction with VRR (FreeSync), VSync and Mesa.

I have disabled the VRR option on the monitor. In Xenocara I have always used the defaults, without a configuration file, so VRR should already be inactive there, since it is off by default in amdgpu.

I'll try to see if this solves the problem, and I'll let you know if there's any progress.

--
* God in his heaven, all is well on Earth

[Attachment: dmesg-on-crash (binary data)]
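Since the fault addresses reportedly change from one occurrence to the next, it may help to collect them from a saved dmesg and compare. The following is only a minimal sketch: the function name `collect_faults` is made up for illustration, it assumes each message sits on one unwrapped line as in a raw dmesg, and it assumes every FAULT_ADDR line is followed by its matching FAULT_STATUS line. It extracts the raw register values and does not try to decode them.

```python
import re

# Patterns for the two register dumps printed by gmc_v8_0_process_interrupt,
# as quoted in this thread.
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+(0x[0-9a-fA-F]+)")
STATUS_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+(0x[0-9a-fA-F]+)")


def collect_faults(text):
    """Return (addr, status) integer pairs in the order they appear.

    Assumption: ADDR and STATUS lines come in matching pairs.
    """
    addrs = [int(value, 16) for value in ADDR_RE.findall(text)]
    statuses = [int(value, 16) for value in STATUS_RE.findall(text)]
    return list(zip(addrs, statuses))


# Sample taken from the report, with the mail line wrapping undone.
SAMPLE = """\
drm:pid13609:gmc_v8_0_process_interrupt *ERROR* GPU fault detected: 146 0x0420920c for process Xorg pid 63811 thread Xorg pid 349016
drm:pid13609:gmc_v8_0_process_interrupt *ERROR* VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010C284
drm:pid13609:gmc_v8_0_process_interrupt *ERROR* VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0809200C
"""

faults = collect_faults(SAMPLE)
for addr, status in faults:
    print(f"addr={addr:#010x} status={status:#010x}")
```

Running this over several captures would show whether only the address varies or the status register changes as well.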
[BUG] amdgpu GPU fault detected in VM context
Hello everyone!

I have been detecting a bug in the new DRM code corresponding to amdgpu using an RX580 (Polaris10). I am currently running -current

**OpenBSD 7.4 GENERIC.MP#1637 amd64**

And I'm seeing these errors that appear shortly after starting Xenocara.

drm:pid13609:gmc_v8_0_process_interrupt *ERROR* GPU fault detected: 146 0x0420920c for process Xorg pid 63811 thread Xorg pid 349016
drm:pid13609:gmc_v8_0_process_interrupt *ERROR* VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010C284
drm:pid13609:gmc_v8_0_process_interrupt *ERROR* VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0809200C

The addresses change every time the messages are logged.

I have been investigating and the error is not new; it has appeared before, as you can see in the following links:

https://www.mail-archive.com/misc@openbsd.org/msg179677.html
https://www.mail-archive.com/bugs@openbsd.org/msg17844.html

The behavior in all cases is similar. The bug appears, spams the dmesg and, from one moment to the next, breaks Xenocara and restarts the graphics server. I may be browsing the Internet, watching a video or simply moving a window when the error appears and triggers the Xenocara crash.

Checking further, I find that a possible solution is to downgrade the amdgpu firmware:

https://bugzilla.kernel.org/show_bug.cgi?id=201957

But I have checked older firmware packages going back as far as OpenBSD 7.1 (amdgpu-firmware-20211027.tgz) and the firmware for Polaris10 is the same in every case, so I doubt this will solve the dmesg spam and the Xenocara crash.

Complete dmesg attached.

--
* God in his heaven, all is well on Earth

OpenBSD 7.4-current (GENERIC.MP) #1633: Sat Jan 27 08:06:43 MST 2024
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 34228117504 (32642MB)
avail mem = 33169588224 (31632MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev.
2.8 @ 0xdd9fb000 (62 entries) bios0: vendor American Megatrends International, LLC. version "A.E0" date 06/27/2023 bios0: Micro-Star International Co., Ltd. MS-7C95 efi0 at bios0: UEFI 2.7 efi0: American Megatrends rev 0x50011 acpi0 at bios0: ACPI 6.2 acpi0: sleep states S0 S3 S4 S5 acpi0: tables DSDT FACP SSDT SSDT SSDT FIDT MCFG HPET IVRS FPDT VFCT BGRT TPM2 PCCT SSDT CRAT CDIT SSDT SSDT SSDT SSDT WSMT APIC SSDT SSDT SSDT acpi0: wakeup devices GP12(S4) GP13(S4) XHC0(S4) GP30(S4) GP31(S4) GPP0(S4) GPP8(S4) GPP1(S4) PTXH(S4) PT20(S4) PT24(S4) PT26(S4) PT27(S4) PT28(S4) PT29(S4) acpitimer0 at acpi0: 3579545 Hz, 32 bits acpimcfg0 at acpi0 acpimcfg0: addr 0xf000, bus 0-127 acpihpet0 at acpi0: 14318180 Hz acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: AMD Ryzen 7 3800X 8-Core Processor, 4200.01 MHz, 17-71-00, patch 08701030 cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,HWPSTATE,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,PQM,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA,UMIP,IBPB,STIBP,IBRS_PREF,IBRS_SM,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 512KB 64b/line 8-way L2 cache, 16MB 64b/line 16-way L3 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 8 var ranges, 88 fixed ranges cpu0: apic clock running at 100MHz cpu0: mwait min=64, max=64, C-substates=1.1, IBE cpu1 at mainbus0: apid 2 (application processor) cpu1: AMD Ryzen 7 3800X 8-Core Processor, 4200.00 MHz, 17-71-00, patch 08701030 cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,HTT,SSE3,PCLMUL,MWAIT,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,NXE,MMXX,FFXSR,PAGE1GB,RDTSCP,LONG,LAHF,CMPLEG,SVM,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,IBS,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,MWAITX,HWPSTATE,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,PQM,RDSEED,ADX,SMAP,CLFLUSHOPT,CLWB,SHA,UMIP,IBPB,STIBP,IBRS_PREF,IBRS_SM,SSBD,XSAVEOPT,XSAVEC,XGETBV1,XSAVES cpu1: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 512KB 64b/line 8-way L2 cache, 16MB 64b/line 16-way L3 cache cpu1: smt 0, core 1, package 0 cpu2 at mainbus0: apid 4 (application processor) cpu2: AMD Ryzen 7 3800X 8-Core Processor, 4200.00 MHz, 17-71-00, patch 08701030 cpu2:
chromium on -current #1637 crashes (perhaps graphic error?)
Hi,

after upgrading to snapshot #1637 and upgrading packages, I found that chromium cannot render properly.

I suspect this is a graphics driver issue, since I encounter the same problem on Alpine Linux (running the unstable mesa driver) as well.

Chromium generates kilobytes of logs and megabytes of ktrace output. The log just repeats:

Location of variable sk_FragColor conflicts with another variable.

OpenBSD 7.4-current (GENERIC.MP) #1637: Mon Jan 29 11:59:31 MST 2024
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 8401256448 (8012MB)
avail mem = 8125845504 (7749MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.7 @ 0xe (87 entries)
bios0: vendor Dell Inc. version "A25" date 05/30/2019
bios0: Dell Inc. OptiPlex 9020
efi0 at bios0: UEFI 2.3.1
efi0: American Megatrends rev 0x4028d
acpi0 at bios0: ACPI 5.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP APIC FPDT SLIC LPIT SSDT SSDT HPET SSDT MCFG SSDT ASF! SSDT BGRT DMAR TCPA
acpi0: wakeup devices UAR1(S3) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) GLAN(S4) EHC1(S3) EHC2(S3) XHC_(S4) HDEF(S4) PEG0(S4) PEGP(S4) PEG1(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, 3192.88 MHz, 06-3c-03, patch 0028 cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu0: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 64b/line 8-way L2 cache, 6MB 64b/line 12-way L3 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 99MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4, IBE cpu1 at mainbus0: apid 2 (application processor) cpu1: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, 3192.81 MHz, 06-3c-03, patch 0028 cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu1: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 64b/line 8-way L2 cache, 6MB 64b/line 12-way L3 cache cpu1: smt 0, core 1, package 0 cpu2 at mainbus0: apid 4 (application processor) cpu2: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, 3192.77 MHz, 06-3c-03, patch 0028 cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu2: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 64b/line 8-way L2 cache, 6MB 64b/line 12-way L3 cache cpu2: smt 0, core 2, package 0 cpu3 at mainbus0: apid 6 (application processor) cpu3: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, 3193.00 MHz, 06-3c-03, patch 0028 cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN cpu3: 32KB 64b/line 8-way D-cache, 32KB 64b/line 8-way I-cache, 256KB 64b/line 8-way L2 cache, 6MB 64b/line 12-way L3 cache cpu3: smt 0, core 3, package 0 ioapic0 at mainbus0: apid 8 pa 0xfec0, version 20, 24 pins acpihpet0 at acpi0: 14318179 Hz acpimcfg0 at acpi0 acpimcfg0: addr 0xf800, bus 0-63 acpiprt0 at acpi0: bus 0 (PCI0) acpiprt1 at acpi0: bus -1 (PEG0) acpiprt2 at acpi0: bus -1 (PEG1) acpiprt3 at acpi0: bus -1 (PEG2) acpiec0 at acpi0: not present acpipci0 at acpi0 PCI0: 0x 0x0011 0x0001 acpicmos0 at acpi0 com0 at acpi0 UAR1 addr 0x3f8/0x8 irq 4: ns16550a, 16 byte fifo acpibtn0 at acpi0: PWRB(wakeup) "PNP0C14" at acpi0 not configured tpm0 at acpi0 TPM_ 1.2 (TIS) addr
Re: TSO em(4) problem
On 30.1.2024. 13:33, Alexander Bluhm wrote:
> On Tue, Jan 30, 2024 at 12:07:08PM +0100, Hrvoje Popovski wrote:
>> On 30.1.2024. 9:27, Hrvoje Popovski wrote:
>>> I will prepare one box for this kind of traffic and will contact you and
>>> marcus
>>>
>>>> In theory when going through vlan interface it should remove
>>>> M_VLANTAG. But something must be wrong and I wonder what.
>>>>
>>>> bluhm
>>
>> Hi,
>>
>> I've managed to trigger watchdog in lab. It couldn't be possible without
>> bluhm@ information about ix vlan, thank you.
>
> Great, now we can debug the details.
>
> I have to know how ix and em are connected.
>
> Do you have any bridge or veb? Where are your vlan trunks?
> Any aggr, trunk, carp?

No, only vlan on ix0.

> Is my understanding of your setup correct?
>
> ix -> vlan -> forward -> em

Yes, and forwarding only, without pf. I'm sending traffic from a host connected to vlan/ix0 and forwarding it through em5 to another host. I'm sending 1Gbps of traffic with Cisco T-Rex.

> Can something more happen, like
>
> ix -> forward -> em
>

In the setup without vlan on ix I got only one watchdog at the beginning of testing and that's it. With vlan I'm getting around 6 or 7 watchdogs per minute, which means the link goes up/down 6 or 7 times per minute.

without vlan

smc4# netstat -sp tcp | grep TSO
        0 output TSO packets software chopped
        268 output TSO packets hardware processed
        0 output TSO packets generated
        0 output TSO packets dropped
smc4# netstat -sp tcp | grep LRO
        0 input LRO packets passed through pseudo device
        7666573 input LRO generated packets from hardware
        21667579 input LRO coalesced packets by network device
        0 input bad LRO packets dropped
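To compare the TSO/LRO counters between test runs (with and without vlan), the `netstat -sp tcp` lines can be turned into a dictionary. This is a minimal sketch, not a tool used in this thread; the function name `parse_offload_counters` is hypothetical, and it assumes one "number description" counter per line in the format shown above.

```python
import re

# One counter per line: "<number> <input|output> ... <LRO|TSO> ...".
COUNTER_RE = re.compile(
    r"^\s*(\d+)\s+((?:input|output)\b.*\b(?:LRO|TSO)\b.*?)\s*$",
    re.MULTILINE,
)


def parse_offload_counters(text):
    """Map each TSO/LRO counter description to its integer value."""
    return {name: int(value) for value, name in COUNTER_RE.findall(text)}


# Sample copied from the "without vlan" run in this message.
SAMPLE = """\
        0 output TSO packets software chopped
        268 output TSO packets hardware processed
        0 output TSO packets generated
        0 output TSO packets dropped
        0 input LRO packets passed through pseudo device
        7666573 input LRO generated packets from hardware
        21667579 input LRO coalesced packets by network device
        0 input bad LRO packets dropped
"""

counters = parse_offload_counters(SAMPLE)
for name, value in counters.items():
    print(f"{value:>10}  {name}")
```

Taking one snapshot before and one after a test run and subtracting the values per key would give the number of packets handled during that run.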
kernel panic, PCe APU3, unknown trap in user mode, nodejs?
Hi,

I found one of my PC Engines APU3 boxes in a kernel panic.

What changed recently on this machine: I started node.js on it about two days ago.

- it runs a local installation of Zigbee2MQTT
- nodejs / Zigbee2MQTT opens /dev/cuaU0
- cuaU0 is an ITEAD SONOFF Zigbee 3.0 USB Dongle Plus V2

This is a very new workload on this machine. I never had kernel panics on it before, but the machine was never really busy with anything, as it's mainly for experiments.

I have it at the ddb prompt, if anyone is interested in seeing something more there. I will probably reboot tomorrow or so...

...
starting local daemons: cron.
Mon Jan 29 19:46:28 UTC 2024

OpenBSD/amd64 (pce-3967.home.local) (tty00)

login: unknown trap 763636304 in user mode
trap type 34043 code 4 rip a0bce1e2a cs 2b rflags 10206 cr2 cb5a038 cpl 0 rsp cbb524cc0
uvm_fault(0xfd81135c2858, 0x84fb, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at trap_print+0xed: leave
TID PID UID PRFLAGS PFLAGS CPU COMMAND
478162 61157 32767 0x1823 0x4002 node
*477361 61157 32767 0x1823 0x4003 node
291026 61157 32767 0x1823 0x4001 node
8163851d(uvm_fault(0xfd81135c2858, 0x84f3, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at db_read_bytes+0x43: movq 0(%rdi),%rax
TID PID UID PRFLAGS PFLAGS CPU COMMAND
478162 61157 32767 0x1823 0x4002 node
*477361 61157 32767 0x1823 0x4003 node
291026 61157 32767 0x1823 0x4001 node
db_read_bytes(84f3,8,80002d8427c8) at db_read_bytes+0x43
db_get_value(84f3,8,0) at db_get_value+0x43
db_stack_trace_print(8163851d,0,e,82130d25,817ce280) at db_stack_trace_print+0x2dd
db_trap(6,0) at db_trap+0xef
db_ktrap(6,0,80002d8429b0) at db_ktrap+0x111
kerntrap(80002d8429b0) at kerntrap+0xa7
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
end of kernel
end trace frame: 0x84fb, count: 8
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports. Insufficient info makes it difficult to find and fix bugs.
ddb{3}> set $maxwidth = 0
ddb{3}> set $lines = 0
ddb{3}> x/s version
version: OpenBSD 7.4-current (GENERIC.MP) #1633: Sat Jan 27 08:06:43 MST 2024\012 dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP\012
ddb{3}> show panic
*cpu3: uvm_fault(0xfd81135c2858, 0x84f3, 0, 1) -> e
ddb{3}> trace
db_read_bytes(84f3,8,80002d8427c8) at db_read_bytes+0x43
db_get_value(84f3,8,0) at db_get_value+0x43
db_stack_trace_print(8163851d,0,e,82130d25,817ce280) at db_stack_trace_print+0x2dd
db_trap(6,0) at db_trap+0xef
db_ktrap(6,0,80002d8429b0) at db_ktrap+0x111
kerntrap(80002d8429b0) at kerntrap+0xa7
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
end of kernel
end trace frame: 0x84fb, count: -7
ddb{3}> machine cpuinfo
  0: stopped
  1: stopped
  2: stopped
* 3: ddb
ddb{3}> show registers
rdi     0x84f3 __ALIGN_SIZE+0x74f3
rsi     0x8
rbp     0x80002d8427b0
rbx     0xe
rdx     0x80002d8427c8
rcx     0x4f
rax     0x84fb __ALIGN_SIZE+0x74fb
r8      0x80002d8427e0
r9      0
r10     0x60437e0337a9e3d2
r11     0xcc55262893bf16eb
r12     0x8
r13     0
r14     0x8
r15     0
rip     0x812ef833 db_read_bytes+0x43
cs      0x8
rflags  0x10246 __ALIGN_SIZE+0xf246
rsp     0x80002d842790
ss      0
db_read_bytes+0x43: movq 0(%rdi),%rax
ddb{3}> show proc
PROC (node) tid=477361 pid=61157 tcnt=11 stat=onproc
    flags process=1823 proc=400
    runpri=32, usrpri=51, slppri=32, nice=20
    wchan=0x0, wmesg=, ps_single=0x0
    forw=0x, list=0x80002d6fbaa0,0x80002d74dad0
    process=0x8000fffed940 user=0x80002d83d000, vmspace=0xfd81135c2858
    estcpu=1, cpticks=5, pctcpu=0.9, user=1, sys=2, intr=0
ddb{3}> ps
PID TID PPID UID S FLAGS WAIT COMMAND
48956 20103 1 0 3 0x18100083 ttyin getty
6441 46980 1 0 3 0x18100098 kqread cron
40773 359395 35505 32767 3 0x18100083 kqread tail
35505 66509 26482 32767 3 0x810008b sigsusp sh
61157 134400 26482 32767 3 0x18200083 kqread node
61157 182681 26482 32767 3 0x1c200083 kqread node
61157 478162 26482 32767 7 0x1c23 node
61157 277278 26482 32767 3 0x1c200083 fsleep node
*61157 477361 26482 32767 7 0x1c23 node
61157 291026 26482 32767 7
Re: TSO em(4) problem
On Tue, Jan 30, 2024 at 12:07:08PM +0100, Hrvoje Popovski wrote:
> On 30.1.2024. 9:27, Hrvoje Popovski wrote:
> > I will prepare one box for this kind of traffic and will contact you and
> > marcus
> >
> >> In theory when going through vlan interface it should remove
> >> M_VLANTAG. But something must be wrong and I wonder what.
> >>
> >> bluhm
>
> Hi,
>
> I've managed to trigger watchdog in lab. It couldn't be possible without
> bluhm@ information about ix vlan, thank you.

Great, now we can debug the details.

I have to know how ix and em are connected.

Do you have any bridge or veb? Where are your vlan trunks?
Any aggr, trunk, carp?

Is my understanding of your setup correct?

ix -> vlan -> forward -> em

Can something more happen, like

ix -> forward -> em

bluhm

> Jan 30 12:01:09 smc4 /bsd: em5: watchdog: head 123 tail 187 TDH 187 TDT 123
> Jan 30 12:01:18 smc4 /bsd: em5: watchdog: head 243 tail 307 TDH 307 TDT 243
> Jan 30 12:01:28 smc4 /bsd: em5: watchdog: head 463 tail 15 TDH 15 TDT 463
> Jan 30 12:01:37 smc4 /bsd: em5: watchdog: head 413 tail 477 TDH 477 TDT 413
> Jan 30 12:01:46 smc4 /bsd: em5: watchdog: head 195 tail 259 TDH 259 TDT 195
> Jan 30 12:01:55 smc4 /bsd: em5: watchdog: head 259 tail 323 TDH 323 TDT 259
> Jan 30 12:02:05 smc4 /bsd: em5: watchdog: head 333 tail 397 TDH 397 TDT 333
> Jan 30 12:02:14 smc4 /bsd: em5: watchdog: head 33 tail 97 TDH 97 TDT 33
> Jan 30 12:02:24 smc4 /bsd: em5: watchdog: head 459 tail 11 TDH 11 TDT 459
> Jan 30 12:02:33 smc4 /bsd: em5: watchdog: head 447 tail 511 TDH 511 TDT 447
>
>
> em0 at pci7 dev 0 function 0 "Intel 82576" rev 0x01: msi, address
> 00:1b:21:61:8a:94
> em1 at pci7 dev 0 function 1 "Intel 82576" rev 0x01: msi, address
> 00:1b:21:61:8a:95
> em2 at pci8 dev 0 function 0 "Intel I210" rev 0x03: msi, address
> 00:25:90:5d:c9:98
> em3 at pci9 dev 0 function 0 "Intel I210" rev 0x03: msi, address
> 00:25:90:5d:c9:99
> em4 at pci12 dev 0 function 0 "Intel I350" rev 0x01: msi, address
> 00:25:90:5d:c9:9a
> em5 at pci12 dev 0 function 1 "Intel I350" rev 0x01: msi, address
> 00:25:90:5d:c9:9b
> em6 at pci12 dev 0 function 2 "Intel I350" rev 0x01: msi, address
> 00:25:90:5d:c9:9c
> em7 at pci12 dev 0 function 3 "Intel I350" rev 0x01: msi, address
> 00:25:90:5d:c9:9d
>
>
> smc4# netstat -sp tcp | grep LRO
>         0 input LRO packets passed through pseudo device
>         4696315 input LRO generated packets from hardware
>         13205047 input LRO coalesced packets by network device
>         0 input bad LRO packets dropped
> smc4# netstat -sp tcp | grep TSO
>         0 output TSO packets software chopped
>         3672 output TSO packets hardware processed
>         0 output TSO packets generated
>         0 output TSO packets dropped
>
>
>
>
> smc4# ifconfig em5 hwfeatures
> em5: flags=8c43 mtu 1500
>
> hwfeatures=31b7
> hardmtu 9216
>         lladdr 00:25:90:5d:c9:9b
>         index 8 priority 0 llprio 3
>         media: Ethernet autoselect (1000baseT
> full-duplex,master,rxpause,txpause)
>         status: active
>         inet 192.168.20.1 netmask 0xff00 broadcast 192.168.20.255
>
pfsync in 7.4 generating much larger amount of traffic than 7.3
Hello,

I've been trying to track down what exactly is causing such a large increase in traffic on the pfsync(4) interfaces of several firewall pairs I've upgraded to 7.4. All have recent errata applied.

custedge1$ doas syspatch -l
002_msplit
003_patch
004_ospfd
005_tmux
006_httpd
007_perl
008_vmm
009_pf
011_ssh

I've seen both CPU graphs and network graphs jump greatly since the upgrades, by about 3-4x in some cases. I've included some data in-line, but because these are production systems, I sent the pcap data and pfctl -ss -vv output directly to dlg.

The issue wasn't noticed immediately, and even now for the most part everything is performing 'OK', in that resources behind these firewalls are loading fine and latency is low.

This has happened on a few virtualized fw pairs running on proxmox nodes with virtio-backed vio(4) interfaces that tend to be laid out like so:

- one vio for 'outside'
- one vio for 'inside'
- one vio for pfsync (also used for ssh, to push pf conf changes to the secondary)

All generally have lots of vlan/carp interfaces.

However, I also have a physical pair of firewalls that use em(4), linked directly to each other with a single cable, and the pfsync there is showing the same symptoms.
Here is bwm-ng output from an affected host. The pfsync interface is moving data at a higher rate than every other interface except its syncdev, vio1 (this is a quieter time of day):

bwm-ng v0.6.3 (probing every 0.500s), press 'h' for help
input: getifaddrs type: rate
     iface              Rx              Tx           Total
  ========================================================
       lo0:       0.00 b/s        0.00 b/s        0.00 b/s
      vio0:    630.04 kb/s       1.13 Mb/s       1.76 Mb/s
      vio1:     27.78 Mb/s      28.12 Mb/s      55.90 Mb/s
      vio2:      1.47 Mb/s     199.67 kb/s       1.67 Mb/s
  carp2100:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp600:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp601:      1.10 kb/s       1.10 kb/s       2.19 kb/s
   carp602:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp605:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp607:     16.05 kb/s       1.10 kb/s      17.14 kb/s
   carp612:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp614:       0.00 b/s       1.10 kb/s       1.10 kb/s
   carp615:       0.00 b/s       1.10 kb/s       1.10 kb/s
    carp99:      1.06 Mb/s       1.10 kb/s       1.06 Mb/s
   pfsync0:     25.35 Mb/s      27.17 Mb/s      52.52 Mb/s
  vlan2100:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan600:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan601:      1.10 kb/s       1.10 kb/s       2.19 kb/s
   vlan602:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan605:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan607:     16.05 kb/s      82.25 kb/s      98.30 kb/s
   vlan612:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan614:       0.00 b/s       1.10 kb/s       1.10 kb/s
   vlan615:       0.00 b/s       1.10 kb/s       1.10 kb/s
    vlan99:      1.06 Mb/s     102.76 kb/s       1.16 Mb/s
       lo1:       0.00 b/s        0.00 b/s        0.00 b/s
    pflog0:       0.00 b/s       3.32 kb/s       3.32 kb/s
   vlan616:    222.81 kb/s       1.12 Mb/s       1.35 Mb/s
   carp616:    192.00 kb/s       1.10 kb/s     193.10 kb/s
  --------------------------------------------------------
     total:     57.80 Mb/s      57.95 Mb/s     115.75 Mb/s

custedge1$ ifconfig vio1
vio1: flags=8843 mtu 1500
        lladdr 82:7b:8a:de:50:6e
        index 2 priority 0 llprio 3
        media: Ethernet autoselect
        status: active
        inet 10.100.1.5 netmask 0xfffc broadcast 10.100.1.7
custedge1$ ifconfig pfsync0
pfsync0: flags=41 mtu 1500
        index 18 priority 0 llprio 3
        encap: parent vio1
        pfsync: syncdev: vio1 syncpeer: 10.100.1.6 maxupd: 128 defer: off
        groups: carp pfsync

I've verified maxupd is the same on both sides. I recently switched to using syncpeer to see if that would lessen the load; it does appear to have helped, but I suspect that over time the number of duplicates (maybe?) increases linearly. It seemed like bringing the pfsync interface down and then back up 'reset' the issue for a while, at least to a much lower rate than what I saw the other day (>100mbit.. on
Re: TSO em(4) problem
On 30.1.2024. 9:27, Hrvoje Popovski wrote:
> I will prepare one box for this kind of traffic and will contact you and
> marcus
>
>> In theory when going through vlan interface it should remove
>> M_VLANTAG. But something must be wrong and I wonder what.
>>
>> bluhm

Hi,

I've managed to trigger the watchdog in the lab. It wouldn't have been possible without bluhm@'s information about ix vlan, thank you.

Jan 30 12:01:09 smc4 /bsd: em5: watchdog: head 123 tail 187 TDH 187 TDT 123
Jan 30 12:01:18 smc4 /bsd: em5: watchdog: head 243 tail 307 TDH 307 TDT 243
Jan 30 12:01:28 smc4 /bsd: em5: watchdog: head 463 tail 15 TDH 15 TDT 463
Jan 30 12:01:37 smc4 /bsd: em5: watchdog: head 413 tail 477 TDH 477 TDT 413
Jan 30 12:01:46 smc4 /bsd: em5: watchdog: head 195 tail 259 TDH 259 TDT 195
Jan 30 12:01:55 smc4 /bsd: em5: watchdog: head 259 tail 323 TDH 323 TDT 259
Jan 30 12:02:05 smc4 /bsd: em5: watchdog: head 333 tail 397 TDH 397 TDT 333
Jan 30 12:02:14 smc4 /bsd: em5: watchdog: head 33 tail 97 TDH 97 TDT 33
Jan 30 12:02:24 smc4 /bsd: em5: watchdog: head 459 tail 11 TDH 11 TDT 459
Jan 30 12:02:33 smc4 /bsd: em5: watchdog: head 447 tail 511 TDH 511 TDT 447

em0 at pci7 dev 0 function 0 "Intel 82576" rev 0x01: msi, address 00:1b:21:61:8a:94
em1 at pci7 dev 0 function 1 "Intel 82576" rev 0x01: msi, address 00:1b:21:61:8a:95
em2 at pci8 dev 0 function 0 "Intel I210" rev 0x03: msi, address 00:25:90:5d:c9:98
em3 at pci9 dev 0 function 0 "Intel I210" rev 0x03: msi, address 00:25:90:5d:c9:99
em4 at pci12 dev 0 function 0 "Intel I350" rev 0x01: msi, address 00:25:90:5d:c9:9a
em5 at pci12 dev 0 function 1 "Intel I350" rev 0x01: msi, address 00:25:90:5d:c9:9b
em6 at pci12 dev 0 function 2 "Intel I350" rev 0x01: msi, address 00:25:90:5d:c9:9c
em7 at pci12 dev 0 function 3 "Intel I350" rev 0x01: msi, address 00:25:90:5d:c9:9d

smc4# netstat -sp tcp | grep LRO
        0 input LRO packets passed through pseudo device
        4696315 input LRO generated packets from hardware
        13205047 input LRO coalesced packets by network device
        0 input bad LRO packets dropped
smc4# netstat -sp tcp | grep TSO
        0 output TSO packets software chopped
        3672 output TSO packets hardware processed
        0 output TSO packets generated
        0 output TSO packets dropped

smc4# ifconfig em5 hwfeatures
em5: flags=8c43 mtu 1500
        hwfeatures=31b7 hardmtu 9216
        lladdr 00:25:90:5d:c9:9b
        index 8 priority 0 llprio 3
        media: Ethernet autoselect (1000baseT full-duplex,master,rxpause,txpause)
        status: active
        inet 192.168.20.1 netmask 0xff00 broadcast 192.168.20.255
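In the watchdog lines above, `head` always equals `TDT` and `tail` always equals `TDH`, and the distance between the two looks constant. The sketch below checks that by computing `(tail - head)` modulo the ring size for each log line. It is an illustration only: the name `head_tail_gap` is hypothetical, the ring size of 512 descriptors is an assumption (the thread does not state it; the largest index seen is 511), and no claim is made about what this gap means inside the driver.

```python
import re

WATCHDOG_RE = re.compile(
    r"(\w+): watchdog: head (\d+) tail (\d+) TDH (\d+) TDT (\d+)")

# Assumption: 512 TX descriptors per ring; not stated anywhere in the thread.
RING_SIZE = 512


def head_tail_gap(line, ring_size=RING_SIZE):
    """Return (interface, (tail - head) mod ring_size), or None if no match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    head, tail = int(match.group(2)), int(match.group(3))
    return match.group(1), (tail - head) % ring_size


# Three of the lines logged on smc4, including two that wrap around the ring.
SAMPLE = [
    "Jan 30 12:01:09 smc4 /bsd: em5: watchdog: head 123 tail 187 TDH 187 TDT 123",
    "Jan 30 12:01:28 smc4 /bsd: em5: watchdog: head 463 tail 15 TDH 15 TDT 463",
    "Jan 30 12:02:33 smc4 /bsd: em5: watchdog: head 447 tail 511 TDH 511 TDT 447",
]

gaps = [head_tail_gap(line) for line in SAMPLE]
for iface, gap in gaps:
    print(f"{iface}: tail - head = {gap}")
```

Under the 512-entry assumption every sample gives the same value, 64, which may be a useful detail when comparing against the driver's transmit path.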
Re: TSO em(4) problem
On 29.1.2024. 15:29, Alexander Bluhm wrote:
> On Sat, Jan 27, 2024 at 08:08:35AM +0100, Hrvoje Popovski wrote:
>> On 26.1.2024. 22:47, Alexander Bluhm wrote:
>>> On Fri, Jan 26, 2024 at 11:41:49AM +0100, Hrvoje Popovski wrote:
>>>> I've managed to reproduce the TSO em problem on another setup,
>>>> unfortunately production.
>>> What helped debugging a similar issue with ixl(4) and TSO was to
>>> remove all TSO specific code from the driver. Then only this part
>>> remains from the original em(4) TSO diff.
>>>
>>>     error = bus_dmamap_create(sc->sc_dmat, EM_TSO_SIZE,
>>>         EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
>>>         EM_TSO_SEG_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
>>>
>>> The parameters that changed when adding TSO are:
>>>
>>> bus_size_t size:     MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SIZE 65535
>>> bus_size_t maxsegsz: MAX_JUMBO_FRAME_SIZE 16128 -> EM_TSO_SEG_SIZE 4096
>>>
>>> I suspect that this is the cause for the regression as disabling
>>> TSO did not help. Would it be possible to run the diff below? I
>>> expect that the problem will still be there. But then we know it
>>> must be the change of one of the bus_dmamap_create() arguments.
>>>
>>> bluhm
>>
>> Hi,
>>
>> with this diff em0 seems happy and em watchdog is gone.
>
> This is very interesting. That means that the bus_dmamap_create()
> argument does not cause the regression.
>
> Did you see anywhere "output TSO packets hardware processed" in
> netstat -s. In some iteration of testing you turned TSO off with
> sysctl net.inet.tcp.tso=0, but it did not help. So no TSO packets
> from the stack.
>
> In another mail you mentioned
>
>> Setup is very simple
>> em0 - carp <- uplink
>> em1 - pfsync
>> ix1 - vlans - carp
>
> ix supports LRO. If you forward from ix1 to em0 the LRO packets
> from ix hardware are split by TSO on em hardware. And the ix does
> vlan offloading + LRO, so em must do vlan offloading properly with
> TSO. Or do you use a vlan interface?
>
> Does it help to disable LRO, ifconfig ix1 -tcplro ?

Yes, it helps... Thank you

uplink
em0: flags=8b43 mtu 1500
        hwfeatures=31b7 hardmtu 9216
        lladdr 0c:c4:7a:da:cd:5a
        index 3 priority 0 llprio 3
        groups: egress
        media: Ethernet autoselect (1000baseT full-duplex,master,rxpause)
        status: active

vlans are on ix1 - I've disabled LRO
ix1: flags=8b43 mtu 1500
        lladdr 90:e2:ba:d7:1b:f5
        index 2 priority 0 llprio 3
        media: Ethernet autoselect (10GbaseSR full-duplex,rxpause,txpause)
        status: active

Before I disabled LRO on ix1 I got a lot of watchdogs on em0:

bcbnfw1# uptime
 9:25AM up 8 mins, 1 user, load averages: 0.14, 0.13, 0.06

bcbnfw1# cat /var/log/messages| grep watchdog
Jan 30 09:18:51 bcbnfw1 /bsd: em0: watchdog: head 148 tail 213 TDH 213 TDT 148
Jan 30 09:19:01 bcbnfw1 /bsd: em0: watchdog: head 160 tail 224 TDH 224 TDT 160
Jan 30 09:19:12 bcbnfw1 /bsd: em0: watchdog: head 163 tail 228 TDH 228 TDT 163
Jan 30 09:19:22 bcbnfw1 /bsd: em0: watchdog: head 128 tail 192 TDH 192 TDT 128
Jan 30 09:19:32 bcbnfw1 /bsd: em0: watchdog: head 309 tail 373 TDH 373 TDT 309
Jan 30 09:19:41 bcbnfw1 /bsd: em0: watchdog: head 113 tail 177 TDH 177 TDT 113
Jan 30 09:19:51 bcbnfw1 /bsd: em0: watchdog: head 402 tail 466 TDH 466 TDT 402
Jan 30 09:20:01 bcbnfw1 /bsd: em0: watchdog: head 114 tail 178 TDH 178 TDT 114
Jan 30 09:20:16 bcbnfw1 /bsd: em0: watchdog: head 111 tail 175 TDH 175 TDT 111
Jan 30 09:20:26 bcbnfw1 /bsd: em0: watchdog: head 199 tail 263 TDH 263 TDT 199

Without LRO on ix1 everything seems to work just fine ...

>
> I see this vlan code with mac_type checks. Can we end in a
> configuration where we enable TSO but cannot do VLAN offloading?
>
> #if NVLAN > 0
>         /* Find out if we are in VLAN mode */
>         if (m->m_flags & M_VLANTAG && (sc->hw.mac_type < em_82575 ||
>             sc->hw.mac_type > em_i210)) {
>                 /* Set the VLAN id */
>                 desc->upper.fields.special = htole16(m->m_pkthdr.ether_vtag);
>
>                 /* Tell hardware to add tag */
>                 desc->lower.data |= htole32(E1000_TXD_CMD_VLE);
>         }
> #endif
>
> Hrvoje, I know you do great tests in your lab. Did you try this
> setup:
>
> Send bulk TCP traffic in vlan that will trigger LRO.
> Do VLAN + LRO offloading in ix.
> Forward it to em with TSO.
>

I will prepare one box for this kind of traffic and will contact you and marcus

> In theory when going through vlan interface it should remove
> M_VLANTAG. But something must be wrong and I wonder what.
>
> bluhm
>
Re: OpenBSD 7.4/amd64 on APU4D4 - kernel panic
> "uhidev" not "unidev", There was a typo in my previous email, but there is a correct entry in /etc/bsd.re-config. I removed "disable upd" from /etc/bsd.re-config and now my UPS is handled by ugen. ugen0 at uhub0 port 4 "American Power Conversion Smart-UPS 2200 FW:UPS 09.3 / ID=18" rev 2.00/1.06 addr 2 # cat /etc/bsd.re-config disable uhidev Thanks! On Sun, 28 Jan 2024 18:18:04 + Stuart Henderson wrote: > On 2024/01/28 15:14, Radek wrote: > > In this case UPS is handled by uhid but I need it to be handeled by ugen. > > How can I make it work without rebuilding the kernel? > > Disabling unidev in /etc/bsd.re-config doesn't work in this case. > > "uhidev" not "unidev", > > > On Sat, 27 Jan 2024 18:48:48 + > > Stuart Henderson wrote: > > > > > On 2024/01/27 10:36, Radek wrote: > > > > but it doesn't work for another APC UPS hardware > > > > > > > uhidev0 at uhub0 port 3 configuration 1 interface 0 "American Power > > > > Conversion Smart-UPS 2200 FW:UPS 09.3 / ID=18" rev 2.00/1.06 addr 2 > > > > > > ... > > > > > > > > > On 2024-01-19 13:21, Stuart Henderson wrote: > > > > > > > ...actually, maybe it needs to be "disable uhidev". > > > > > > ^^ > > > > > > > > > Radek > > > Radek