4.6/4.7 crashes during boot while probing serial on DELL Poweredge 750 + Sunix PCI 4 port serial, and a fix!

Martin van den Nieuwelaar Tue, 06 Jul 2010 17:22:06 -0700

Hi people,

This is my first OpenBSD bug report; hopefully I have included enough of 
the right info!


We have a number of Dell PowerEdge 750 rack mount servers.  We have 
successfully used Sunix 2 port PCI serial cards in these machines.  
Swapping the 2 port cards for 4 port cards, however, results in OpenBSD 
crashing during boot, and I am told this happens on more than one of 
these machines.  We have tried the 4 port cards in other brands/models 
of machine running OpenBSD and they work fine, so it appears to be the 
combination of hardware that is causing the issue.  Both OpenBSD 4.6 
and 4.7 show the problem.  All my output below is taken from 4.7.

I took a photo of the screen during a typical crash (attached).  In case 
for some reason the attachment doesn't come through I'll type part of it 
out:

com4 at puc0 port 1 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
com4: probed fifo depth: 32 bytes
com5 at puc0 port 2 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
Stopped at      bus_space_read_1+0x13:   movzbl   %al,%eax
bus_space_read_1(0,dc80,5,cd) at bus_space_read_1+0x13
com_fifo_probe(d1b60000,dc80,3,0) at com_fifo_probe+0xc0
...
ddb>

When I type 'show panic' (following report guidelines) I get 'kernel did 
not panic' back.  'ps' only shows the swapper - presumably we're still 
too early in the boot process to see anything else.  I found that if I 
typed 'cont' a few times the system would actually boot.

I obtained the kernel source and compiled and installed it.  It produces 
the output you see above.  I then located the call to com_fifo_probe() 
which is in dev/ic/com.c.  Inside that routine I put in some printf() 
statements and traced the crash point to the following code:

        for (len = 0; len < 256; len++) {
                bus_space_write_1(iot, ioh, com_data, (len + 1));
                timo = 2000;
                while (!ISSET(bus_space_read_1(iot, ioh, com_lsr),
                    LSR_TXRDY) && --timo)
                        delay(1);
                if (!timo)
                        break;
        }

While the crash occurs inside bus_space_read_1() I was not convinced 
that routine was necessarily at fault - it certainly gets called from 
lots of different routines in the kernel.  com_fifo_probe() on the other 
hand is only called when probing the FIFO for the serial card.  In any 
case here is the object code for bus_space_read_1():

00000000 <bus_space_read_1>:
bus_space_read_1():
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   8b 4d 08                mov    0x8(%ebp),%ecx
   6:   85 c9                   test   %ecx,%ecx
   8:   8b 55 0c                mov    0xc(%ebp),%edx
   b:   8b 45 10                mov    0x10(%ebp),%eax
   e:   75 08                   jne    18 <bus_space_read_1+0x18>
  10:   01 c2                   add    %eax,%edx
  12:   ec                      in     (%dx),%al
  13:   0f b6 c0                movzbl %al,%eax    <<<----- crash point
  16:   c9                      leave 
  17:   c3                      ret   
  18:   8a 04 10                mov    (%eax,%edx,1),%al
  1b:   eb f6                   jmp    13 <bus_space_read_1+0x13>
  1d:   8d 76 00                lea    0x0(%esi),%esi

I don't know x86 assembler, so cannot comment on whether this looks 
right or wrong.

Going back to com_fifo_probe() I didn't like the look of the delay(1) 
and on a hunch that this is somehow a timing issue I increased the delay 
to 1000.  This resulted in a successful boot.  Yippee!  I then brought 
the delay down until the problem showed up again.  Here's my new code:

        for (len = 0; len < 256; len++) {
                bus_space_write_1(iot, ioh, com_data, (len + 1));
                timo = 2000;
                while (!ISSET(bus_space_read_1(iot, ioh, com_lsr),
                    LSR_TXRDY) && --timo)
                        /* MJV: 4, 8, 16 not enough; 32, 128, 512, 1000 are */
                        delay(32);
                if (!timo)
                        break;
        }

While delay(1) consistently crashes, larger values are progressively 
'more reliable'.  delay(4) often crashes, and delay(16) usually boots 
fine, although it eventually crashed after perhaps 20 boots (I set 
crontab to reboot on a regular basis so I could test its reliability).  
So now I'm using delay(32), and for now that seems to be 'enough'.  Here 
is the dmesg from a successful boot:

# dmesg                                                                
OpenBSD 4.7 (GENERIC) #22: Wed Jul  7 23:19:50 NZST 2010
    [email protected]:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Intel(R) Pentium(R) 4 CPU 2.80GHz ("GenuineIntel" 686-class) 2.81 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,SSE3,MWAIT,DS-CPL,CNXT-ID,xTPR
real mem  = 1073053696 (1023MB)
avail mem = 1030983680 (983MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 08/03/04, BIOS32 rev. 0 @ 0xffe90, SMBIOS rev. 2.3 @ 0xfb030 (83 entries)
bios0: vendor Dell Computer Corporation version "A02" date 08/03/2004
bios0: Dell Computer Corporation PowerEdge 750
acpi0 at bios0: rev 0
acpi0: tables DSDT FACP APIC SPCR
acpi0: wakeup devices PCI0(S5) PCI1(S5) PCI2(S5) PCI3(S5)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: apic clock running at 200MHz
ioapic0 at mainbus0: apid 1 pa 0xfec00000, version 20, 24 pins
ioapic0: misconfigured as apic 0, remapped to apid 1
ioapic1 at mainbus0: apid 2 pa 0xfec10000, version 20, 24 pins
ioapic1: misconfigured as apic 0, remapped to apid 2
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 3 (PCI1)
acpiprt2 at acpi0: bus 2 (PCI2)
acpiprt3 at acpi0: bus 1 (PCI3)
acpicpu0 at acpi0
bios0: ROM list: 0xc0000/0x8000 0xc8000/0x1000 0xec000/0x4000!
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pchb0 at pci0 dev 0 function 0 "Intel 82875P Host" rev 0x02
ppb0 at pci0 dev 3 function 0 "Intel 82875P CSA" rev 0x02
pci1 at ppb0 bus 1
em0 at pci1 dev 1 function 0 "Intel PRO/1000CT (82547GI)" rev 0x00: apic 1 int 18 (irq 10), address 00:0f:1f:f7:62:b0
ppb1 at pci0 dev 28 function 0 "Intel 6300ESB PCIX" rev 0x02
pci2 at ppb1 bus 2
ppb2 at pci0 dev 30 function 0 "Intel 82801BA Hub-to-PCI" rev 0x0a
pci3 at ppb2 bus 3
em1 at pci3 dev 2 function 0 "Intel PRO/1000MT (82541GI)" rev 0x00: apic 1 int 21 (irq 7), address 00:0f:1f:f7:62:b1
puc0 at pci3 dev 3 function 0 "Sunix 40XX" rev 0x01: ports: 4 com
com3 at puc0 port 0 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
com3: probed fifo depth: 32 bytes
com4 at puc0 port 1 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
com4: probed fifo depth: 32 bytes
com5 at puc0 port 2 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
com5: probed fifo depth: 32 bytes
com6 at puc0 port 3 apic 1 int 22 (irq 6): ti16750, 64 byte fifo
com6: probed fifo depth: 32 bytes
vga1 at pci3 dev 14 function 0 "ATI Rage XL" rev 0x27
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
ichpcib0 at pci0 dev 31 function 0 "Intel 6300ESB LPC" rev 0x02
pciide0 at pci0 dev 31 function 2 "Intel 6300ESB SATA" rev 0x02: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
atapiscsi0 at pciide0 channel 0 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: <SAMSUNG, CD-ROM SN-124, N104> ATAPI 5/cdrom removable
cd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
wd0 at pciide0 channel 1 drive 0: <ST340014AS>
wd0: 16-sector PIO, LBA48, 38146MB, 78125000 sectors
wd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 6
ichiic0 at pci0 dev 31 function 3 "Intel 6300ESB SMBus" rev 0x02: SMBus disabled
isa0 at ichpcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pmsi0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pmsi0 mux 0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: <PC speaker>
spkr0 at pcppi0
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
mtrr: Pentium Pro MTRR support
vscsi0 at root
scsibus1 at vscsi0: 256 targets
softraid0 at root
root on wd0a swap on wd0b dump on wd0b
#

I should say at this point that, with this change, the machine boots 
reliably and the serial ports also seem to work reliably.  My fix, 
while it stops the machine from crashing during boot, may well not be 
the 'right' fix.  I have little to no idea what is supposed to be 
happening at such a low level on the bus, nor do I know what sort of 
timings are involved and what 'should' work.  It's possible that 
out-of-spec hardware is to blame, but I think OpenBSD should not crash 
even if the hardware is 'slightly' out of spec.

It would be great if someone who is familiar with this part of the 
kernel would have a look and get to the real bottom of this issue.  The 
difficulty here is that I doubt anyone else will be able to experience 
the problem without having this hardware combination.

Oh, I did spend a few hours looking for people who have had similar 
problems, but didn't find a single match.

Regards,

-Martin


-- 
R A Ward Ltd. | We take the privacy of our customers seriously.
Christchurch  | All sensitive E-Mail attachments MUST be encrypted.
New Zealand

[demime 1.01d removed an attachment of type image/jpeg which had a name of 
openbsd_crash.JPG]
