rivanwang wrote:
Hi,
I have one workstation (HP xw4300) with Solaris 10 (x86) and one Digi Sync570i
card.
The system may hang at any time, from a few minutes to a couple of hours in,
while the card is receiving data frames.
I suspect the hang is caused by the driver module for the Sync570; however,
the same driver works properly on a Solaris 8 system.
We used to install Solaris 8 on the HP xw4100, but now we have to install
Solaris 10 on the HP xw4300 (we can't get the HP xw4100 on the market any more).
I boot the Solaris system under kmdb. After the system hangs I can't ping the
host, and the keyboard and mouse do not respond.
I can get a crash dump file by pressing "F1+A" and then entering "$<systemdump".
Analysing the crash dump file, I can't find problems such as 'mutex
deadlock' or 'bad trap'.
I really don't know what to do next!
# The crash dump files can be downloaded from the following URLs:
# www.ras.com.cn/rivanwang/crash_4.tar.gz
# www.ras.com.cn/rivanwang/crash_8.tar.gz
# www.ras.com.cn/rivanwang/crash_7_nor.tar.gz
# "crash_7_nor.tar.gz" was generated before the system hang happened.
Yes, because it's a hang and you forced the crash dump yourself.
You may find a bad trap in ::msgbuf, but that is a side effect of
$<systemdump.
I have some questions as follows.
Would you be so kind as to give me some suggestions?
[[ Q1 ]]
I can't find the kernel thread corresponding to the Sync570 module by using the
command "::threadlist -v".
But I can get the LOADADDR:
::modinfo !grep Sync
161 feba4340 cc60 1 dsync (Sync/570 Device Driver)
How can I find the address of the kernel thread corresponding to the Sync570 module?
If you know the function names of the Sync570 driver, you can log all of the
kernel thread stacks to a text file and then search that file.
::log /a.txt
mdb: logging to "/a.txt"
::threadlist -v
Then grep a.txt for your driver's function names.
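The search step can also be scripted. Below is a minimal sketch in Python, assuming the log is plain text with one stack frame per line; the function name "dsync_intr" and the sample lines are invented for illustration and are not taken from the real driver or dump.

```python
import re

def find_frames(log_lines, name_pattern):
    # Return (line_number, text) for every logged line matching the pattern.
    rx = re.compile(name_pattern)
    return [(n, line.rstrip("\n"))
            for n, line in enumerate(log_lines, start=1)
            if rx.search(line)]

# Hypothetical excerpt of a logged ::threadlist -v session.
# "dsync_intr" is an assumed function name, used only as an example.
sample = [
    "d2ca0de0 ...\n",
    "    swtch+0x165\n",
    "    dsync_intr+0x4c\n",
]
hits = find_frames(sample, r"\bdsync")
for lineno, text in hits:
    print(lineno, text)
```

For the real log you would pass the open file instead of the sample list, for example find_frames(open("/a.txt"), r"\bdsync"), and then look upward from each hit to find the thread address that owns the stack.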
[[ Q2 ]]
::msgbuf
panic[cpu0]/thread=d2c84de0:
BAD TRAP: type=e (#pf Page fault) rp=d2c84cec addr=0 occurred in module
"<unknown>" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0x202, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0 cr3: 4226000
gs: 1b0 fs: 0 es: 160 ds: 160
edi: d2f50a60 esi: fef4b2a8 ebp: d2c84d34 esp: d2c84d1c
ebx: d2f54180 edx: d2f541f8 ecx: 1f eax: fed6c870
trp: e err: 10 eip: 0 cs: 158
efl: 10002 usp: 202 ss: d2c84d3c
d2c84c4c unix:die+a7 (e, d2c84cec, 0, 0)
d2c84cd8 unix:trap+f56 (d2c84cec, 0, 0)
d2c84cec unix:cmntrap+83 ()
d2c84d34 0 (d2c84d44, fe81189a,)
d2c84d3c genunix:kdi_dvec_enter+a (d2c84d50, fe81183c,)
d2c84d44 unix:debug_enter+32 (0)
d2c84d50 unix:abort_sequence_enter+27 (0)
d2c84d64 kbtrans:kbtrans_streams_key+3e (d2f54180, 1f, 0)
d2c84d88 kb8042:kb8042_received_byte+b2 (fef4b1a8, 1e)
d2c84da0 kb8042:kb8042_intr+65 (fef4b1a8)
d2c84db8 i8042:i8042_intr+a4 (d2f50980)
----------------------------------------------------------------------------
::cpuinfo -v
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH   THREAD   PROC
  0 fec20ae4  1b    8    0 104   no    no t-740847 d2c84de0 sched
               |    |    |
    RUNNING <--+    |    +--> PIL THREAD
      READY         |           5 d2c84de0
     EXISTS         |           3 d2ca0de0
     ENABLE         |           - d2c28de0 (idle)
                    |
                    +--> PRI THREAD   PROC
                          99 d2c9ade0 sched
                          99 d2c97de0 sched
                          60 d3264a00 fsflush
                          60 d2e1ade0 sched
                          60 d2e37de0 sched
                          60 d4644de0 sched
                          60 d96dcde0 sched
                          59 d38e7400 Xsun
d2c84de0::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR DISPTIME BOUND PR
d2c84de0 onproc 809 0 3 104 0 5 d2ca0de0 0 -1 2
d2ca0de0::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR DISPTIME BOUND PR
d2ca0de0 onproc 9 0 3 102 0 3 d2c28de0 46a51 -1 1
d2ca0de0::findstack -v
stack pointer for thread d2ca0de0: d2ca0c2c
d2ca0de0 0xd94c62bc()
----------------------------------------------------------------------------
After I pressed "F1+A", the kernel created the thread "d2c84de0" to respond
to the keyboard interrupt (PIL = 5, PRI = 104),
but another thread, "d2ca0de0", was still running on the CPU at the same time
(PIL = 3, PRI = 102).
I guess some event caused the kernel to create the thread d2ca0de0, and then the
kernel hung until I pressed "F1+A"; the kernel then created another thread,
d2c84de0, and finally crashed.
I have no idea what caused the kernel to create thread d2ca0de0 (PRI=102, PIL=3).
You shouldn't guess. We can already see a large tick count (740847) on
CPU0, which might indicate that something is wrong. For example, if you enter
kmdb via a tip line instead of the keyboard, it quite possibly just uses
the original thread stack on CPU0, which means you should look into the
stack of thread d2c84de0.
Sometimes, because the stack pointer changes on entering kmdb, you can't get
the back trace with the findstack dcmd. You then have to examine the stack
directly with a command of the form <address of stack pointer>,200/nap.
In this case, the stack pointer for thread d2ca0de0 is d2ca0c2c, so you can
check the thread with "d2ca0c2c,200/nap".
Anyway, we saw that the interrupt thread d2c84de0 pinned another interrupt
thread, d2ca0de0, on CPU0. You should check both interrupt thread stacks
and make sure they are not related to your driver.
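The pinning relationship can be read off the INTR column of the two ::thread outputs above. As a rough sketch (Python, assuming the 12-column ::thread layout shown in this mail and that INTR holds the address of the interrupted thread), the chain can be walked like this:

```python
THREAD_ROWS = """\
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR DISPTIME BOUND PR
d2c84de0 onproc 809 0 3 104 0 5 d2ca0de0 0 -1 2
d2ca0de0 onproc 9 0 3 102 0 3 d2c28de0 46a51 -1 1
"""

def parse_thread_rows(text):
    # Map thread address -> (PIL, INTR) from pasted ::thread rows.
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 12 and fields[0] != "ADDR":
            rows[fields[0]] = (fields[7], fields[8])
    return rows

def pinned_chain(rows, start):
    # Follow INTR links until we reach a thread we have no row for.
    chain = [start]
    while chain[-1] in rows:
        nxt = rows[chain[-1]][1]
        if nxt in ("0", "-") or nxt in chain:
            break
        chain.append(nxt)
    return chain

rows = parse_thread_rows(THREAD_ROWS)
chain = pinned_chain(rows, "d2c84de0")
print(" -> ".join(chain))
```

With the rows quoted above, the chain runs from the PIL 5 keyboard interrupt thread to the PIL 3 interrupt thread and ends at d2c28de0, which ::cpuinfo -v labels as the idle thread. The PIL 3 thread is therefore the one whose stack deserves the closest look.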
It seems Solaris 10 doesn't support the ::interrupts dcmd, but you really
need to know what value your driver's interrupt-priorities property has; maybe
you can check it in your driver configuration file.
[[ Q3 ]]
::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fec20ae4 1b 8 0 104 no no t-740847 d2c84de0 sched
::cycinfo -v
CPU CYC_CPU STATE NELEMS ROOT FIRE HANDLER
0 d9aabe00 online 4 d9aabd80 96b6b848e80 clock
                                  2
                                  |
               +------------------+------------------+
               0                                     1
               |                                     |
     +---------+--------+                  +---------+---------+
     3
     |
+----+----+
ADDR NDX HEAP LEVL PEND FIRE USECINT HANDLER
d9aabd80 0 1 high 0 96b6b848e80 10000 cbe_hres_tick
d9aabda0 1 2 low 741253 96b6b848e80 10000 apic_redistribute_compute
d9aabdc0 2 0 lock 406 96b6b848e80 10000 clock
d9aabde0 3 3 high 0 96b6d4e5200 1000000 deadman
-----------------------------------------------------------------------------------
The SWITCH value for thread d2c84de0 is 740847.
It means that the thread has been on CPU0 for about 740847 ticks (1 tick = 10 ms);
an interrupt thread shouldn't hold a CPU for such a long time.
The PEND value of apic_redistribute_compute is 741253.
The PEND value of clock is 406.
(741253 - 406) == 740847
What does it mean? Could you please explain it?
The PEND value of clock just indicates how many times the clock routine was not
handled. That is quite possible if you ran into a system hang.
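To put the numbers in wall-clock terms, here is a small sketch in Python that only converts units and re-checks the subtraction; it assumes the 10 ms tick stated above (which matches the 10000 usec USECINT of the clock cyclic) and says nothing about the cause of the hang.

```python
TICK_MS = 10  # 1 tick = 10 ms, as stated above

def ticks_to_hms(ticks, tick_ms=TICK_MS):
    # Convert a tick count to (hours, minutes, seconds, milliseconds).
    total_ms = ticks * tick_ms
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds, ms

switch_ticks = 740847   # SWITCH column of ::cpuinfo (t-740847)
pend_apic = 741253      # PEND of apic_redistribute_compute
pend_clock = 406        # PEND of clock

print(ticks_to_hms(switch_ticks))            # time since last switch on CPU0
print(pend_apic - pend_clock == switch_ticks)
```

So the thread had been on CPU0 for a little over two hours when the dump was forced, which fits the "few minutes to a couple of hours" window in which the hang shows up.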
[[ Q4 ]]
::ipcs
Message queues:
failed to read 'msq_svc'; module not present
Shared memory:
ADDR REF ID KEY MODE PRJID ZONEID OWNER GROUP CREAT CGRP
d4915f50 1 3 103 0666 3 0 1002 102 1002 102
d3f0b090 1 2 101 0666 3 0 0 0 0 0
d3f0b2c0 1 1 102 0666 3 0 1002 102 1002 102
d3f0bbf0 1 0 100 0666 3 0 1002 102 1002 102
Semaphores:
ADDR REF ID KEY MODE PRJID ZONEID OWNER GROUP CREAT CGRP
d4915ee0 3 3 103 0666 3 0 1002 102 1002 102
d3f0b1e0 3 2 101 0666 3 0 0 0 0 0
d3f0b250 4 1 102 0666 3 0 1002 102 1002 102
d3f0bb80 7 0 100 0666 3 0 1002 102 1002 102
-------------------------------------------
I don't know which threads are accessing the semaphore "d3f0b1e0".
How can I find these unknown threads?
I have no idea about that, but I don't think it is related to your hang.
::msgbuf
MESSAGE
/[EMAIL PROTECTED],0/pci103c,[EMAIL PROTECTED],2 (uhci2): failed to attach
pcplusmp: pciclass,0c0300 (uhci) instance 3 vector 0x16 ioapic 0x1 intin 0x16 is
bound to cpu 0
/[EMAIL PROTECTED],0/pci103c,[EMAIL PROTECTED],3 (uhci3): failed to attach
cpu0: x86 (GenuineIntel family 15 model 4 step 10 clock 3000 MHz)
cpu0: Intel(r) Pentium(r) 4 CPU 3.00GHz
panic[cpu0]/thread=d2c84de0:
BAD TRAP: type=e (#pf Page fault) rp=d2c84cec addr=0 occurred in module
"<unknown>" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0x202, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0 cr3: 4226000
gs: 1b0 fs: 0 es: 160 ds: 160
edi: d2f50a60 esi: fef4b2a8 ebp: d2c84d34 esp: d2c84d1c
ebx: d2f54180 edx: d2f541f8 ecx: 1f eax: fed6c870
trp: e err: 10 eip: 0 cs: 158
efl: 10002 usp: 202 ss: d2c84d3c
d2c84c4c unix:die+a7 (e, d2c84cec, 0, 0)
d2c84cd8 unix:trap+f56 (d2c84cec, 0, 0)
d2c84cec unix:cmntrap+83 ()
d2c84d34 0 (d2c84d44, fe81189a,)
d2c84d3c genunix:kdi_dvec_enter+a (d2c84d50, fe81183c,)
d2c84d44 unix:debug_enter+32 (0)
d2c84d50 unix:abort_sequence_enter+27 (0)
d2c84d64 kbtrans:kbtrans_streams_key+3e (d2f54180, 1f, 0)
d2c84d88 kb8042:kb8042_received_byte+b2 (fef4b1a8, 1e)
d2c84da0 kb8042:kb8042_intr+65 (fef4b1a8)
d2c84db8 i8042:i8042_intr+a4 (d2f50980)
syncing file systems...
2
2
done
dumping to /dev/dsk/c1d0s1, offset 429391872, content: kernel
From the back trace output, I found that thread d2c84de0 should be the
keyboard interrupt thread, as you mentioned, but you really need to look
into all of the interrupt thread stacks on each CPU and all of the thread
stacks on the run queue. In any case, you should find out why the system
hung, or only looks hung.
You also need to make sure the system doesn't have a memory resource issue
by invoking ::memstat and ::kmastat.
If possible, you can also get the ACT (Automated Crash Tool) tools from the
sunsolve site; they can give you an automatic analysis based on your crash
dump files.
It seems you come from China, so maybe you can read my recent Chinese blog post
about how to debug a hang on Solaris. It walks through a real case of
debugging a hang in the Solaris e1000g driver:
http://blog.csdn.net/yayong/archive/2007/03/04/1520604.aspx
Hopefully it will be of some help.
--
Cheers,
----------------------------------------------------------------------
Oliver Yang | [EMAIL PROTECTED] | x82229 | Work from office