rivanwang wrote:
Hi,

I have one workstation (HP xw4300) running Solaris 10 (x86) with one Digi 
Sync570i card.
The system may hang at any time, after anywhere from a few minutes to a couple 
of hours, while the card is receiving data frames.

I suspect the hang is caused by the driver module for the Sync570; however, 
the same driver works properly on a Solaris 8 system.
We used to install Solaris 8 on the HP xw4100, but now we have to install 
Solaris 10 on the HP xw4300 (we can't get the HP xw4100 on the market any more).

I boot the Solaris system with kmdb loaded. After the system hangs I can't ping 
the host, and the keyboard and mouse do not respond.
I can get a crash dump file by pressing "F1+A" and then entering "$<systemdump".

Analysing the crash dump file, I can't find problems such as a 'mutex 
deadlock' or a 'bad trap'.
I really don't know what to do next!

# The crashdump files can be downloaded from the following URLs :
#     www.ras.com.cn/rivanwang/crash_4.tar.gz
#     www.ras.com.cn/rivanwang/crash_8.tar.gz
#     www.ras.com.cn/rivanwang/crash_7_nor.tar.gz
# "crash_7_nor.tar.gz" is generated before system hanging happens.


Yes, because this is a hang and you forced the crash dump yourself. You may find a bad trap in the ::msgbuf output, but that is a side effect of $<systemdump.
I have some questions as follows.
Would you be so kind as to give me some suggestions?

[[ Q1 ]]

I can't find the kernel thread corresponding to the Sync570 module by using the 
command "::threadlist -v".
But I can get the LOADADDR:
 ::modinfo !grep Sync
161 feba4340     cc60   1 dsync (Sync/570 Device Driver)

How can I find the address of the kernel thread corresponding to the Sync570 module?


If you know the function names of the Sync570 driver, you can log all of the kernel threads to a text file and then search that file.

::log /a.txt
mdb: logging to "/a.txt"

::threadlist -v

Then grep a.txt for your driver's function names.
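
As a rough sketch of that search, the Python below pulls out every block of the logged output that mentions a driver function, so the matching frame stays next to its thread address. It assumes the log separates threads with blank lines, which may not match your mdb output exactly. The sample text and the function name dsync_intr are hypothetical and are not taken from the crash dump.

```python
import re


def find_thread_blocks(log_text, pattern):
    """Return the blank-line-separated blocks of log_text that match pattern."""
    regex = re.compile(pattern)
    # Assumption: each thread's entry is separated from the next by a blank line.
    blocks = re.split(r"\n\s*\n", log_text)
    return [block.strip("\n") for block in blocks if regex.search(block)]


# Hypothetical excerpt, for illustration only; not real ::threadlist -v output.
SAMPLE_LOG = """\
d2c28de0 ... idle thread
    idle+0x40()
    thread_start+8()

d9999de0 ... some thread
    dsync_intr+0x1c()
    intr_thread+0x107()
"""

if __name__ == "__main__":
    # "dsync" is the module name shown by ::modinfo; the "_" suffix is a guess
    # at how the driver prefixes its function names.
    for block in find_thread_blocks(SAMPLE_LOG, r"dsync_"):
        print(block)
        print("-" * 40)
```

To run it on the real log, pass the contents of /a.txt instead of SAMPLE_LOG. If nothing matches, try a looser pattern such as the bare module name.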

[[ Q2 ]]
::msgbuf
panic[cpu0]/thread=d2c84de0:
BAD TRAP: type=e (#pf Page fault) rp=d2c84cec addr=0 occurred in module 
"<unknown>" due to a NULL pointer dereference

sched:
#pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0x202, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0 cr3: 4226000
         gs:      1b0  fs:        0  es:      160  ds:      160
        edi: d2f50a60 esi: fef4b2a8 ebp: d2c84d34 esp: d2c84d1c
        ebx: d2f54180 edx: d2f541f8 ecx:       1f eax: fed6c870
        trp:        e err:       10 eip:        0  cs:      158
        efl:    10002 usp:      202  ss: d2c84d3c
d2c84c4c unix:die+a7 (e, d2c84cec, 0, 0)
d2c84cd8 unix:trap+f56 (d2c84cec, 0, 0)
d2c84cec unix:cmntrap+83 ()
d2c84d34 0 (d2c84d44, fe81189a,)
d2c84d3c genunix:kdi_dvec_enter+a (d2c84d50, fe81183c,)
d2c84d44 unix:debug_enter+32 (0)
d2c84d50 unix:abort_sequence_enter+27 (0)
d2c84d64 kbtrans:kbtrans_streams_key+3e (d2f54180, 1f, 0)
d2c84d88 kb8042:kb8042_received_byte+b2 (fef4b1a8, 1e)
d2c84da0 kb8042:kb8042_intr+65 (fef4b1a8)
d2c84db8 i8042:i8042_intr+a4 (d2f50980)

----------------------------------------------------------------------------
 ::cpuinfo -v
ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 fec20ae4  1b    8    0 104   no    no t-740847 d2c84de0 sched
               |    |    |
    RUNNING <--+    |    +--> PIL THREAD
      READY         |           5 d2c84de0
     EXISTS         |           3 d2ca0de0
     ENABLE         |           - d2c28de0 (idle)
                    |
                    +-->  PRI THREAD   PROC
                           99 d2c9ade0 sched
                           99 d2c97de0 sched
                           60 d3264a00 fsflush
                           60 d2e1ade0 sched
                           60 d2e37de0 sched
                           60 d4644de0 sched
                           60 d96dcde0 sched
                           59 d38e7400 Xsun
  d2c84de0::thread
    ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL     INTR DISPTIME BOUND PR
d2c84de0 onproc    809    0    3   104     0   5 d2ca0de0        0    -1  2
  d2ca0de0::thread
    ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL     INTR DISPTIME BOUND PR
d2ca0de0 onproc      9    0    3   102     0   3 d2c28de0    46a51    -1  1
  d2ca0de0::findstack -v
stack pointer for thread d2ca0de0: d2ca0c2c
  d2ca0de0 0xd94c62bc()

----------------------------------------------------------------------------

  After I pressed "F1+A", the kernel created the thread "d2c84de0" to respond 
to the keyboard interrupt (PIL = 5, PRI = 104),
but another thread, "d2ca0de0", was still running on the CPU at the same time 
(PIL = 3, PRI = 102).
  I guess some event caused the kernel to create the thread d2ca0de0, and then 
the kernel hung. When I pressed "F1+A", the kernel created another thread, 
d2c84de0, and finally crashed.

What caused the kernel to create thread d2ca0de0 (PRI=102, PIL=3)? I have no idea.

You shouldn't guess. We can already see a large tick count (740847) in the SWITCH column for CPU0, which might indicate that something is wrong. For example, if you had entered kmdb over a tip line instead of from the keyboard, it would quite possibly just have used the original thread's stack on CPU0. That means you should look into the stack of thread d2c84de0.

Sometimes, because the stack pointer changes on entering kmdb, the findstack dcmd can't produce a back trace. In that case you have to examine the stack by hand, with a command of the form <address of stack pointer>,200/nap.

Here the stack pointer for thread d2ca0de0 is d2ca0c2c, so you can examine that thread's stack with "d2ca0c2c,200/nap".

In any case, we can see that the interrupt thread d2c84de0 pinned another interrupt thread, d2ca0de0, on CPU0. You should check the stacks of both interrupt threads and make sure they are not related to your driver.

It seems Solaris 10 doesn't support the ::interrupts dcmd, but you really need to know what your driver's interrupt-priorities value is. Maybe you can check it in your driver's configuration file.
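
One way to do that check is sketched below: a small Python helper that scans the text of a driver.conf-style file for an interrupt-priorities property. Whether the Sync570 driver ships such a file, and whether it sets this property at all, is an assumption. The sample contents are hypothetical.

```python
import re

# A decimal or 0x-prefixed hexadecimal integer.
_NUM = r"(?:0[xX][0-9a-fA-F]+|\d+)"
_PROP = re.compile(r"interrupt-priorities\s*=\s*(" + _NUM + r"(?:\s*,\s*" + _NUM + r")*)")


def _to_int(token):
    """Convert a decimal or 0x-prefixed hexadecimal token to an int."""
    return int(token, 16) if token.lower().startswith("0x") else int(token)


def interrupt_priorities(conf_text):
    """Return one list of ints per interrupt-priorities property in conf_text."""
    # Strip '#' comments first (a simplification: ignores '#' inside quotes).
    body = re.sub(r"#.*", "", conf_text)
    found = []
    for match in _PROP.finditer(body):
        tokens = [t for t in re.split(r"[,\s]+", match.group(1)) if t]
        found.append([_to_int(t) for t in tokens])
    return found


# Hypothetical configuration file contents, for illustration only.
SAMPLE_CONF = """\
# example entry
name="dsync" parent="pci" interrupt-priorities=5;
"""

if __name__ == "__main__":
    print(interrupt_priorities(SAMPLE_CONF))
```

The value found can then be compared with the PIL column in the ::cpuinfo -v output (5 and 3 for the two interrupt threads above) to judge whether either thread could belong to the driver. An empty result only means the property is not set in that file.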





[[ Q3 ]]
  ::cpuinfo
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 fec20ae4  1b    8    0 104   no    no t-740847 d2c84de0 sched
  ::cycinfo -v
CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
  0 d9aabe00  online      4 d9aabd80     96b6b848e80 clock

                                       2
                                       |
                    +------------------+------------------+
                    0                                     1
                    |                                     |
          +---------+--------+                  +---------+---------+
          3
          |
     +----+----+

      ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
  d9aabd80   0    1 high     0     96b6b848e80   10000 cbe_hres_tick
  d9aabda0   1    2  low 741253    96b6b848e80   10000 apic_redistribute_compute
  d9aabdc0   2    0 lock   406     96b6b848e80   10000 clock
  d9aabde0   3    3 high     0     96b6d4e5200 1000000 deadman

-----------------------------------------------------------------------------------
The value of SWITCH for thread d2c84de0 is 740847.
It means that the thread has been on CPU0 for about 740847 ticks (1 tick = 10 ms). An interrupt thread shouldn't hold a CPU for such a long time.
The value of PEND of apic_redistribute_compute is 741253 ;
The value of PEND of clock is 406 .
(741253 - 406) == 740847. What does it mean? Could you please explain it?

The PEND value for clock just indicates how many times the clock routine was pending without being handled. That is quite possible if you ran into a system hang.
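
To put these numbers on a human scale, here is a short Python sketch of the arithmetic. It only uses values quoted above and the stated 10 ms tick, and it does not try to explain why the two PEND counters differ by exactly the SWITCH value.

```python
TICK_MS = 10            # 1 tick = 10 ms, as stated above (USECINT is 10000)

SWITCH_TICKS = 740847   # SWITCH column for CPU0, from ::cpuinfo
PEND_APIC = 741253      # PEND of apic_redistribute_compute, from ::cycinfo -v
PEND_CLOCK = 406        # PEND of clock, from ::cycinfo -v


def ticks_to_hms(ticks, tick_ms=TICK_MS):
    """Convert a tick count to (hours, minutes, seconds)."""
    total_ms = ticks * tick_ms
    hours, rem_ms = divmod(total_ms, 3_600_000)
    minutes, rem_ms = divmod(rem_ms, 60_000)
    return hours, minutes, rem_ms / 1000


if __name__ == "__main__":
    hours, minutes, seconds = ticks_to_hms(SWITCH_TICKS)
    print(f"{SWITCH_TICKS} ticks = {hours} h {minutes} min {seconds:.2f} s")
    print("PEND difference equals SWITCH:", PEND_APIC - PEND_CLOCK == SWITCH_TICKS)
```

At 10 ms per tick, 740847 ticks is about 2 hours and 3 minutes, far longer than an interrupt thread should stay on a CPU.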


[[ Q4 ]]
::ipcs
Message queues:
failed to read 'msq_svc'; module not present

Shared memory:
    ADDR   REF    ID      KEY  MODE PRJID ZONEID OWNER GROUP CREAT  CGRP
d4915f50     1     3      103  0666     3      0  1002   102  1002   102
d3f0b090     1     2      101  0666     3      0     0     0     0     0
d3f0b2c0     1     1      102  0666     3      0  1002   102  1002   102
d3f0bbf0     1     0      100  0666     3      0  1002   102  1002   102

Semaphores:
    ADDR   REF    ID      KEY  MODE PRJID ZONEID OWNER GROUP CREAT  CGRP
d4915ee0     3     3      103  0666     3      0  1002   102  1002   102
d3f0b1e0     3     2      101  0666     3      0     0     0     0     0
d3f0b250     4     1      102  0666     3      0  1002   102  1002   102
d3f0bb80     7     0      100  0666     3      0  1002   102  1002   102
-------------------------------------------
I don't know which threads are accessing the semaphore "d3f0b1e0".
How can I find these unknown threads?

I have no idea about that, but I don't think it is related to your hang.




  ::msgbuf
MESSAGE /[EMAIL PROTECTED],0/pci103c,[EMAIL PROTECTED],2 (uhci2): failed to attach
pcplusmp: pciclass,0c0300 (uhci) instance 3 vector 0x16 ioapic 0x1 intin 0x16 is
 bound to cpu 0
/[EMAIL PROTECTED],0/pci103c,[EMAIL PROTECTED],3 (uhci3): failed to attach
cpu0: x86 (GenuineIntel family 15 model 4 step 10 clock 3000 MHz)
cpu0: Intel(r) Pentium(r) 4 CPU 3.00GHz


panic[cpu0]/thread=d2c84de0: BAD TRAP: type=e (#pf Page fault) rp=d2c84cec addr=0 occurred in module "<unknown>" due to a NULL pointer dereference


sched: #pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0x0, sp=0x202, eflags=0x10002
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0 cr3: 4226000
         gs:      1b0  fs:        0  es:      160  ds:      160
        edi: d2f50a60 esi: fef4b2a8 ebp: d2c84d34 esp: d2c84d1c
        ebx: d2f54180 edx: d2f541f8 ecx:       1f eax: fed6c870
        trp:        e err:       10 eip:        0  cs:      158
        efl:    10002 usp:      202  ss: d2c84d3c

d2c84c4c unix:die+a7 (e, d2c84cec, 0, 0)
d2c84cd8 unix:trap+f56 (d2c84cec, 0, 0)
d2c84cec unix:cmntrap+83 ()
d2c84d34 0 (d2c84d44, fe81189a,)
d2c84d3c genunix:kdi_dvec_enter+a (d2c84d50, fe81183c,)
d2c84d44 unix:debug_enter+32 (0)
d2c84d50 unix:abort_sequence_enter+27 (0)
d2c84d64 kbtrans:kbtrans_streams_key+3e (d2f54180, 1f, 0)
d2c84d88 kb8042:kb8042_received_byte+b2 (fef4b1a8, 1e)
d2c84da0 kb8042:kb8042_intr+65 (fef4b1a8)
d2c84db8 i8042:i8042_intr+a4 (d2f50980)

syncing file systems...
 2
 2
 done
dumping to /dev/dsk/c1d0s1, offset 429391872, content: kernel

From the output of the back trace, thread d2c84de0 does appear to be the keyboard interrupt thread, as you mentioned. But you really need to look into the stacks of all interrupt threads on each CPU and of all threads on the run queue. In any case, you should find out why the system hung, or why it merely looks like a hang.

You also need to make sure the system does not have a memory resource problem, by running ::memstat and ::kmastat.

If possible, you can also get ACT (Automated Crash Tool) from the SunSolve site; it can give you an automatic analysis based on your crash dump files.

It seems you are from China, so maybe you can read my recent Chinese blog post about how to debug a hang on Solaris. It walks through a real case of debugging a hang in the Solaris e1000g driver:

http://blog.csdn.net/yayong/archive/2007/03/04/1520604.aspx

Hopefully it gives you some help.

--
Cheers,

----------------------------------------------------------------------
Oliver Yang | [EMAIL PROTECTED] | x82229 | Work from office

_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code
