Hi Kyle,
Thanks for your time.
You don't need to recompile the driver. Instead, you can add the
following line at the end of /kernel/drv/bge.conf:
    bge-debug-flags = 0xffffffff;
You can check the meaning of each bit used in the flag in the
bge_impl.h file (lines 1024 - 1057).
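To give a rough idea of how a flag like this usually gates trace output, here is a minimal stand-alone sketch. It is not the bge code: the DEMO_* names, the bit values, and demo_log() are made up for illustration, and the real bit definitions are the ones in bge_impl.h.

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit assignments; the real ones live in bge_impl.h. */
#define DEMO_DBG_TRACE	0x00000001u
#define DEMO_DBG_REGS	0x00000002u

/* Stand-in for the value a driver would read from its .conf property. */
static uint32_t demo_debug_flags = 0;

/* Number of messages emitted, so the effect of the mask is observable. */
static int demo_trace_count = 0;

static void
demo_log(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);
	(void) fputc('\n', stderr);
	demo_trace_count++;
}

/*
 * The double parentheses at the call site let a non-variadic macro
 * forward a variable argument list: the inner pair becomes the
 * argument list of demo_log().
 */
#define DEMO_TRACE(args) \
	do { \
		if (demo_debug_flags & DEMO_DBG_TRACE) \
			demo_log args; \
	} while (0)

/* Example caller: produces output only when the trace bit is set. */
static void
demo_reg_clr32(void *handle, unsigned long regno, uint32_t bits)
{
	DEMO_TRACE(("demo_reg_clr32($%p, 0x%lx, 0x%x)",
	    handle, regno, (unsigned int)bits));
}
```

With the mask at 0 the call is silent; with every bit set (as with 0xffffffff above) each call logs one line.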
I have lost control of my blade, and asked Jonathan to restore it. The
document has not been received yet. Sorry for the delay.
If you have enough time and it's not too much trouble, you may try some
previous builds of Nevada (such as snv_86, snv_100, etc.) to see if they
work well.
Thanks again,
Carson
Kyle McDonald wrote:
Carson Tan wrote:
Hi Kyle,
Thanks for your great effort on this. It really makes sense, and I
really appreciate it.
Hi again,
I have some time now, so I thought I'd keep looking into this. There
is one thing I wonder if you can help with?
In the code I see calls to macros like:
    BGE_TRACE(("bge_reg_clr32($%p, 0x%lx, 0x%x)",
        (void *)bgep, regno, bits));
How can I activate the debugging info that these macros provide? Do I
need to recompile the driver, or is there a value I can set in the
bge.conf file?
-Kyle
I have checked your disassembly of the disconnection point, and it
looks the same as mine. But it's hard to say what the root cause is
right now. I am still waiting for the document from IBM.
Meanwhile, I am trying to find out which previous build of Nevada
works well, as that will make it much easier for me to find the
differences.
I will let you know if there is any update.
Thanks again,
Carson
Kyle McDonald wrote:
Carson Tan wrote:
Hi Minskey and Kyle,
Thanks for all your discussion on this.
I found that the SOL session is gone after executing the following
code in bge_asf_pre_reset_operations:
bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);
Hi,
I've never done driver development, so if I'm way off base just say
so....
The code line you quote above is part of:
    event = bge_reg_get32(bgep, RX_RISC_EVENT_REG);
    bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);
Is this section of code atomic?
Can the HW change the register on its own?
The failure is 100% reproducible, and not intermittent, so I
normally wouldn't consider a race condition right away, but it
occurred to me that any changes to the register between the get and
the put would be lost by this code.
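To make the concern concrete, here is a small stand-alone sketch of the lost update I have in mind. It only simulates a register with a plain variable; the DEMO_* names and the demo_rmw() function are made up, and I don't know whether RX_RISC_EVENT_REG can actually be changed by the chip in this window.

```c
#include <stdint.h>

#define DEMO_EVENT_A	0x01u	/* bit the driver wants to set */
#define DEMO_EVENT_B	0x02u	/* bit the simulated hardware sets itself */

/* A plain variable standing in for a device register. */
static uint32_t demo_reg;

/*
 * Read-modify-write with an optional simulated hardware update that
 * lands between the read and the write. Returns the final value.
 */
static uint32_t
demo_rmw(uint32_t initial, int hw_interferes)
{
	uint32_t snapshot;

	demo_reg = initial;
	snapshot = demo_reg;			/* the "get" */
	if (hw_interferes)
		demo_reg |= DEMO_EVENT_B;	/* hardware changes the register */
	demo_reg = snapshot | DEMO_EVENT_A;	/* the "put" uses the stale value */
	return (demo_reg);
}
```

In this simulation demo_rmw(0, 1) ends with only DEMO_EVENT_A set: the bit the "hardware" set in between is overwritten by the stale snapshot.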
Poking around, I also noticed this function:
    void
    bge_reg_set32(bge_t *bgep, bge_regno_t regno, uint32_t bits)
    {
        uint32_t regval;

        BGE_TRACE(("bge_reg_set32($%p, 0x%lx, 0x%x)",
            (void *)bgep, regno, bits));

        regval = bge_reg_get32(bgep, regno);
        regval |= bits;
        bge_reg_put32(bgep, regno, regval);
    }
I don't know if it would be any better protected than the existing
code above, but it seems like the code above could have been
re-written as:
    bge_reg_set32(bgep, RX_RISC_EVENT_REG, RRER_ASF_EVENT);
Am I missing something?
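As a sanity check on that reading, here is a stand-alone sketch comparing the two forms against a mock register array. The demo_* names are made up, and the mock ignores everything a real register access involves (access handles, locking, the trace call).

```c
#include <stdint.h>

#define DEMO_NREGS	4

/* Mock register file; a real driver would go through the device. */
static uint32_t demo_regs[DEMO_NREGS];

static uint32_t
demo_get32(unsigned int regno)
{
	return (demo_regs[regno]);
}

static void
demo_put32(unsigned int regno, uint32_t val)
{
	demo_regs[regno] = val;
}

/* Same shape as the helper: get, OR in the bits, put. */
static void
demo_set32(unsigned int regno, uint32_t bits)
{
	uint32_t regval;

	regval = demo_get32(regno);
	regval |= bits;
	demo_put32(regno, regval);
}

/* Open-coded version, shaped like the two-line snippet. */
static void
demo_open_coded(unsigned int regno, uint32_t bits)
{
	uint32_t event;

	event = demo_get32(regno);
	demo_put32(regno, event | bits);
}
```

Starting from the same value, both leave the same result in the mock register, so in this simplified model the helper is just a tidier spelling of the same get/OR/put sequence and adds no protection against a change between the get and the put.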
Also, I noticed several parts of bge_main2.c (line 634) and
bge_chip2.c (lines 4367, 4714) that specifically mention problems with
the IBM BladeCenter HS20 blade. Nothing discussed there seemed
immediately obvious to me, but since you said the code in the area
that triggers the disconnect hasn't changed since S10, I'm wondering
if any of these areas that mention the HS20 have changed since S10?
Maybe a problem created by a change in one of them doesn't rear its
head until we get to the code we're all looking at?
-Kyle
--
Thanks and Regards,
Carson (Yong Tan)
Sun Microsystems China (ERI)
Email: yong....@sun.com
Tel : (86-10)6267-3681 (x51681)
_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss