Carson Tan wrote:
Hi Minskey and Kyle,

Thanks for all your discussion on this.

I found that the SOL session is gone after executing the following code in bge_asf_pre_reset_operations:
bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);

Hi,

I've never done driver development, so if I'm way off base just say so....

The code line you quote above is part of:

    event = bge_reg_get32(bgep, RX_RISC_EVENT_REG);
    bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);


Is this section of code atomic?
Can the HW change the register on its own?

The failure is 100% reproducible, and not intermittent, so I normally wouldn't consider a race condition right away, but it occurred to me that any changes to the register between the get and the put would be lost by this code.
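
To make that concern concrete, here is a minimal user-space sketch of the lost-update window. It is not driver code: the register is an ordinary variable, and every name in it (demo_reg, demo_get32, demo_put32, demo_hw_update, and the two bit values) is hypothetical and chosen only for illustration. I don't know whether the real hardware ever writes RX_RISC_EVENT_REG on its own; the sketch only shows what would happen if it did.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this illustration. */
#define DEMO_ASF_EVENT  0x00004000u  /* bit the driver wants to set */
#define DEMO_HW_EVENT   0x00000001u  /* bit the "hardware" sets on its own */

static uint32_t demo_reg;  /* plain variable standing in for a device register */

static uint32_t demo_get32(void) { return demo_reg; }
static void demo_put32(uint32_t v) { demo_reg = v; }

/* Simulates the device changing the register behind the driver's back. */
static void demo_hw_update(void) { demo_reg |= DEMO_HW_EVENT; }

/*
 * Read-modify-write, optionally with a simulated hardware update landing
 * between the get and the put. Returns the final register value.
 */
uint32_t demo_rmw(int hw_races)
{
    uint32_t event;

    demo_reg = 0;
    event = demo_get32();                /* snapshot of the register */
    if (hw_races)
        demo_hw_update();                /* change arrives after the snapshot */
    demo_put32(event | DEMO_ASF_EVENT);  /* writes back the stale snapshot */
    return demo_get32();
}

/* Sanity check: without a racing update, only the requested bit is set. */
void demo_selfcheck(void)
{
    assert(demo_rmw(0) == DEMO_ASF_EVENT);
}
```

With hw_races set, the simulated hardware bit is overwritten by the put, which is exactly the kind of loss I was wondering about.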

Poking around, I also noticed this function:

    void
    bge_reg_set32(bge_t *bgep, bge_regno_t regno, uint32_t bits)
    {
            uint32_t regval;

            BGE_TRACE(("bge_reg_set32($%p, 0x%lx, 0x%x)",
                (void *)bgep, regno, bits));

            regval = bge_reg_get32(bgep, regno);
            regval |= bits;
            bge_reg_put32(bgep, regno, regval);
    }

I don't know if it would be any better protected than the existing code above, but it seems like the code above could have been re-written as:

    bge_reg_set32(bgep, RX_RISC_EVENT_REG, RRER_ASF_EVENT);
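
As a rough check that the two forms really do the same thing, here is a small user-space sketch. Again, this is not driver code: sim_reg is a plain variable standing in for the register, and all of the sim_* names are hypothetical. sim_set32 has the same get / OR / put shape as the helper quoted above, minus the tracing.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t sim_reg;  /* plain variable standing in for a device register */

static uint32_t sim_get32(void) { return sim_reg; }
static void sim_put32(uint32_t v) { sim_reg = v; }

/* Same shape as the quoted helper: get, OR in the bits, put. */
static void sim_set32(uint32_t bits)
{
    uint32_t regval = sim_get32();

    regval |= bits;
    sim_put32(regval);
}

/* Open-coded get followed by put, as in the pre-reset code. */
uint32_t sim_open_coded(uint32_t initial, uint32_t bits)
{
    uint32_t event;

    sim_reg = initial;
    event = sim_get32();
    sim_put32(event | bits);
    return sim_reg;
}

/* The same update done through the set-style helper. */
uint32_t sim_with_helper(uint32_t initial, uint32_t bits)
{
    sim_reg = initial;
    sim_set32(bits);
    return sim_reg;
}

/* Both forms should leave the register in the same state. */
void sim_selfcheck(void)
{
    assert(sim_open_coded(0x5u, 0x4000u) == sim_with_helper(0x5u, 0x4000u));
}
```

Since the helper is still a separate get and put, it would tidy the code but would not close any window between the read and the write.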


Am I missing something?


Also, I noticed several parts of bge_main2.c (line 634) and bge_chip2.c (lines 4367, 4714) that specifically mention problems with the IBM BladeCenter HS20 blade. Nothing discussed there seemed immediately relevant to me, but since you said the code in the area that triggers the disconnect hasn't changed since S10, I'm wondering whether any of the areas that mention the HS20 have changed since S10?

Maybe a problem created by a change in one of them doesn't rear its head until we get to the code we're all looking at?

 -Kyle

_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
