Re: [driver-discuss] Driver trouble on Nevada...

Carson Tan Fri, 04 Dec 2009 00:43:05 -0800

Hi Kyle,

I am sorry, but I still haven't got the document from IBM. I found that the problem doesn't exist on build_81.

So I think maybe I can find some clues by checking the difference.

By the way, my console connection to the HS20 failed with the following message:

system:blade[4]> console
SOL is not ready
system:blade[4]> sol
-status enabled
SOL Session: Not Ready
SOL retry interval: 250 ms
SOL retry count: 3
SOL bytes sent: 0
SOL bytes received: 0
SOL destination IP address: 192.199.199.84
SOL destination MAC: 00:00:00:00:00:00
SOL I/O module slot number: 0
SOL console user ID:
SOL console login from:
SOL console session started: 11/23/09 -- 21:49:24
SOL console session stopped: 11/26/09 -- 01:46:47
Blade power state: On

SOL recommended action: Internal network path between the AMM and this blade server is currently not available. Please refer to AMM user guide for troubleshooting information.


Do you know how to bring it back?

Thanks,
Carson


Kyle McDonald wrote:

Hi Carson,

Any news on this?
I think I've found another related (possibly the same?) problem on different hardware, which might make it easier to debug.
I also have a bunch of IBM xSeries 346 servers. They're about the same age and very similar hardware-wise to the HS20, though not the same Broadcom chip (5716, I think?).
Anyway, on these I use the physical serial port for the console, so I never made the connection till now to what is happening on the HS20s. But I was thinking about it recently, and I started to wonder. I use IPMItool to reboot a lot of these all the time. I do a lot of JumpStart, AI, and Kickstart testing, so I'm always rebooting them. But for a long while now, whenever they're booted all the way into Nevada, the BMC becomes unreachable and IPMItool can't do anything with them. Reboot, and while the BIOS is in control during boot-up IPMItool works fine, but once Nevada boots it's like the BMC has gone to sleep again.
I'll have to boot it through the debugger. Maybe I can get some more data points that will help zero in on this problem.
 -Kyle


Kyle McDonald wrote:
Carson Tan wrote:
Hi Kyle,
Thanks for your great effort on this. It really makes sense, and I really appreciate it.
I have checked your disassembly of the disconnection point, and it looks the same as mine. But it's hard to say what the root cause is right now. I am still waiting for the document from IBM. Meanwhile, I am trying to find out which previous build of Nevada works well, as that will make it much easier for me to find the differences.

I will let you know if there is any update.

Just for curiosity's sake, I stepped through the same section of code on S10u8 (output below) and wouldn't you know it, it did disconnect. So, while I'm not sure what it means, I think it's telling us something. Why would it work on S10 when running freely outside the debugger, but disconnect when stepping through the code? And why does NV disconnect both inside and outside the debugger?
All I can think of is a timing issue. Something in the timing of S10 allows it to avoid disconnecting when running full tilt?
Anyway, it's food for thought.

 -Kyle



S10u8 booting:
ucode0 is /pseudo/uc...@0
pseudo-device: fssnap0
fssnap0 is /pseudo/fss...@0
pseudo-device: winlock0
winlock0 is /pseudo/winl...@0
pseudo-device: vol0
vol0 is /pseudo/v...@0
pseudo-device: pm0
pm0 is /pseudo/p...@0
pseudo-device: rsm0
rsm0 is /pseudo/r...@0
pseudo-device: pool0
pool0 is /pseudo/p...@0
Hostname: Einstein03
dump on /dev/zvol/dsk/zroot0/dump size 4096 MB
NIS domain name is Engineering.NIS
Loaded modules: [ crypto cpc ptm ufs sppp lofs logindmux md random ]
kmdb: stop at bge`bge_attach
kmdb: target stopped at:
bge`bge_attach: pushq  %rbp
[3]> bge_asf_pre_reset_operations:b
[3]> :c
kmdb: stop at bge`bge_asf_pre_reset_operations
kmdb: stop at bge`bge_asf_pre_reset_operations
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations:       pushq  %rbp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+1:     movl   $0x2,%edx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+6:     movq   %rsp,%rbp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+9:     pushq  %r13
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0xb:   movl   %esi,%r13d
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0xe:   movl   $0xb78,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x13:  pushq  %r12
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x15:  movq   %rdi,%r12
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x18:  pushq  %rbx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x19:  xorl   %ebx,%ebx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x1b:  subq   $0x8,%rsp
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x1f: call -0x448f<bge`bge_nic_put32>
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x24:  movl   $0x6810,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x29:  movq   %r12,%rdi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x2c: call -0x480c<bge`bge_reg_get32>
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x31:  movl   %eax,%edx
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x33:  movl   $0x6810,%esi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x38:  movq   %r12,%rdi
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x3b:  orb    $0x40,%dh
[3]> ::step over
kmdb: target stopped at:
bge`bge_asf_pre_reset_operations+0x3e: call -0x47fe<bge`bge_reg_put32>
[3]>
system> console -T blade[3]
SOL is not ready
system>
Thanks again,
Carson


Kyle McDonald wrote:
Carson Tan wrote:
Hi Minskey and Kyle,

Thanks for all your discussion on this.
I found that the SOL session is gone after executing the following code in bge_asf_pre_reset_operations:
bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);
Hi,
I've never done driver development, so if I'm way off base just say so....
The code line you quote above is part of (bge_chip2.c, lines 5759-5760):

    event = bge_reg_get32(bgep, RX_RISC_EVENT_REG);
    bge_reg_put32(bgep, RX_RISC_EVENT_REG, event | RRER_ASF_EVENT);

Is this section of code atomic?
Can the HW change the register on its own?
The failure is 100% reproducible, and not intermittent, so I normally wouldn't consider a race condition right away, but it occurred to me that any changes to the register between the get and the put would be lost by this code.
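
To make the lost-update concern above concrete, here is a minimal, self-contained C sketch. It does not touch real hardware: fake_reg, fake_get32, fake_put32, and FAKE_HW_EVENT are hypothetical stand-ins, and the 0x4000 value used for the ASF bit is only inferred from the "orb $0x40,%dh" instruction in the kmdb trace. It simulates the device setting a bit between the get and the put, and shows that the bit is overwritten by the stale write-back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device register (not the real bge mapping). */
static volatile uint32_t fake_reg;

/* Assumed bit values, for illustration only. */
#define	FAKE_ASF_EVENT	0x4000u	/* inferred from "orb $0x40,%dh" in the trace */
#define	FAKE_HW_EVENT	0x0001u	/* made-up bit the device might set by itself */

static uint32_t
fake_get32(void)
{
	return (fake_reg);
}

static void
fake_put32(uint32_t val)
{
	fake_reg = val;
}

/*
 * Run the get/OR/put sequence on a zeroed register. If hw_interferes is
 * nonzero, simulate the device setting FAKE_HW_EVENT between the get and
 * the put. Returns the final register value.
 */
static uint32_t
rmw_sequence(int hw_interferes)
{
	uint32_t event;

	fake_reg = 0;
	event = fake_get32();
	if (hw_interferes)
		fake_reg = fake_reg | FAKE_HW_EVENT;
	fake_put32(event | FAKE_ASF_EVENT);
	return (fake_reg);
}

/* Self-check: the device's bit does not survive the stale write-back. */
static void
check_lost_update(void)
{
	assert(rmw_sequence(0) == FAKE_ASF_EVENT);
	assert((rmw_sequence(1) & FAKE_HW_EVENT) == 0);
}
```

Calling check_lost_update() from a small test program should pass silently. Whether the real RX_RISC_EVENT_REG can actually change underneath the driver like this depends on the chip documentation, which this sketch does not model.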
Poking around, I also noticed this function:

    void
    bge_reg_set32(bge_t *bgep, bge_regno_t regno, uint32_t bits)
    {
            uint32_t regval;

            BGE_TRACE(("bge_reg_set32($%p, 0x%lx, 0x%x)",
                (void *)bgep, regno, bits));

            regval = bge_reg_get32(bgep, regno);
            regval |= bits;
            bge_reg_put32(bgep, regno, regval);
    }

I don't know if it would be any better protected than the existing code above, but it seems like the code above could have been re-written as:

    bge_reg_set32(bgep, RX_RISC_EVENT_REG, RRER_ASF_EVENT);

Am I missing something?
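
A minimal sketch of that comparison, again against a mock register rather than the real driver. The names fake_reg, fake_get32, fake_put32, fake_set32, and open_coded_set are hypothetical; only the get/OR/put shape is taken from the function quoted above. Under these assumptions both forms leave the same value in the register and issue exactly one read and one write each, so the helper would be tidier but no better protected.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock register with access counters (not real bge code). */
static uint32_t fake_reg;
static unsigned int reads, writes;

static uint32_t
fake_get32(void)
{
	reads++;
	return (fake_reg);
}

static void
fake_put32(uint32_t val)
{
	writes++;
	fake_reg = val;
}

/* Same shape as the quoted helper: read, OR in the bits, write back. */
static void
fake_set32(uint32_t bits)
{
	uint32_t regval;

	regval = fake_get32();
	regval |= bits;
	fake_put32(regval);
}

/* Same shape as the open-coded get/put pair in the reset path. */
static void
open_coded_set(uint32_t bits)
{
	uint32_t event;

	event = fake_get32();
	fake_put32(event | bits);
}

/*
 * Returns 1 if both forms leave the same register value and perform the
 * same number of reads and writes, starting from the same initial value.
 */
static int
forms_equivalent(uint32_t initial, uint32_t bits)
{
	uint32_t val_a, val_b;
	unsigned int reads_a, writes_a;

	fake_reg = initial;
	reads = writes = 0;
	open_coded_set(bits);
	val_a = fake_reg;
	reads_a = reads;
	writes_a = writes;

	fake_reg = initial;
	reads = writes = 0;
	fake_set32(bits);
	val_b = fake_reg;

	return (val_a == val_b && reads_a == reads && writes_a == writes);
}

/* Self-check with the bit value assumed from the trace (0x4000). */
static void
check_equivalence(void)
{
	assert(forms_equivalent(0, 0x4000u));
	assert(forms_equivalent(0x1234u, 0x4000u));
}
```

Since each form is still a separate read followed by a separate write, any window between the two accesses exists in both.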
Also, I noticed several parts of bge_main2.c (line 634) and bge_chip2.c (lines 4367, 4714) that specifically mention problems with the IBM BladeCenter HS20 blade. Nothing discussed there seemed immediately obvious to me, but since you said the code in the area that triggers the disconnect hasn't changed since S10, I'm wondering if any of these areas that mention the HS20 have changed since S10?
Maybe a problem created by a change in one of them doesn't rear its head until we get to the code we're all looking at?
 -Kyle



--
Thanks and Regards,
Carson (Yong Tan)
Sun Microsystems China (ERI)
Email: yong....@sun.com

Tel : (86-10)6267-3681 (x51681)

_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss
