You should check for updated on-board NIC firmware and upgrade your tg3 driver to the latest version.
Your problem is almost certainly a case where the tg3 driver stops picking up traffic from the Broadcom NIC's buffers. When the buffers fill without the driver asking to flush them, the NIC's passthrough to the BMC begins to fail.

Broadcom added a NIC watchdog to their firmware and driver: the driver notifies the firmware that it is loaded and then sends a heartbeat every so often. If the kernel panics and/or tg3 goes out to lunch, the firmware sees the missed heartbeats and switches the NIC to operate as if no driver were loaded, which maintains passthrough.

Another workaround, if this isn't working or isn't palatable, is to set up the BMC watchdog timer, so that if the kernel panics the system either powers off or resets and you preserve passthrough to the BMC.

I know on IBM systems we've paid a lot of attention to this and understand very well how to deal with kernel panics and the like with respect to systems management. Unfortunately, the NIC-side fix requires that any NIC driver you load (and you almost certainly want one) implement this watchdog facility, so you need a newer tg3. The BMC watchdog is the solution that works regardless of which NIC driver was loaded prior to the panic.

We use the same BCM5721 chips on x336 and x346, and I have positively confirmed that, given the right firmware/driver combination, a system panic does not knock out system access.

Look for FWCMD_NICDRV_ALIVE somewhere in your tg3 source; it should have a comment near it like /* Heartbeat is only sent once every 120 seconds. */. If it is there, your tg3 has the bit needed to check in with the watchdog. Just make sure your vendor gives you the latest NIC firmware to go with it.
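If you'd rather not eyeball the driver source, here is a rough sketch of that check in Python. It only does a plain substring search for the FWCMD_NICDRV_ALIVE marker and prints a couple of lines of context around each hit. The function names and the context size are made up for illustration, and the location of tg3.c in your kernel tree is something you will have to supply yourself.

```python
# Marker named in the message above. Finding it suggests the driver
# sends the "driver alive" heartbeat to the NIC firmware.
MARKER = "FWCMD_NICDRV_ALIVE"


def find_heartbeat_marker(source_text, context=2):
    """Return (line_number, snippet) pairs for every line mentioning MARKER.

    line_number is 1-based; snippet holds the matching line plus up to
    `context` lines before and after it, so a nearby heartbeat comment
    is visible alongside the hit.
    """
    lines = source_text.splitlines()
    hits = []
    for idx, line in enumerate(lines):
        if MARKER in line:
            start = max(0, idx - context)
            end = min(len(lines), idx + context + 1)
            hits.append((idx + 1, "\n".join(lines[start:end])))
    return hits


def check_file(path):
    """Hypothetical helper: scan one source file and print what was found."""
    with open(path, errors="replace") as fh:
        hits = find_heartbeat_marker(fh.read())
    if not hits:
        print("%s: no %s found; this tg3 probably lacks the heartbeat" % (path, MARKER))
    for lineno, snippet in hits:
        print("%s:%d:\n%s\n" % (path, lineno, snippet))
    return bool(hits)
```

Call check_file() with the path to tg3.c from the kernel source you are actually running. Because it is a substring match, it will also catch any suffixed variants of the name. A hit only tells you about the driver side; the NIC firmware still has to support the watchdog.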
If your vendor is unable to provide you with NIC firmware to help you overcome this, buy from IBM and fund my salary ;)

On Wed, 2006-10-04 at 15:28 +0100, Colin Keith wrote:
> Hi,
>
> I'm still trying to work out my IPMI problems and I was wondering if anyone
> could make suggestions on the following. I have a server out at a data
> center its a dual AMD 254 Operton box on a Tyan m/b with Broadcom NetXtreme
> BCM5721 nics (using tg3 driver). It was shipped from Penguin (and since I
> see them on this list occassionally I assume that they tested all of this
> stuff and know it works :).
>
> I can access the box via IPMI over the LAN without problems when it is up
> and running, but for some reason it is falling down every so few weeks
> (running FC5 and it seems to be a problem with the aacraid driver in 2.6.17
> ..). When it does I can't get any commands through at all, including a
> reset, which was our entire reason for getting the IPMI support on the box.
>
> # ipmitool -I open bmc info
> Device ID                 : 32
> Device Revision           : 1
> Firmware Revision         : 1.6
> IPMI Version              : 2.0
> Manufacturer ID           : 20569
> Manufacturer Name         : Unknown (0x5059)
> Product ID                : 17 (0x0011)
> Device Available          : yes
> Provides Device SDRs      : yes
> Additional Device Support :
>     Sensor Device
>     SDR Repository Device
>     SEL Device
>     FRU Inventory Device
>     IPMB Event Receiver
>     IPMB Event Generator
> Aux Firmware Rev Info     :
>     0x00
>     0x00
>     0x00
>     0x00
>
> LAN channel:
>
> # ipmitool -I open channel info 6
> Channel 0x6 info:
>   Channel Medium Type   : 802.3 LAN
>   Channel Protocol Type : IPMB-1.0
>   Session Support       : multi-session
>   Active Session Count  : 0
>   Protocol Vendor ID    : 7154
>   Volatile(active) Settings
>     Alerting            : enabled
>     Per-message Auth    : enabled
>     User Level Auth     : enabled
>     Access Mode         : always available
>   Non-Volatile Settings
>     Alerting            : enabled
>     Per-message Auth    : enabled
>     User Level Auth     : enabled
>     Access Mode         : always available
>
> # ipmitool -I open lan print 6
> Set in Progress         : Set Complete
> Auth Type Support       : NONE MD2 MD5 PASSWORD
> Auth Type Enable        : Callback : NONE MD2 MD5 PASSWORD
>                         : User     : NONE MD2 MD5 PASSWORD
>                         : Operator : NONE MD2 MD5 PASSWORD
>                         : Admin    : NONE MD2 MD5 PASSWORD
>                         : OEM      : NONE MD2 MD5 PASSWORD
> IP Address Source       : Static Address
> IP Address              : 38.x.y.z
> Subnet Mask             : 255.255.255.224
> MAC Address             : 00:a0:d1:e1:e6:24
> SNMP Community String   : XXXXXXX
> IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
> BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Enabled
> Gratituous ARP Intrvl   : 2.0 seconds
> Default Gateway IP      : 38.x.y.z
> Default Gateway MAC     : 00:16:46:7d:f6:00
> Backup Gateway IP       : 38.x.y.z
> Backup Gateway MAC      : 00:03:47:84:c4:3f
> 802.1q VLAN ID          : Disabled
> 802.1q VLAN Priority    : 0
> RMCP+ Cipher Suites     : 0,1
> Cipher Suite Priv Max   : Not Available
>
> # ipmitool -H XXXX -I lanplus -U XXXXX -f /root/.ipmipass -C AES-CBC-128
> # user list 6
> ID  Name              Callin  Link Auth  IPMI Msg  Channel Priv Limit
> 1                     true    false      true      ADMINISTRATOR
> 2   XXXXX             true    false      true      ADMINISTRATOR
> 3   QRSTUVWXYZ123456  true    false      true      ADMINISTRATOR
> 4   7890-=&*()_+      true    false      true      ADMINISTRATOR
>
>
> While it was down I would send a command from a box on the same LAN
> segment and get:
>
> # ipmitool -H XXX -I lan -U XXXXX -f /root/.ipmipass -C AES-CBC-128 -vv #
> chassis status
> ipmi_lan_send_cmd:opened=[0], open=[134767402]
> IPMI LAN host apollo port 623
> Sending IPMI/RMCP presence ping packet
> ipmi_lan_send_cmd:opened=[1], open=[134767402]
> No response from remote controller
> Get Auth Capabilities command failed
> ipmi_lan_send_cmd:opened=[1], open=[134767402]
>
> # ipmitool -V
> ipmitool version 1.8.8
>
> (both boxes)
>
> I have the BMC set to send gratuitous ARP's every 2s. I now have another
> box out there too and I can see the IPMI enabled server sending these
> packets out, even when it was down. I had the ARP info hard coded on the
> other box to ensure that wasn't a problem too, but I wasn't able to get
> responses back from the box. As there was no response from the remote
> server it seems most likely that it wasn't picking up the data. It is,
> I suppose, possible that it was processing the request and responding and
> for some reason those packets didn't reach my other server because the
> responses contained invalid info, but I saw no evidence of any other
> traffic on the box I was sending the commands from so I don't think that
> this is likely - plus it responds without problem when the OS is running.
>
> Questions:
>
> I previously had the "IP Source Address" no the LAN channel set to the
> default of "Unspecified". I've since changed it to "static". Would this
> have made a difference?
>
> I share the BMC IP with the server's IP. This isn't a problem when the OS
> is running or when the box is plugged in but not turned on. Could this be
> a problem if the OS crashes though?
>
> Does anyone know, from the manufacturer ID, who/where I should be looking
> for firmware updates? Also if anyone has this same manufacturer, do you use
> any of the special work arounds? (the "-o" switch in the ipmitool command)
> Could you post your config/say if you have this problem as well or not.
>
> While I'm at it, does anyone have a list of the hex codes for the raw
> commands?
>
> I can't for the life of me work out why IPMI works if the box is up and
> running the OS, or if it is powered down and it can then be powered up
> again, but if FC5 crashes then I can't get any response from the IPMI
> controller.
>
> Thanks in advance for any suggestions.
>
> Colin.

_______________________________________________
Ipmitool-devel mailing list
Ipmitool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ipmitool-devel