RE: SuperMicro X5DP8-G2MB/(2)XEON 2.4/1GB RAM 5.4-S Freeze
From: Marc Olzheim
On Mon, Apr 11, 2005 at 10:12:32PM -0400, Aaron Summers wrote:

We have a SuperMicro X5DP8-G2 motherboard, 2x Xeon 2.4, 1GB RAM server running 5.4-STABLE that keeps freezing up. We have replaced RAM, HD, SCSI controller, etc., to no avail. We are running the SMP GENERIC kernel. I cannot get the system to panic, leave a core dump, etc. It just always freezes. The server functions as a web server in an HSphere cluster. I am about out of options besides loading 4.11 (since our 4-series servers never die). Any help, feedback, clues, similar experiences, etc. would be greatly appreciated.

On SCSI: The onboard Adaptec 7902 gives a dump on bootup but appears to work. I read the archived post about this issue. The system still locked up with an Adaptec 7982B that did not give this message.

The problem is with the periodic SMM interrupt and the BIOS. The attached program (ich-periodic-smm-disable.c) will fix the problem. For more information on what it does, see the Intel ICH3 datasheet. Compile and run it as 'gcc ich-periodic-smm-disable.c; ./a.out' and you will be good. Run this on each boot. I think you only need to clear PERIODIC_EN.

--don

Attachment: ich-periodic-smm-disable.c

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
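The attachment named in the message above is not preserved in this archive. As a rough sketch of the one step the message spells out (clearing PERIODIC_EN), here is the bit manipulation on its own. It assumes PERIODIC_EN is bit 14 of the ICH SMI_EN register, which should be checked against the ICH3 datasheet; the helper name is hypothetical, and the privileged I/O-port read and write that a real program needs are deliberately left out.

```c
#include <stdint.h>

/* Assumption: PERIODIC_EN is bit 14 of the ICH SMI_EN register. */
#define SMI_EN_PERIODIC_EN (UINT32_C(1) << 14)

/*
 * Hypothetical helper: return the SMI_EN value with only PERIODIC_EN
 * cleared. A real tool would read SMI_EN from the ACPI power-management
 * I/O block, pass it through this function, and write the result back.
 */
static uint32_t
smi_en_clear_periodic(uint32_t smi_en)
{
	return smi_en & ~SMI_EN_PERIODIC_EN;
}
```

All other bits are passed through untouched, which matches the message's remark that only PERIODIC_EN should need clearing.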
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]
...

As far as I understand this family of controllers the OS drivers aren't involved at all in case of a disk drive failure. It's strictly the controller's business to deal with it internally. The OS just sits there and waits until the controller is done with the retries and either drops into degraded mode or recovers from the disk error. That's why I initially speculated that there might be a timeout somewhere in PostgreSQL or FreeBSD that leads to data loss if the controller is busy for too long.

A somewhat radical way to at least make these failures as rare an event as possible would be to deliberately fail all remaining old disk drives, one after the other of course, in order to get rid of them. And if you are lucky the problem won't happen with newer drives anyway, in case the root cause is an incompatibility between the controller and the old drives.

Started that yesterday. I've got one 'old' one left. Sadly, the one that failed the night before last was not one of the 'old' ones, so this is no guarantee :)

From the raidutil -e log, I see this type of info. I'm not sure what the 'unknown' events are. The 'CRC Failure' is probably the problem? There's also Bad SCSI Status, unit attention, etc. Perhaps the driver doesn't deal with these properly?

$ raidutil -e d0
03/31/2005 23:37:59 Level 1 Lock for Channel 0 : Started
03/31/2005 23:37:59 Level 1 Lock for Channel 1 : Started
03/31/2005 23:38:09 Level 1 Lock for Channel 0 : Stopped
03/31/2005 23:38:22 Level 1 Lock for Channel 1 : Stopped
03/31/2005 23:38:22 Level 4 HBA=0 BUS=0 ID=0 LUN=0 Status Change Optimal => Degraded - Drive Failed
03/31/2005 23:38:22 Level 1 Unknown Event : 56 10 00 08 EE 89 4C 42 00 00 00 00
03/31/2005 23:38:22 Level 1 CRC Failure Number of dirty blocks = -1 D30A1F2A
03/31/2005 23:38:24 Level 3 HBA=0 BUS=0 ID=0 LUN=0 Bad SCSI Status - Check Condition 28 00 00 00 00 00 00 00 01 00 00 00
03/31/2005 23:38:24 Level 3 HBA=0 BUS=0 ID=0 LUN=0 Request Sense 70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00 Unit Attention
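The Request Sense bytes in the last log entry can be read by hand. The sketch below decodes SCSI fixed-format sense data (response code 0x70 or 0x71): the sense key is the low nibble of byte 2, and the additional sense code and qualifier are bytes 12 and 13. The struct and function names are made up for illustration and are not part of raidutil or the asr driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative container for the fields of interest. */
struct sense_summary {
	uint8_t response_code;	/* 0x70 = current error, fixed format */
	uint8_t sense_key;	/* 0x6 = UNIT ATTENTION */
	uint8_t asc;		/* additional sense code */
	uint8_t ascq;		/* additional sense code qualifier */
};

/*
 * Decode fixed-format sense data.
 * Returns 0 on success, -1 if the buffer is too short or the
 * response code is not a fixed-format one.
 */
static int
decode_fixed_sense(const uint8_t *buf, size_t len, struct sense_summary *out)
{
	if (len < 14)
		return -1;
	out->response_code = buf[0] & 0x7f;
	if (out->response_code != 0x70 && out->response_code != 0x71)
		return -1;
	out->sense_key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}
```

Applied to the logged bytes this gives sense key 0x6 (Unit Attention, as the log says) with ASC/ASCQ 0x29/0x02, which the SCSI tables list as a bus reset. That would be consistent with the controller resetting the bus around the drive failure, though it does not by itself explain the corruption.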
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: [EMAIL PROTECTED]
From: Uwe Doering [mailto:[EMAIL PROTECTED]
...

Did you merge 1.3.2.3 as well? This actually should have been one MFC.

Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it; only 3 are left that are old enough that they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!). Thanks for the suggestion on the CAM timeout, I've set that value.

Another drive failed and the same thing happened. After the failure, the RAID worked in degraded mode just fine, but many files had been corrupted during the failure. So I would suggest that this merge did not help, and the CAM timeout did not help either. This is very frustrating; again I am rebuilding my PostgreSQL install from backup :(

--don
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]
Don Bowman wrote:
From: [EMAIL PROTECTED]
From: Uwe Doering [mailto:[EMAIL PROTECTED]
...

Did you merge 1.3.2.3 as well? This actually should have been one MFC.

Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it; only 3 are left that are old enough that they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!). Thanks for the suggestion on the CAM timeout, I've set that value.

Another drive failed and the same thing happened. After the failure, the RAID worked in degraded mode just fine, but many files had been corrupted during the failure. So I would suggest that this merge did not help, and the CAM timeout did not help either. This is very frustrating; again I am rebuilding my PostgreSQL install from backup :(

This is indeed unfortunate. Maybe the problem is in fact located neither in PostgreSQL nor in FreeBSD but in the controller itself. Does it have the latest firmware? The necessary files should be available on Adaptec's website, and you can use the 'raidutil' program under FreeBSD to upload the firmware to the controller. I have to concede, however, that I never did this under FreeBSD myself. If I recall correctly I did the upload via a DOS diskette the last time. If this doesn't help either you could ask Adaptec's support for help. You need to register the controller first, if memory serves.

The latest firmware and BIOS are in the controller (upgraded the last time I had problems). I tried Adaptec support; the controller is registered.

The problem is definitely not in PostgreSQL. Files go missing in directories that are having new entries added (e.g. I lost a 'PG_VERSION' file). Data within the PostgreSQL files becomes corrupt.

Since the only application running is PostgreSQL, and it reads/writes/fsyncs the data, it's not unexpected that it's the one that reaps the 'rewards' of the failure. I have to believe this is either a bug in the controller, or a problem in cam or asr.

--don
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]
...

Did you merge 1.3.2.3 as well? This actually should have been one MFC.

Yes, merged from RELENG_4. I will post later if this happens again, but it will be quite a long time. The machine has 7 drives in it; only 3 are left that are old enough that they might fail before I take it out of service (it originally had 7 1999-era IBM drives, now it has 4 2004-era Seagate drives and 3 of the old IBMs. The drives have been in continuous service, so they've led a pretty good life!). Thanks for the suggestion on the CAM timeout, I've set that value.

--don
Adaptec 3210S, 4.9-STABLE, corruption when disk fails
I have a machine running:

$ uname -a
FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB i386

It has an Adaptec 3210S RAID controller running a single RAID-5, and runs PostgreSQL 7.4.6 as its primary application.

3 times now I have had a drive fail, and have had corrupted files in the PostgreSQL cluster @ the same time. The time is too closely correlated to be a coincidence. It passes fsck @ the time that I got to it a couple of hours later, and the filesystem seems to be ok (with a failed drive, the RAID in 'degrade' mode).

It appears that the drive failure and the PostgreSQL failure occur @ exactly the same time (monitoring with nagios, within 1hr accuracy). It would appear that for some file(s) bad data was returned.

Does anyone have any suggestions?

$ raidutil -L all
RAIDUTIL Version: 3.04 Date: 9/27/2000 FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04 Date: 9/27/2000 Adaptec FreeBSD SCSI Engine

# b0 b1 b2 Controller Cache FW NVRAM Serial Status
---
d0 -- -- ADAP3210S 16MB 370F ADPT 1.0 BF0A21700J7 Optimal

Physical View
Address   Type               Manufacturer/Model    Capacity  Status
---
d0b0t0d0  Disk Drive (DASD)  SEAGATE ST318453LW    17501MB   Optimal
d0b0t1d0  Disk Drive (DASD)  SEAGATE ST318453LW    17501MB   Optimal
d0b0t2d0  Disk Drive (DASD)  IBM DNES-318350W      17501MB   Optimal
d0b1t3d0  Disk Drive (DASD)  IBM DNES-318350W      17501MB   Optimal
d0b1t4d0  Disk Drive (DASD)  SEAGATE ST318452LW    17501MB   Optimal
d0b1t5d0  Disk Drive (DASD)  IBM DNES-318350W      17501MB   Optimal
RE: Adaptec 3210S, 4.9-STABLE, corruption when disk fails
From: Uwe Doering [mailto:[EMAIL PROTECTED]
Don Bowman wrote:

I have a machine running:

$ uname -a
FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 [EMAIL PROTECTED]:/usr/src/sys/compile/LABDB i386
...

I have merged asr.c from RELENG_4 to get this fix:

Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the following change wasn't included:
- Set the CAM status to CAM_SCSI_STATUS_ERROR rather than CAM_REQ_CMP in case of a CHECK CONDITION.

since I guess it's conceivable this could cause my problem.

--don
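To make the quoted commit message concrete, here is a minimal sketch of the behaviour it describes: a command that ends with CHECK CONDITION should be reported upward as an error, not as a clean completion. The enum and macro names below are local stand-ins, not the real CAM definitions, and this is not the asr.c code; only the SCSI status value 0x02 for CHECK CONDITION is taken from the SCSI standard.

```c
/* Local stand-ins for illustration; the real constants are in the CAM headers. */
enum sketch_cam_status {
	SK_CAM_REQ_CMP,			/* request completed without error */
	SK_CAM_SCSI_STATUS_ERROR	/* completed, but with a SCSI error status */
};

#define SK_SCSI_STATUS_OK		0x00
#define SK_SCSI_STATUS_CHECK_COND	0x02

/*
 * Map a SCSI status byte to the status reported to the upper layer.
 * Models only the single change named in the commit message; other
 * non-zero SCSI statuses are ignored here for brevity.
 */
static enum sketch_cam_status
sketch_cam_status_for(unsigned char scsi_status)
{
	if (scsi_status == SK_SCSI_STATUS_CHECK_COND)
		return SK_CAM_SCSI_STATUS_ERROR;
	return SK_CAM_REQ_CMP;
}
```

If a driver reported completion for a command that actually ended in CHECK CONDITION, the layers above could treat a failed transfer as good data, which is why the poster considered the mis-merge a plausible cause.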
RE: FreeBSD 4.9 / Supermicro 7043P-8R / Crashes After 2-5 Minutes Of Uptime
From: Doug White [mailto:[EMAIL PROTECTED]
On Fri, 2 Apr 2004, Gustafson, Tim wrote:

Hello

I have a brand spanking new Supermicro 7043P-8R server with dual Intel 3.2GHz Xeon processors and 4GB of Kingston memory. I installed FreeBSD 4.9 on the box and it gives me the following message on the screen about 2-5 minutes after it finished booting:

boot() called on CPU#0

This means the machine is trying to reboot for some reason. Typically, it's due to a panic. You should get a lot more output with a message and a traceback, or if you have ddb compiled in, a db prompt. If you aren't getting anything, try setting up a serial console and logging to another machine.

Is this a known issue with Supermicro motherboards? Does anyone have any suggestions as to a potential patch or other fix? I'm going to start doing the hardware swapping thing in a bit and see if that fixes anything, but I'd really like to hear back from anyone who has any experience with this issue.

Random panics are generally caused by bad memory, CPU cache, and other hardware issues.

FYI, I'm using the same system. I would guesstimate that you have a problem running out of RAM (sounds silly, but yes) due to the large amount of RAM you have. Here's what I have in mine, currently running 5.2:

kern.maxswzone=16777216
kern.vm.kmem.size=314572800
kern.ipc.nmbclusters=16384
kern.ipc.nmbufs=65536
kern.ipc.shm_use_phys=1
net.inet.tcp.tcbhashsize=16384
kern.ipc.maxsockets=32768
kern.maxfiles=34000

but I have run 4.7 and RELENG_4 on it.

Definitely I would suggest running memtest86 on it for a bit [at least 24 hours], but with the ECC, memory errors would have to be gross to be noticeable.
RE: kern/59719 Re: 4.9 Stable Crashes on SuperMicro with SMP
From: Uwe Doering [mailto:[EMAIL PROTECTED]
Jonathan Gilpin wrote:

I've run memtest (memtest86.com) kindly provided by Don and it passed all the tests. I've installed a kernel module to test for memory errors and found that again no memory errors are found... So this means it's either a problem with the CPUs or a genuine bug in the kernel. (bugger!)

No, that's unfortunately not what it means. If a memory test fails you can draw the conclusion that you have bad memory, but this doesn't work the other way round. If a memory test passes there is still a possibility that a memory chip is the culprit since memory test software cannot find all errors. Also, there is the chip set on the mainboard that coordinates bus access etc. for the two CPUs. Mainboard and chip set developers are known to make errors, too. In this case you would have to swap the entire mainboard, possibly with one from a different manufacturer. I can tell you from my own experience that it is really hard to find reliable PC hardware these days, in light of ever shorter and faster product release cycles.

I have several hundred of the motherboard the poster is using, and it works reliably with MP operation with 4.X. The memtest86 that I sent him understands the ECC registers on the E7501 MCH; it should find all correctable and uncorrectable errors.

--don
Using dwarf2 in kernel debug?
I'm evaluating an emulator for Xeon, but it only supports dwarf2, not stabs. Is there a simple means of changing the config for the system gcc, or is this a much more involved thing? This is for RELENG_4.

E.g., are loader, gdb -k, etc., all ok with seeing dwarf2 debugging info in /kernel and in modules?

--don
RE: HA guide/tools
From: Matt Douhan [mailto:[EMAIL PROTECTED]

Hello

I have been hunting for some kind of High Availability guide for FBSD but have not been successful in finding one. Are there any clustering solutions for FBSD STABLE (or CURRENT)?

Matt

See the 'HUT' project (www.bsdshell.net).
RE: support of SMBus on ICH3
From: Stijn Hoop [mailto:[EMAIL PROTECTED]
On Tue, Aug 12, 2003 at 03:46:34PM +0200, Igor Pokrovsky wrote:
On Tue, Aug 12, 2003 at 09:40:50AM -0400, Don Bowman wrote:
...

I'll go out on a limb here... your motherboard doesn't use SMBus for anything, so the BIOS disables it. Just a guess.

Once you enable it (you could make a quirk for it in pci to get it during boot), you can scan the bus to see, but I suspect you'll only see the host controller.

--don
RE: kernel deadlock
On Tue, 29 Jul 2003, Don Bowman wrote:
From: Don Bowman [mailto:[EMAIL PROTECTED]
From: Robert Watson [mailto:[EMAIL PROTECTED]
On Tue, 29 Jul 2003, Dave Dolson wrote:

To follow up, I've discovered that the system has exhausted its FFS node malloc type.
...

Some problems with this have turned up in -CURRENT on large-memory machines where some of the scaling factors have been off. In

We currently have kern.maxvnodes=70354 set (automatically scaled). This is a 1GB box. I will try re-running the test with less. When it hits kern.maxvnodes, what will it do?

After applying the fixes from RELENG_4 for kern/52425, I can still easily reproduce this hang without low memory.

Further debugging shows that the vnlru process is waiting on vlrup. This line is shown below. I.e., vnlru_nowhere is being incremented every 3 seconds.

So what is happening here is that vnlru wakes up, runs through, and there is nothing to free, so it goes back to sleep having freed nothing. The caller doesn't wake up. There are no vnodes to free, and everything in the system locks up.

One possible solution is to make vnlru more aggressive, so that before giving up, it tries to free pages that have many references etc. (which it currently skips). Another option is to have it simply bump the kern.maxvnodes number and wake up the process which called it.

Suggestions?

--don
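The second option in the message above (raise the limit when a reclaim pass frees nothing, then wake the waiter) can be illustrated with a toy model. Everything here is hypothetical: the struct, the field names, and the functions are not kernel code, and the real vnlru logic is far more involved. The sketch only shows why a pass that frees nothing leaves the waiter stuck unless something else changes.

```c
/* Toy model of a vnode pool; all names are made up for illustration. */
struct vn_pool {
	int maxvnodes;	/* current limit */
	int numvnodes;	/* vnodes in use */
	int freeable;	/* vnodes a reclaim pass is able to free */
};

/* One reclaim pass: free whatever is freeable, return how many were freed. */
static int
reclaim_pass(struct vn_pool *p)
{
	int n = p->freeable;

	p->numvnodes -= n;
	p->freeable = 0;
	return n;
}

/*
 * Reclaim, and if nothing could be freed, raise the limit by grow_by.
 * Returns 1 if a waiting allocator could now proceed, 0 otherwise.
 */
static int
reclaim_or_grow(struct vn_pool *p, int grow_by)
{
	if (reclaim_pass(p) == 0)
		p->maxvnodes += grow_by;
	return p->numvnodes < p->maxvnodes;
}
```

With `grow_by` set to 0 and nothing freeable, the function returns 0 on every call, which is the hang described: the reclaimer sleeps, the caller sleeps, and nothing makes progress.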
RE: kernel deadlock
From: The Hermit Hacker [mailto:[EMAIL PROTECTED]

One possible solution is to make vnlru more aggressive, so that before giving up, it tries to free pages that have many references etc. (which it currently skips). Another option is to have it simply bump the kern.maxvnodes number and wake up the process which called it. Suggestions?

Check out 4.8-STABLE, in which Tor.Egge(sp?) made modifications to the vnlru process that sound like exactly what you are proposing ...

Actually that makes the problem worse in another area, and doesn't fix this one. The 'fix' there is to do 10% of the nodes on a free operation, rather than 10 at a time. Now the system will hang up for longer when it's freeing them.

However, the root cause is still that it decides there are no freeable nodes in this case, so vnlru goes back to sleep having freed none, the caller stays asleep, and anyone else wanting a vnode goes to sleep too.

--don
RE: panic with fork?
From: Don Bowman

I find that if I run the below:

$ while true; do daemon -f bash -c exit; done
...

So the problem is in pmap_pinit(). It calls kmem_alloc_pageable() to obtain a page directory table. In this case, kmem_alloc_pageable() returns NULL, and the code doesn't check it.

Now, pmap_pinit() clearly doesn't expect to be able to fail; there's no provision for returning an error.

Does anyone have a suggestion for a fix?

--don

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-stable in the body of the message
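One generic shape for the fix the message asks about is to give the init routine a return value and propagate the allocation failure to its caller. The sketch below shows only that pattern in plain C. The struct, the function names, and the allocator callback are all hypothetical; this is not the pmap code and says nothing about how the real callers would have to change.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical allocator callback standing in for the kernel allocator. */
typedef void *(*page_alloc_fn)(size_t);

/* Hypothetical stand-in for the structure being initialised. */
struct sketch_pmap {
	void *pdir;	/* page directory, NULL until initialised */
};

/* Test allocators: one that always fails, one backed by a static buffer. */
static void *
sketch_alloc_fail(size_t size)
{
	(void)size;
	return NULL;
}

static void *
sketch_alloc_static(size_t size)
{
	static unsigned char page[4096];

	return size <= sizeof page ? page : NULL;
}

/*
 * Initialise pm, reporting allocation failure instead of using a NULL
 * pointer. Returns 0 on success or ENOMEM if the allocation failed.
 */
static int
sketch_pmap_init(struct sketch_pmap *pm, page_alloc_fn alloc, size_t size)
{
	void *p = alloc(size);

	if (p == NULL)
		return ENOMEM;
	pm->pdir = p;
	return 0;
}
```

The cost of this approach is the one the message points out: every caller has to be taught that initialisation can fail.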
RE: spontaneous reboot gcc
From: Kris Kennaway [mailto:[EMAIL PROTECTED]]
On Tue, Feb 18, 2003 at 11:59:48PM -0500, Darren Henderson wrote:

I'm seeing a spontaneous reboot in one very specific circumstance that appears to be attributable to the use of gcc with -O2 optimization.

Are you also compiling the kernel with -O2? If so, then don't do that. -O2 has had known serious bugs in the past.

-O2 -pipe -malign-loops=4 -malign-jumps=4 -malign-functions=4 -mcpu=i686 -march=i686 -fno-gcse

is what I'm using to good effect. I found one problem for sure with global common sub-expression elimination in the FreeBSD kernel w/ gcc 2.95. gcc 3.X is much more reliable, but is not the default compiler for 4.X.

The alignment used by default makes for fairly poor performance on Pentium 4 / Xeon architectures.

--don
bridging problem?
I have a setup with 3 PCs, all running 4.7. Two of them are connected through the 3rd with gigabit ethernet. The first two have bge interfaces, the 3rd has 1 bge and 1 em.

[CLIENT]--[BRIDGE]--[SERVER]
   bge    bge   em     bge

All seems good, ping both ways, no prob. Now I take 'iperf' and run 956Mbps (max) of UDP traffic through. Sometimes when I do this I find that the data stops flowing, and can only be restarted by setting net.link.ether.bridge=0 and then =1 again.

$ netstat -m
1089/1888/131072 mbufs in use (current/peak/max):
        1089 mbufs allocated to data
1088/1876/32768 mbuf clusters in use (current/peak/max)
4224 Kbytes allocated to network (4% of mb_map in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

$ netstat -s
...
-- Bridging statistics (bdg) --
Name In Out Forward Drop Bcast Mcast Local Unknown
em0:1 30090745 191 88462602 500 29206067
bge0:1 177 30090759 9206 680 11

It would appear that what goes in goes out (in em0, out bge0 in the above case).

I'm not sure exactly what's going on. The bridge machine has about 90% of CPU free according to top, and doesn't seem to be running out of buffers.

When the error case occurs, if I run 'tcpdump' on one side of the bridge I see the traffic that comes in, but on the other side I don't see it being forwarded. Doing an ifdown/ifup on both interfaces has no effect, but setting 'bridge=0' and then 'bridge=1' fixes it every time.

ipfw is not enabled for the bridge:

net.link.ether.bridge_cfg: em0 bge0
net.link.ether.bridge: 1
net.link.ether.bridge_ipfw: 0
net.link.ether.bridge_ipf: 0
net.link.ether.bridge_ipfw_drop: 0
net.link.ether.bridge_ipfw_collisions: 0

If I run the traffic through both ways (e.g. 956Mbps from left to right, and 956Mbps from right to left), then it doesn't seem to make any difference to how often the error occurs (but I use up more CPU, leaving about 78% free).

How does one debug the bridge?

--don ([EMAIL PROTECTED] www.sandvine.com)
RE: bge bug w/ out of bounds return receiver, staying in rxeof all the time, patch
From: Sam Leffler [mailto:[EMAIL PROTECTED]]

I would recommend a committer look this over and commit it. If you wish, I can make the patch *just* be the change (changing the 16-bit to 32-bit writes, without the VPD stuff), but the other changes seemed generally useful.

Please whittle the patch down to just the bug fix; 5.0 is in code freeze.

Sam

Sigh, I was afraid someone would say that. Will do. The patch is against RELENG_4, but is fairly trivial. It is below; just the bug fix is there (changing the writes to the receiver control block to be 32 bits all the time).

Patch follows:

Index: if_bge.c
===
RCS file: /cvs/src/sys/dev/bge/if_bge.c,v
retrieving revision 1.3.2.18
diff -U3 -r1.3.2.18 if_bge.c
--- if_bge.c	2 Nov 2002 18:22:23 - 1.3.2.18
+++ if_bge.c	22 Nov 2002 02:01:48 -
@@ -913,7 +913,7 @@
 {
 	int i;
 	struct bge_rcb *rcb;
-	struct bge_rcb_opaque *rcbo;
+	bge_max_len_flags len_flags;
 
 	for (i = 0; i < BGE_JUMBO_RX_RING_CNT; i++) {
 		if (bge_newbuf_jumbo(sc, i, NULL) == ENOBUFS)
@@ -923,9 +923,9 @@
 	sc->bge_jumbo = i - 1;
 
 	rcb = &sc->bge_rdata->bge_info.bge_jumbo_rx_rcb;
-	rcbo = (struct bge_rcb_opaque *)rcb;
-	rcb->bge_flags = 0;
-	CSR_WRITE_4(sc, BGE_RX_JUMBO_RCB_MAXLEN_FLAGS, rcbo->bge_reg2);
+	len_flags.bge_len_flags = rcb->bge_len_flags.bge_len_flags;
+	len_flags.s.bge_flags = 0;
+	CSR_WRITE_4(sc, BGE_RX_JUMBO_RCB_MAXLEN_FLAGS, len_flags.bge_len_flags);
 
 	CSR_WRITE_4(sc, BGE_MBX_RX_JUMBO_PROD_LO, sc->bge_jumbo);
 
@@ -1133,6 +1133,7 @@
 	struct bge_rcb *rcb;
 	struct bge_rcb_opaque *rcbo;
 	int i;
+	bge_max_len_flags len_flags;
 
 	/*
 	 * Initialize the memory window pointer register so that
@@ -1202,12 +1203,13 @@
 	rcb = &sc->bge_rdata->bge_info.bge_std_rx_rcb;
 	BGE_HOSTADDR(rcb->bge_hostaddr) =
 	    vtophys(&sc->bge_rdata->bge_rx_std_ring);
-	rcb->bge_max_len = BGE_MAX_FRAMELEN;
+	len_flags.s.bge_max_len = BGE_MAX_FRAMELEN;
+	len_flags.s.bge_flags = 0;
+	rcb->bge_len_flags.bge_len_flags = len_flags.bge_len_flags;
 	if (sc->bge_extram)
 		rcb->bge_nicaddr = BGE_EXT_STD_RX_RINGS;
 	else
 		rcb->bge_nicaddr = BGE_STD_RX_RINGS;
-	rcb->bge_flags = 0;
 	rcbo = (struct bge_rcb_opaque *)rcb;
 	CSR_WRITE_4(sc, BGE_RX_STD_RCB_HADDR_HI, rcbo->bge_reg0);
 	CSR_WRITE_4(sc, BGE_RX_STD_RCB_HADDR_LO, rcbo->bge_reg1);
@@ -1224,12 +1226,13 @@
 	rcb = &sc->bge_rdata->bge_info.bge_jumbo_rx_rcb;
 	BGE_HOSTADDR(rcb->bge_hostaddr) =
 	    vtophys(&sc->bge_rdata->bge_rx_jumbo_ring);
-	rcb->bge_max_len = BGE_MAX_FRAMELEN;
+	len_flags.s.bge_max_len = BGE_MAX_FRAMELEN;
+	len_flags.s.bge_flags = BGE_RCB_FLAG_RING_DISABLED;
+	rcb->bge_len_flags.bge_len_flags = len_flags.bge_len_flags;
 	if (sc->bge_extram)
 		rcb->bge_nicaddr = BGE_EXT_JUMBO_RX_RINGS;
 	else
 		rcb->bge_nicaddr = BGE_JUMBO_RX_RINGS;
-	rcb->bge_flags = BGE_RCB_FLAG_RING_DISABLED;
 	rcbo = (struct bge_rcb_opaque *)rcb;
 	CSR_WRITE_4(sc, BGE_RX_JUMBO_RCB_HADDR_HI, rcbo->bge_reg0);
@@ -1239,7 +1242,9 @@
 
 	/* Set up dummy disabled mini ring RCB */
 	rcb = &sc->bge_rdata->bge_info.bge_mini_rx_rcb;
-	rcb->bge_flags = BGE_RCB_FLAG_RING_DISABLED;
+	len_flags.s.bge_max_len = 0;
+	len_flags.s.bge_flags = BGE_RCB_FLAG_RING_DISABLED;
+	rcb->bge_len_flags.bge_len_flags = len_flags.bge_len_flags;
 	rcbo = (struct bge_rcb_opaque *)rcb;
 	CSR_WRITE_4(sc, BGE_RX_MINI_RCB_MAXLEN_FLAGS, rcbo->bge_reg2);
 
@@ -1259,8 +1264,9 @@
 	rcb = (struct bge_rcb *)(sc->bge_vhandle + BGE_MEMWIN_START +
 	    BGE_SEND_RING_RCB);
 	for (i = 0; i < BGE_TX_RINGS_EXTSSRAM_MAX; i++) {
-		rcb->bge_flags = BGE_RCB_FLAG_RING_DISABLED;
-		rcb->bge_max_len = 0;
+		len_flags.s.bge_max_len = 0;
+		len_flags.s.bge_flags = BGE_RCB_FLAG_RING_DISABLED;
+		rcb->bge_len_flags.bge_len_flags = len_flags.bge_len_flags;
 		rcb->bge_nicaddr = 0;
 		rcb++;
 	}
@@ -1272,17 +1278,20 @@
 	BGE_HOSTADDR(rcb->bge_hostaddr) =
 	    vtophys(&sc->bge_rdata->bge_tx_ring);
 	rcb->bge_nicaddr = BGE_NIC_TXRING_ADDR(0, BGE_TX_RING_CNT);
-	rcb->bge_max_len = BGE_TX_RING_CNT;
-	rcb->bge_flags = 0;
+	len_flags.s.bge_max_len = BGE_TX_RING_CNT;
+	len_flags.s.bge_flags = 0;
+	rcb->bge_len_flags.bge_len_flags = len_flags.bge_len_flags;
 
 	/* Disable all unused RX return rings */
 	rcb = (struct bge_rcb *)(sc->bge_vhandle + BGE_MEMWIN_START +
 	    BGE_RX_RETURN_RING_RCB);
-	for (i = 0; i < BGE_RX_RINGS_MAX; i++) {
+	rcb++;
+	for (i = 1; i < BGE_RX_RINGS_MAX; i++) {
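The idea behind the patch in this message is that the ring length and the flags share one 32-bit word in the ring control block, so they should be composed in a local variable and stored with a single 32-bit write, not with two 16-bit stores into the chip's memory window. The sketch below shows that composition with shifts. The layout (length in the upper 16 bits, flags in the lower 16) and the flag value are assumptions for illustration, and the names are hypothetical; the patch itself uses a `bge_max_len_flags` union whose definition is not shown in the message.

```c
#include <stdint.h>

/* Assumed flag value, for illustration only. */
#define SK_RCB_FLAG_RING_DISABLED 0x0002

/*
 * Combine max_len and flags into one 32-bit word so the caller can
 * issue a single 32-bit register write.
 * Assumption: max_len occupies bits 31..16 and flags bits 15..0.
 */
static uint32_t
sk_rcb_maxlen_flags(uint16_t max_len, uint16_t flags)
{
	return ((uint32_t)max_len << 16) | flags;
}
```

A disabled ring would then be written as `sk_rcb_maxlen_flags(0, SK_RCB_FLAG_RING_DISABLED)`, and an enabled 1024-entry ring as `sk_rcb_maxlen_flags(1024, 0)`. Composing with shifts sidesteps the byte-order question that a struct-in-union layout would raise.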
bge bug w/ out of bounds return receiver, staying in rxeof all the time, patch
(apologies if you got this more than once, but after 6 hours it hadn't shown up on the mailing list)

There is a bug in the STABLE (and CURRENT) if_bge which causes the driver to loop forever in interrupt context (in bge_rxeof()).

This is caused by the return ring length being 1024 in the driver, and erroneously decided to be 2048 in the chip, which causes it to return an index off the end of the ring.

You will know you are running into this if your kernel locks up, ^T still works, and the debugger shows you in bge_rxeof() or a routine called from it.

This situation can occur regardless of traffic. It seems to either work or not work from the get-go, so if you are going to run into it, it will be boolean from the machine startup.

The patch attached solves this problem by changing the 16-bit writes into the chip's memory window to 32-bit writes. The patch also enables the PCI-VPD (see PCI 2.2) output (to help diagnose which version of the chip you have, whose board, how fast the PCI clock is, etc.).

I would recommend a committer look this over and commit it. If you wish, I can make the patch *just* be the change (changing the 16-bit to 32-bit writes, without the VPD stuff), but the other changes seemed generally useful.

Index: if_bge.c
===
RCS file: /cvs/src/sys/dev/bge/if_bge.c,v
retrieving revision 1.3.2.18
diff -U3 -r1.3.2.18 if_bge.c
--- if_bge.c	2 Nov 2002 18:22:23 - 1.3.2.18
+++ if_bge.c	21 Nov 2002 20:13:23 -
@@ -114,6 +114,7 @@
 #include <dev/bge/if_bgereg.h>
 
 #define BGE_CSUM_FEATURES (CSUM_IP | CSUM_TCP | CSUM_UDP)
+#define BGE_VPD
 
 /* controller miibus0 required. See GENERIC if you get errors here. */
 #include "miibus_if.h"
@@ -178,6 +179,7 @@
 static u_int8_t bge_eeprom_getbyte __P((struct bge_softc *,
 	int, u_int8_t *));
 static int bge_read_eeprom __P((struct bge_softc *, caddr_t, int, int));
+static void dump_manufacturing_information __P((struct bge_softc *));
 
 static u_int32_t bge_crc __P((caddr_t));
 static void bge_setmulti __P((struct bge_softc *));
@@ -200,11 +202,12 @@
 static int bge_chipinit __P((struct bge_softc *));
 static int bge_blockinit __P((struct bge_softc *));
 
-#ifdef notdef
+#ifdef BGE_VPD
+static void bge_vpd_crack __P((struct bge_softc *sc));
 static u_int8_t bge_vpd_readbyte __P((struct bge_softc *, int));
 static void bge_vpd_read_res __P((struct bge_softc *,
 	struct vpd_res *, int));
-static void bge_vpd_read __P((struct bge_softc *));
+static void bge_vpd_read __P((struct bge_softc *, const char *));
 #endif
 
 static u_int32_t bge_readmem_ind
@@ -311,7 +314,7 @@
 	return;
 }
 
-#ifdef notdef
+#ifdef BGE_VPD
 static u_int8_t
 bge_vpd_readbyte(sc, addr)
 	struct bge_softc *sc;
@@ -355,9 +358,54 @@
 	return;
 }
 
+/*
+ * Take the read-only (VPD-R) info and crack it into the other fields
+*/
+static void
+bge_vpd_crack(sc)
+	struct bge_softc *sc;
+{
+	int pos = 0;
+	int len = strlen(sc->bge_vpd_readonly);
+	sc->bge_vpd_pn = "unknown";
+	sc->bge_vpd_ec = "unknown";
+	sc->bge_vpd_mn = "unknown";
+	sc->bge_vpd_sn = "unknown";
+	sc->bge_vpd_rv = "unknown";
+	while (pos < len) {
+		if (!strncmp(sc->bge_vpd_readonly+pos, VPD_PN, 2)) {
+			sc->bge_vpd_pn = (sc->bge_vpd_readonly+pos+3);
+		} else if (!strncmp(sc->bge_vpd_readonly+pos, VPD_EC, 2)) {
+			sc->bge_vpd_ec = (sc->bge_vpd_readonly+pos+3);
+		} else if (!strncmp(sc->bge_vpd_readonly+pos, VPD_MN, 2)) {
+			sc->bge_vpd_mn = (sc->bge_vpd_readonly+pos+3);
+		} else if (!strncmp(sc->bge_vpd_readonly+pos, VPD_SN, 2)) {
+			sc->bge_vpd_sn = (sc->bge_vpd_readonly+pos+3);
+		} else if (!strncmp(sc->bge_vpd_readonly+pos, VPD_RV, 2)) {
+			sc->bge_vpd_rv = (sc->bge_vpd_readonly+pos+3);
+		}
+		sc->bge_vpd_readonly[pos] = '\0';
+		pos += 2;
+		pos += sc->bge_vpd_readonly[pos];
+		pos++;
+	}
+	pos = 0;
+	len = strlen(sc->bge_vpd_readwrite);
+	while (pos < len) {
+		if (!strncmp(sc->bge_vpd_readwrite+pos, VPD_YA, 2)) {
+			sc->bge_vpd_asset_tag = (sc->bge_vpd_readwrite+pos+3);
+		}
+		sc->bge_vpd_readwrite[pos] = '\0';
+		pos += 2;
+		pos += sc->bge_vpd_readwrite[pos];
+		pos++;
+	}
+}
+
 static void
-bge_vpd_read(sc)
+bge_vpd_read(sc, defname)
 	struct bge_softc *sc;
+	const char *defname;
 {
 	int pos = 0, i;
 	struct vpd_res res;
@@ -366,14 +414,20 @@
 		free(sc->bge_vpd_prodname, M_DEVBUF);
 	if (sc->bge_vpd_readonly != NULL)
bug in bge driver with ENOBUFS on 4.7
In bge_rxeof(), there can end up being a condition which causes the driver to endlessly interrupt.

if (bge_newbuf_std(sc, sc->bge_std, NULL) == ENOBUFS) {
	ifp->if_ierrors++;
	bge_newbuf_std(sc, sc->bge_std, m);
	continue;
}

happens. Now, bge_newbuf_std returns ENOBUFS. 'm' is also NULL. This causes the received packet to not be dequeued, and the driver will then go straight back into interrupt as the chip will reassert the interrupt as soon as we return.

Suggestions on a fix?

I'm not sure why I ran out of mbufs. I have

kern.ipc.nmbclusters: 9
kern.ipc.nmbufs: 28

(kgdb) p/x mbstat
$11 = {m_mbufs = 0x3a0, m_clusters = 0x39c, m_spare = 0x0, m_clfree = 0x212, m_drops = 0x0, m_wait = 0x0, m_drain = 0x0, m_mcfail = 0x0, m_mpfail = 0x0, m_msize = 0x100, m_mclbytes = 0x800, m_minclsize = 0xd5, m_mlen = 0xec, m_mhlen = 0xd4}

but bge_newbuf_std() does this:

if (m == NULL) {
	MGETHDR(m_new, M_DONTWAIT, MT_DATA);
	if (m_new == NULL) {
		return(ENOBUFS);
	}

and then returns ENOBUFS.

This is with 4.7-RELEASE.

--don ([EMAIL PROTECTED] www.sandvine.com)
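The failure mode in this message is a refill path that assumes an old buffer is always available to recycle when a fresh allocation fails. As a toy model of a refill step that makes the "neither fresh nor old" case explicit, so the caller can at least detect it and advance past the slot, consider the sketch below. All names are hypothetical and none of this is driver code; it is not a proposed fix for if_bge.

```c
#include <stddef.h>

/* Hypothetical buffer type and allocator callback. */
struct sk_buf {
	int id;
};

typedef struct sk_buf *(*sk_alloc_fn)(void);

/* Test allocators: one that always fails, one that hands out a static buffer. */
static struct sk_buf *
sk_alloc_none(void)
{
	return NULL;
}

static struct sk_buf *
sk_alloc_static(void)
{
	static struct sk_buf fresh = { 1 };

	return &fresh;
}

/*
 * Refill one ring slot. Prefer a fresh buffer; if allocation fails,
 * recycle the old one. Returns 0 if the slot was filled, -1 if neither
 * a fresh nor an old buffer was available.
 */
static int
sk_refill_slot(struct sk_buf **slot, struct sk_buf *old, sk_alloc_fn alloc)
{
	struct sk_buf *fresh = alloc();

	if (fresh != NULL) {
		*slot = fresh;
		return 0;
	}
	if (old != NULL) {
		*slot = old;
		return 0;
	}
	*slot = NULL;
	return -1;
}
```

The -1 return corresponds to the situation described above, where the first call returns ENOBUFS and 'm' is also NULL, so the second call cannot help either.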
RE: new intel chipset and SMP Xeons
-Original Message-
From: Dmitry Valdov [mailto:dv;dv.ru]
Sent: October 11, 2002 05:14
To: [EMAIL PROTECTED]
Subject: new intel chipset and SMP Xeons

Hi! Is there a plan to support the Intel E7500 chipset in -stable? Without this FreeBSD can't use the 2nd CPU.

I'm using FreeBSD 4.6 on an E7500 platform (a Supermicro P4DPR, http://www.supermicro.com/PRODUCT/MotherBoards/E7500/P4DPR.htm) with dual Xeon. This is working ok for me; I see all '4' processors (2 per Xeon because of the simultaneous multi-threading).
RE: number mbufs / cluster
Andrew Gallatin wrote:
...
I tried changing MCLSHIFT to 10 (which would yield a 1K cluster). Although this would seem like the right thing to do, some trouble ensues: some things, like NFS, don't seem to work right (only UDP traffic, as far as I could see). Are there any assumptions somewhere that a cluster is >= MTU size?

Possibly. What nic driver are you using?

bge on a Broadcom BCM5701. I may also end up using an 'em'.

I've made another discovery (I think): the amount of memory used is dramatically different when ipfw is used. Under 'normal' use for my application, I would have:

# ipfw list
00100 allow ip from any to any via lo0
00105 fwd 127.0.0.1,9000 tcp from any to any 5000 recv bge0
00200 deny ip from any to 127.0.0.0/8
00300 deny ip from 127.0.0.0/8 to any
65000 allow ip from any to any
65535 allow ip from any to any

For test purposes, I've made the device under test (DUT) the default router for the test generators. They address someone else on TCP port 5000, and the traffic arrives at the DUT, which then hands it to an application on local port 9000. If I change the test to address the DUT directly, I get 10x less memory used in clusters.

E.g., the DUT has IP aliases 1.0.0.1, 1.0.2.1, 1.0.4.1, ... The generators have 1.0.0.2-1.0.1.255 with default route 1.0.0.1, 1.0.2.2-1.0.3.255 with default route 1.0.2.1, etc. The first generator, from a source address in the 1.0.0.2/23 range, then addresses IPs in the 1.0.2.2/23 range, which ends up hitting the DUT and being evaluated by rule 105 above.

This is just how the test is set up; in the actual application I would be using policy-based routing or content switching on e.g. Cisco/Nortel/... to drop the data into my bge0 interface.

Thanks for the tip on nmbufs = 2 * nmbclusters; I was setting them approximately equal.

--don
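The arithmetic behind the MCLSHIFT question above can be sketched in a few lines of C. The cluster size is 1 << MCLSHIFT, so a shift of 10 gives 1K clusters and the usual 11 gives 2K. The helper names below are made up for illustration, and the 1500-byte figure in the usage note is an assumption (the standard Ethernet MTU), not something stated in the thread.

```c
#include <stddef.h>

/* Cluster size in bytes for a given MCLSHIFT (MCLBYTES is 1 << MCLSHIFT). */
size_t cluster_bytes(unsigned mclshift)
{
    return (size_t)1 << mclshift;
}

/* Number of clusters needed to hold one frame of 'len' bytes. */
size_t clusters_per_frame(size_t len, unsigned mclshift)
{
    size_t cb = cluster_bytes(mclshift);

    return (len + cb - 1) / cb;   /* round up */
}

/* The rule of thumb quoted in the thread: nmbufs = 2 * nmbclusters. */
unsigned long suggested_nmbufs(unsigned long nmbclusters)
{
    return 2UL * nmbclusters;
}
```

With MCLSHIFT at 11, a 1500-byte frame fits in a single cluster; at 10, the same frame needs two. Any code that assumes a full frame always fits in one cluster would break at that point, which is one plausible reading of the NFS-over-UDP trouble, though the thread does not confirm it.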
panic in syncache/rtfree with 4.6
Has anyone else seen anything like this...

panic(c0334f20,c392e400,ff807f44,c021fd76,c392e400) at panic+0xa4
rtfree(c392e400) at rtfree+0x27
syncache_free(debe8fc0,6200,debe8fc0,1,debe8660) at syncache_free+0x56
syncache_drop(debe8fc0,0,1,c0220118,4000) at syncache_drop+0xd8
syncache_timer(1,4000,0,0,) at syncache_timer+0xa8
softclock(0,ff800018,10,c0390010,) at softclock+0xd1
doreti_swi() at doreti_swi+0xf

The system is running 4.6. Immediately preceding this there were messages saying 'All mbuf clusters exhausted, please see tuning(7).'. Does it seem reasonable to get a panic in this circumstance?

The system is set up with 1GB of memory and 2x 2GHz Xeon processors. I'm using ipfw with a 'fwd' rule to attract non-local traffic, like this:

00100 allow ip from any to any via lo0
00101 allow ip from me to any
00102 allow ip from any to me
00105 fwd 127.0.0.1,8000 tcp from any to any 5000
00200 deny ip from any to 127.0.0.0/8
00300 deny ip from 127.0.0.0/8 to any
65000 allow ip from any to any
65535 allow ip from any to any

I have a process listening on local TCP port 8000 which uses kqueue/kevent with a single thread of execution. I had:

# sysctl -a |grep nmb
kern.ipc.nmbclusters: 6656
kern.ipc.nmbufs: 26624

(which I'm in the process of increasing :)

I'm expecting to keep ~35K TCP sessions open on this device. Is syncache appropriate, or should I disable it and use an external device to protect against DoS?
net.inet.ip.fw.dyn_syn_lifetime: 20
net.inet.tcp.syncookies: 1
net.inet.tcp.syncache.bucketlimit: 30
net.inet.tcp.syncache.cachelimit: 15359
net.inet.tcp.syncache.count: 0
net.inet.tcp.syncache.hashsize: 512
net.inet.tcp.syncache.rexmtlimit: 3

post-restart I see:

vm.zone:
ITEM         SIZE     LIMIT   USED     FREE  REQUESTS
PIPE:         160,        0,     2,     100,       46
SWAPMETA:     160,   256702,     0,       0,        0
unpcb:        160,        0,     5,      45,       61
ripcb:        192,    12328,     0,      21,        1
divcb:        192,    12328,     0,       0,        0
syncache:     160,    15359,     0,       0,        0
tcpcb:        544,    12328,     2,      13,        2
udpcb:        192,    12328,     5,      37,       47
socket:       192,    12328,    12,      30,      116
KNOTE:         64,        0,     0,     128,       10
DIRHASH:     1024,        0,    20,       4,       29
NFSNODE:      352,        0,     3,      19,        3
NFSMOUNT:     544,        0,     3,      11,        3
VNODE:        192,        0,   895,      59,      895
NAMEI:       1024,        0,     0,      16,     3309
VMSPACE:      192,        0,    21,      43,      164
PROC:         416,        0,    26,      23,      169
DP fakepg:     64,        0,     0,       0,        0
PV ENTRY:      28,  1756814,  7080,  254927,    43688
MAP ENTRY:     48,        0,   301,     167,    10625
KMAP ENTRY:    48,    64303,    84,     129,      709
MAP:          108,        0,     7,       3,        7
VM OBJECT:     96,        0,   407,      59,     2747

--don
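To put the nmbclusters and nmbufs settings quoted in this message in perspective, here is a rough sizing sketch. The function names are hypothetical. The 256-byte mbuf and 2048-byte cluster sizes are assumptions borrowed from the mbstat dump in the earlier bge thread (m_msize = 0x100, m_mclbytes = 0x800), and "one cluster per session" is purely an illustrative worst-case guess, not a measured figure.

```c
#include <stddef.h>

/* Bytes consumed if every mbuf and every cluster were in use at once. */
unsigned long pool_bytes(unsigned long nmbufs, unsigned long msize,
                         unsigned long nmbclusters, unsigned long mclbytes)
{
    return nmbufs * msize + nmbclusters * mclbytes;
}

/* Clusters needed if each session holds 'per_session' clusters at once. */
unsigned long clusters_needed(unsigned long sessions,
                              unsigned long per_session)
{
    return sessions * per_session;
}
```

Under those assumptions, `pool_bytes(26624, 256, 6656, 2048)` comes to 20,447,232 bytes, roughly 20 MB of the machine's 1GB. `clusters_needed(35000, 1)` is 35,000, several times the 6656 clusters configured, so exhausting the cluster pool at the target session count would not be surprising.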
bge0 system hang when PCI-X enabled
I've noticed that the 5700 and 5701 cause a system crash in our new Supermicro dual Xeon systems (P4DPR+) unless PCI-X is disabled (motherboard jumper). The 5700 (or 5701) is the only card on the bus. This is under FreeBSD 4.6 prerelease (and 4.5). 'bge0' is the 5701 in the map below. Is this expected behaviour? It happens when there's a bit of traffic (e.g. an 'ls -lR' of an NFS-mounted dir).

$ sudo pciconf -l
chip0@pci0:0:0: class=0x06 card=0x358015d9 chip=0x25408086 rev=0x02 hdr=0x00
none0@pci0:0:1: class=0xff card=0x358015d9 chip=0x25418086 rev=0x02 hdr=0x00
pcib1@pci0:2:0: class=0x060400 card=0x chip=0x25438086 rev=0x02 hdr=0x01
uhci0@pci0:29:0: class=0x0c0300 card=0x358015d9 chip=0x24828086 rev=0x02 hdr=0x00
uhci1@pci0:29:1: class=0x0c0300 card=0x358015d9 chip=0x24848086 rev=0x02 hdr=0x00
uhci2@pci0:29:2: class=0x0c0300 card=0x358015d9 chip=0x24878086 rev=0x02 hdr=0x00
pcib4@pci0:30:0: class=0x060400 card=0x chip=0x244e8086 rev=0x42 hdr=0x01
isab0@pci0:31:0: class=0x060100 card=0x chip=0x24808086 rev=0x02 hdr=0x00
atapci0@pci0:31:1: class=0x01018a card=0x358015d9 chip=0x248b8086 rev=0x02 hdr=0x00
none1@pci0:31:3: class=0x0c0500 card=0x358015d9 chip=0x24838086 rev=0x02 hdr=0x00
none2@pci1:28:0: class=0x080020 card=0x358015d9 chip=0x14618086 rev=0x03 hdr=0x00
pcib2@pci1:29:0: class=0x060400 card=0x0050 chip=0x14608086 rev=0x03 hdr=0x01
none3@pci1:30:0: class=0x080020 card=0x358015d9 chip=0x14618086 rev=0x03 hdr=0x00
pcib3@pci1:31:0: class=0x060400 card=0x0050 chip=0x14608086 rev=0x03 hdr=0x01
bge0@pci2:1:0: class=0x02 card=0x100610b7 chip=0x164514e4 rev=0x15 hdr=0x00
ahc0@pci3:2:0: class=0x01 card=0x900515d9 chip=0x00cf9005 rev=0x01 hdr=0x00
ahc1@pci3:2:1: class=0x01 card=0x900515d9 chip=0x00cf9005 rev=0x01 hdr=0x00
em0@pci3:4:0: class=0x02 card=0x100d8086 chip=0x100d8086 rev=0x02 hdr=0x00
none4@pci4:1:0: class=0x03 card=0x00081002 chip=0x47521002 rev=0x27 hdr=0x00
fxp0@pci4:2:0: class=0x02 card=0x10508086 chip=0x12298086 rev=0x0d hdr=0x00
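For readers matching the pciconf listing above to hardware: the `chip=` field packs the PCI device ID into the upper 16 bits and the vendor ID into the lower 16 bits. A minimal sketch of the decoding follows; the helper names are made up for illustration.

```c
#include <stdint.h>

/* Low 16 bits of pciconf's chip= value: the PCI vendor ID. */
uint16_t pci_vendor(uint32_t chip)
{
    return (uint16_t)(chip & 0xffffu);
}

/* High 16 bits of pciconf's chip= value: the PCI device ID. */
uint16_t pci_device(uint32_t chip)
{
    return (uint16_t)(chip >> 16);
}
```

For bge0, chip=0x164514e4 splits into vendor 0x14e4 (Broadcom) and device 0x1645, which is consistent with the BCM5701 named in the message. Entries ending in 8086 are Intel parts.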
gprof pthreads, crash (null pointer)
Can anyone tell me if gprof (i.e. compiling with -pg) is supposed to work with pthreads? I'm finding I get a null pointer from pthread_create(). Actually, the program counter becomes a null pointer, as in:

#0 0x in ?? ()
#1 0x08052cd3 in osiThread::Start( ... )

I'm using gcc30 from the ports, specifically gcc30-3.0.4, on FreeBSD 4.6-RC #2: Wed May 22 19:30:53 EDT 2002.

$ gcc30 -v
Reading specs from /usr/local/lib/gcc-lib/i386-portbld-freebsd5.0/3.0.4/specs
Configured with: ./..//gcc-3.0.4/configure --disable-nls --with-gnu-as --with-gnu-ld --with-gxx-include-dir=/usr/local/lib/gcc-lib/i386-portbld-freebsd5.0/3.0.4/include/g++ --disable-shared --prefix=/usr/local i386-portbld-freebsd5.0
Thread model: posix
gcc version 3.0.4

I'm compiling everything with -pg.