I've confirmed that neither problem exists in 4.0. There are ample
workarounds, both hardware and software, including just not using 3.3.
No fixes, though, just workarounds... Workarounds for the NCR/FXP
issue included:
1) Using 2.2.8 (4.0 isn't a production option).
2) Using a different NIC (a Tulip worked fine).
3) Using a different SCSI adapter (Adaptec, as Matt suggested, works fine).
4) Using a different SCSI driver (Peter managed to get a driver from 4.0
hooked up under 3.3, and it survived two days of torture that would
have toasted things within an hour using the stock driver; you'll have
to ask him for details).
Workarounds for the pagedaemon issue included:
1) Using 2.2.8 (4.0, too, but not as a production option)
(do I see a pattern?)
2) Using read()/write() instead of mmap() for certain file updates in
our application. In this case read()/write() performed better anyhow.
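For illustration only, the read()/write() style of update in item 2 might
look roughly like the minimal sketch below. The helper name update_record
and its arguments are hypothetical and are not taken from the application;
the point is just that the update goes through lseek() and write() instead
of a store into an mmap()ed region.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption, not the application's code):
 * overwrite len bytes at byte offset off in the file open on fd,
 * using lseek() + write() rather than storing through an mmap()ed
 * region.  Returns 0 on success, -1 on error with errno set.
 */
int
update_record(int fd, off_t off, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL);

	if (lseek(fd, off, SEEK_SET) == (off_t)-1)
		return -1;

	/* write() may be interrupted or may write fewer bytes than asked. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```

A caller would open the file with O_RDWR and call update_record(fd, offset,
data, size) wherever it previously modified the mapped pages.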
So the two issues I described are no longer "active" for the purposes
of my project. I posted because I feared that what I saw as the main
issue--that 3.3 is regarded in some circles as not being up to FreeBSD
standards--was getting lost in various unseemly side-issues. It could
be that I was just plain unlucky, but my experiences suggest that there
may be some merit to that view. You be the judge.
I've been with BSD a long time--from back when my email address was
decvax!randvax!edhall. I want it to succeed, for reasons that are more
emotional than rational; my nightmare was having to say that my project
(1) worked on Solaris, (2) worked on Linux, but (3) broke FreeBSD.
I'd be a pretty poor engineer to play favorites when the facts point
in another direction. Fortunately, we were able to discover a more
favorable set of facts. This time.
-Ed
: Matthew Dillon <[EMAIL PROTECTED]> wrote:
: :You write:
: :: we can not identify the specific problem from this message.
: :: without sufficient information to identify and hopefully reproduce
: :: the problem, we can not address it. please provide this information
: :: if it is available to you. if it is not, please provide us contact
: :: information for the commercial entities experiencing the problem.
: :
: :I work at Yahoo. My address there is "[EMAIL PROTECTED]".
: :
: :On a recent project I encountered two show-stopping bugs with 3.3-release
: :that did not exist in 2.2.8-release:
: :
: :1) Random crashes in FXP interrupt or low-level IP code. Something is
: : clobbering the kernel stack--possibly the NCR driver, since using an
: : Adaptec made the problem stop, as did a backport of the CAM driver
: : Peter Wemm tried. This was on an N440BX, which is becoming quite
: : common in server applications. Other installations are apparently
: : seeing the same problem on this hardware.
: :
: :2) A hard loop in the pagedaemon. This was especially egregious, since
: : it meant the system had to be rebooted from the console--and since
: : the application could elicit the problem within a few minutes.
: : Disabling the use of mmap() for file update in the application
: : prevented the problem. After spending a day trying to cook up a
: : test program that elicited the same behavior that the application
: : did, I gave up for lack of time. But there have been other reports
: : of late that sound like this problem, mostly in high VM/RAM situations.
: :
: :That's two serious bugs that exist in 3.3-release but not in 2.2.8-release.
: :Looking back through the archives, I can see that I'm not the only one who
: :has experienced them. I came away from the experience with the feeling that
: :the FreeBSD project has some serious Q/A problems... and I can assure you,
: :I'm not alone in this feeling.
: :
: : -Ed
:
: Well, #2 at least should be fixed in -current. Unfortunately the
: changes to the VM system were too extensive to backport to 3.x. Or,
: I should say that, at the time I started working on the VM system, core
: was not interested in allowing me to backport the changes, and then later
: it was simply too late - too many changes had been made.
:
: #1 has come up a couple of times. There was a conversation in October
: that closely relates to your problem:
:
: :From: Joe McGuckin <[EMAIL PROTECTED]>
: :Subject: fxp related kernel panic
: :
: :I have a 3.3-stable machine that I use as a news router (running diablo). The
: :fxp0 interface averages 10-15 Mbps bandwidth continuously.
: :
: :About once a week the machine crashes & reboots. We enabled the debugger
: :this time and captured the following debug output:
: :
: :Fatal trap 12: page fault while in kernel mode
: :fault virtual address = 0x382e4641
: :fault code = supervisor write, page not present
: :instruction pointer = 0x8:0xc01a372e
: :stack pointer = 0x10:0xc02523b0
: :frame pointer = 0x10:0xc02523c0
: :code segment = base 0x0, limit 0xfffff, type 0x1b
: : = DPL 0, pres 1, def32 1, gran 1
: :processor eflags = interrupt enabled, resume, IOPL = 0
: :current process = Idle
: :interrupt mask = net
: :kernel: type 12 trap, code=0
: :Stopped at fxp_add_rfabuf+0x1de: movw %ax,0x4(%esi)
: :db>
: :
: :%uname -a
: :FreeBSD feeder.via.net 3.3-STABLE FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT
: : 1999 [EMAIL PROTECTED]:/usr/src/sys/compile/DIABLO i386
: :
: :%dmesg
: :Copyright (c) 1992-1999 FreeBSD Inc.
: :Copyright (c) 1982, 1986, 1989, 1991, 1993
: : The Regents of the University of California. All rights reserved.
: :FreeBSD 3.3-STABLE #7: Mon Oct 18 17:14:40 PDT 1999
:
: To which DG responded:
:
: :From: David Greenman <[EMAIL PROTECTED]>
: :Subject: Re: fxp related kernel panic
: :To: Joe McGuckin <[EMAIL PROTECTED]>
: :Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
: :Date: Tue, 26 Oct 1999 11:43:02 -0700
: :
: :
: : Let me guess...your system has an Intel N440BX motherboard, right? If so,
: :then it's a known problem with no solution yet.
: :
: :-DG
: :
: :David Greenman
: :Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
: :Creator of high-performance Internet servers - http://www.terasolutions.com
: :Pave the road of life with opportunities.
:
: And he also said:
:
: :From: David Greenman <[EMAIL PROTECTED]>
: :Subject: Re: fxp related kernel panic
: :To: Lew Payne <[EMAIL PROTECTED]>
: :Cc: [EMAIL PROTECTED], Joe McGuckin <[EMAIL PROTECTED]>
: :Date: Tue, 26 Oct 1999 13:19:45 -0700
: :
: :
: :>Hi David -- What if I install a *real* EtherExpress Pro-100B (or
: :>whatever it's known as today) in the PCI slot, and use it instead
: :>of the on-board (N440BX motherboard) fxp0 interface?
: :>
: :>Judging that you probably know the nature of the problem, do you
: :>think this might circumvent it?
: :
: : I think it is caused by the NCR/Symbios controller. It might be a side
: :effect of the NCR just using up a lot of PCI bandwidth, with the real bug
: :being in the fxp driver (although I've looked and haven't found one). So
: :I don't think putting in a real Pro/100 will have any effect on the problem.
: :Of course I don't really know what is causing it, so just about anything
: :is possible.
: :
: :-DG
: :
: :David Greenman
:
: And that, I'm afraid, is where it has been left. Nobody is sure where
: the problem is. I suspect that it may be a DMA synchronization problem
: with either the NCR or the FXP driver, or perhaps heavy PCI bandwidth
: usage is generating a FIFO overrun error during the FXP DMA that the
: driver is not handling properly. I just don't know.
:
: The only current solution is to use an Adaptec controller. I have
: personally had *extremely* good luck with Adaptecs: 2940UW, 7896 (or 97)
: U2W (on-motherboard), and 7890 (or 91) U2W (PCI card).
:
: I think part of the reason the problem has not been fixed is that many
: of the hardcore developers are using Adaptec controllers rather than NCR
: controllers and simply cannot reproduce it.
:
: -Matt
: Matthew Dillon
: <[EMAIL PROTECTED]>
: