Hi Andy,

It sounds as if your problem is more related to the NIC driver then, I guess.

I just realized something else that I forgot to mention regarding crash/freeze causes: network buffers. It's an out-of-memory problem, but a bit more specific. When you have a ton of open TCP connections to a host, that host will allocate a lot of (kernel) memory for TCP transmit/receive buffers. In newer *Linux* kernels, this memory is being allocated in an adaptive manner - i.e. the kernel only allocates a small amount of memory to each TCP buffer, and then increases it as necessary (per connection, depending on transfer speed and network delay to the other peer). Older kernels, however, will allocate a fixed amount per socket, which can quickly eat up all available kernel memory.

I think I actually discussed this with FreeBSD developers a while ago (on this list even?), and they told me the FreeBSD kernel can only allocate max 2GB of kernel memory. I don't know if it allocates network buffers dynamically (i.e. as much memory as is necessary for each socket/connection) but 2GB is not a lot if each connection uses up e.g. 100K buffer memory. If you have e.g. 1GB available to network buffers, it means a max limit of 10k simultaneous connections on a server, regardless of how much memory it has.
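
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. The 100K per connection and 1GB pool are only the example figures from this mail, not measured FreeBSD values, and the function name is made up for illustration. It assumes every socket pins a fixed amount of buffer memory (i.e. no adaptive sizing).

```python
def max_connections(buffer_pool_bytes: int, per_connection_bytes: int) -> int:
    """Upper bound on simultaneous connections when each socket holds a
    fixed amount of kernel buffer memory (hypothetical helper)."""
    if per_connection_bytes <= 0:
        raise ValueError("per_connection_bytes must be positive")
    # Integer division: a partial buffer cannot serve another connection.
    return buffer_pool_bytes // per_connection_bytes


KB = 1024
GB = 1024 ** 3

# Example figures from the mail above (assumptions, not measurements).
pool = 1 * GB
per_conn = 100 * KB

print(max_connections(pool, per_conn))
```

With these figures the result lands at roughly ten thousand connections, and adding more RAM to the box does not raise it if the kernel memory ceiling stays where it is.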

Regards,

  /Ragnar



On 09/03/2012 06:14 AM, Andy Young wrote:
Hi Ragnar,

Thank you for the reply. That makes a lot of sense. I think the resources at risk had to do with the low level details of the network card. I experimented tonight with bumping the hw.igb.rxd and hw.igb.txd tunable parameters of the NIC driver to their max value of 4096 (the default was 256). This seems to have resolved the issue. Before bumping their values, my load test was crashing the network at about 350 simultaneous connections. The behavior I witnessed was the application server (jetty) would seize up first and I could see it was no longer responding through my ssh connections. If I killed it off right away, all of the connections got closed and everything went back to being fine. If I left it in that state for 30 seconds or so, the system became unrecoverable. The connections remained open according to netstat even after I had closed the server and client processes. Short of rebooting, nothing I did would close the connections down. After bumping the rxd and txd parameters as well as kern.ipc.nmbclusters, the problem seems to have gone away. I can now successfully simulate over 800 simultaneous connections and it hasn't crashed since. To be honest, I don't know what the rxd and txd parameters do but it seems to have helped.

Andy

On Sep 2, 2012 1:58 AM, "Ragnar Lonn" <rag...@gatorhole.com> wrote:

> Hi Andy,
>
> I work for an online load testing service (loadimpact.com
> <http://loadimpact.com/>) and what we see is that the most common cause
> when a server crashes during a load test is that it runs out of some
> vital system resource. Usually system memory, but network connections
> (sockets/file descriptors) are also a likely cause.
>
> You should have gotten some kind of error messages in the system log, but
> if the problem is easily repeatable I would set up monitoring of at least
> memory and file descriptors, and see if you are near the limits when the
> machine freezes.
>
> Regards,
>
>   /Ragnar

On Sat, Sep 1, 2012 at 10:44 PM, Andy Young <ayo...@mosaicarchive.com> wrote:

    I read through the driver man page, which is a great source of
    information. I see I'm using the Intel igb driver and it supports
    three tunables. Could I have exceeded the number of receive
    descriptors? What would the effect of this number being too low
    be? What about the Adaptive Interrupt Moderation?

    To clarify, I was simulating about 800 users simultaneously
    uploading files when the crash occurred.

    Thanks for any help or insights!!

    Andy

    NAME
         igb -- Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver

    LOADER TUNABLES
         Tunables can be set at the loader(8) prompt before booting the
         kernel or stored in loader.conf(5).

         hw.igb.rxd
                 Number of receive descriptors allocated by the driver.
                 The default value is 256.  The minimum is 80, and the
                 maximum is 4096.

         hw.igb.txd
                 Number of transmit descriptors allocated by the driver.
                 The default value is 256.  The minimum is 80, and the
                 maximum is 4096.

         hw.igb.enable_aim
                 If set to 1, enable Adaptive Interrupt Moderation.  The
                 default is to enable Adaptive Interrupt Moderation.


    On Sat, Sep 1, 2012 at 4:14 PM, Andy Young
    <ayo...@mosaicarchive.com> wrote:

        Last night one of our servers went offline while I was load
        testing it. When I got to the datacenter to check on it, the
        server seemed perfectly fine. Everything was running on it,
        there were no panics or any other sign of a hard crash. The
        only problem is the network was unreachable. I couldn't
        connect to the box even from a laptop directly attached to the
        ethernet port. I couldn't connect to anything from the box
        either. It was as if the network controller had seized up. I
        restarted netif and it didn't make a difference. Rebooting the
        machine however, solved the issue and everything went back to
        working great. I restarted the load testing and reproduced the
        problem twice more this morning so at least it's repeatable. It
        feels like a network controller / driver issue to me for a
        couple reasons. First, the problem affects the entire system.
        We're running FreeBSD 9 with about a half dozen jails. Most of
        the jails are running Apache but the one I was load testing
        was running Jetty. However, if it was my application code
        crashing I would expect the problem to at least be isolated to
        the jail that hosts it. Instead, the entire machine and all
        jails in it lose access to the network.

        Apart from not being able to access the network, I don't see
        any other signs of problems. This is the first major problem
        I've had to debug in FreeBSD so I'm not a debugging expert by
        any means. There are no error messages in /var/log/messages or
        dmesg apart from syslogd not being able to reach the
        network. If anyone has ideas on where I can look for more
        evidence of what is going wrong, I would really appreciate it.

        We're running FreeBSD 9.0-RELEASE-p3. The network controller
        is an Intel(R) PRO/1000 Network Connection version - 2.2.5
        configured with 6 ips using aliases, five of which are used
        for jails.

        Thank you for the help!!

        Andy





    --
    Andrew Young
    Mosaic Storage Systems, Inc
    http://www.mosaicarchive.com/

    Follow us on:
    Twitter <https://twitter.com/#%21/MosaicArchive>,
    Facebook <http://www.facebook.com/MosaicArchive>,
    Google Plus <https://plus.google.com/b/102077382489657821832/https://plus.google.com/b/104681960235222388167/104681960235222388167/posts>,
    Pinterest <http://pinterest.com/mosaicarchive/>






_______________________________________________
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "freebsd-hardware-unsubscr...@freebsd.org"
