Hi Ken,

Ken Cross wrote:
I've run into a problem with winbindd in both 2.2.x and 3.0 where it
just locks up after a while on large, busy networks.

We finally tracked down the problem to the fact that the C library
"select" function is limited by default to 256 file descriptors in
NetBSD (1024 in FreeBSD, 2048 in Linux).  So once 256 (or whatever) smbd
processes connected to winbindd, it broke pretty badly and was very hard
to kill.

This is set at compile-time, not run-time.  This line:

 #define FD_SETSIZE 2048  /* Max # of winbindd connections */

must occur before the first inclusion of <sys/types.h>.

This could be a build option, but it might be much simpler to hard-code
it in local.h, which is what I did to fix it.

Can somebody check the implications of this on Solaris, HPUX, etc.?
This will hardly do on HP-UX, because there is a kernel parameter
"maxfiles" controlling the per-process maximum number of file descriptors.

It defaults to 60 after installation, but is tunable (a reboot is
required). I would not recommend setting it too high, since it also
acts as a fuse against a single user's processes eating up all
available file descriptors (the system-wide total is controlled by
nfiles).

We have hit the limit *very* quickly on our Winbind production box,
of course, and I have increased maxfiles to 300. That is still quite
low when expecting a couple of hundred smbds to become winbind
clients, each of them consuming two FDs.

The solution (and this should also work on other platforms) was to
have winbindd housekeep its client connections by shutting down
idle connections, and have clients reconnect when required:

  http://lists.samba.org/pipermail/samba-technical/2003-February/042210.html

The threshold was chosen to be 100 active connections, which keeps
winbindd well below 300 FDs. Below 140, actually, including network
sockets and open database and log files.

This only works out well if clients don't connect too frequently,
however, and

  http://lists.samba.org/pipermail/samba-technical/2003-February/042170.html

helped achieve this.

I've been tracking winbindd shutting down sockets for about a week now,
and have extended the DEBUG line in remove_idle_client() to also print
the idle time of removal candidates.

With about 100 concurrent smbds (i.e. ~200 client pipes), it almost
always finds connections that have been idle for more than an hour.
I would assume forcing these to reconnect has no measurable impact,
and that the solution should scale to many times its current load.

It can't be applied directly to 3.0, however. I'm assuming that identifying
idle connections is more complicated there, as both read and write buffers
can be empty while waiting for a request to complete. But it should
nevertheless be possible.

Cheers!
Michael
