Info:
- 4 windows 2000 domain controllers
- linux box joins the domain and uses Kerberos active directory authentication to shares
- distribution: Gentoo 1.4
- kernel 2.4.26 (stock sources)
- current version of samba: 3.0.4
- If anything else is needed please let me know
- configure command to compile:
./configure --prefix=/usr --sysconfdir=/etc/samba --localstatedir=/var --libdir=/usr/lib/samba
--with-privatedir=/etc/samba/private --with-lockdir=/var/cache/samba --with-piddir=/var/run/samba
--with-swatdir=/usr/share/swat --with-configdir=/etc/samba --with-logfilebase=/var/log/samba
--enable-static --enable-shared --with-manpages-langs=en --without-spinlocks --with-libsmbclient
--with-automount --with-smbmount --with-winbind --with-syslog --with-idmap --with-ldap
--with-ads --with-krb5 --with-pam
Problem:
After compiling and installing samba and copying the pam_winbind.so, libnss_winbind.so, and libnss_wins.so files to the appropriate directories, I start samba and winbind using a startup script. It takes about 30 seconds to a minute for authentication to start working (probably winbind talking to the DCs). Once it starts authenticating it works GREAT and continues to do so for anywhere from 3 days to a week. Then, at a certain point, winbind stops authenticating.
Since I have been having this problem for a while now, I have been monitoring winbindd. About 3 hours after I start winbindd, sockets in the CLOSE_WAIT state start accumulating in the output of the netstat -antupo command. All the sockets in this state are owned by the winbindd process, and they never close unless I kill it. Once the number of CLOSE_WAITs reaches around 1000, winbindd stops authenticating, samba crashes, and I can no longer ssh in (I can connect and authenticate, but right after successful authentication ssh reports a signal 11 error and drops the connection). I believe the ssh problem is caused by winbind because of all the sockets and port numbers it has tied up in the CLOSE_WAIT state. Once I restart winbindd and sshd, everything works fine again until the problem recurs.
After doing much research I found that it is usually the application that is not closing the socket correctly, due to a bug. At first I thought it might be the kernel, so I upgraded from 2.4.25 to 2.4.26, but the same symptoms appeared. After that I read on a developers' forum that if you kill the process that owns the sockets in the CLOSE_WAIT state and they disappear, then it is not a kernel issue. While monitoring winbindd I also noticed that its memory consumption steadily increases (maybe a leak?).
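To put a number on the memory growth mentioned above, something like the following could be used. It is only a sketch: the function name vm_rss_kb is made up for illustration, and it assumes the status file has a VmRSS line reported in kB.

```sh
#!/bin/sh
# Hypothetical helper (not part of samba): print the resident set size
# in kB from a /proc/<pid>/status file given as the first argument.
vm_rss_kb() {
    # The line looks like "VmRSS:     4512 kB"; field 2 is the number.
    awk '/^VmRSS:/ { print $2 }' "$1"
}
```

It could be called as vm_rss_kb "/proc/`cat /var/run/samba/winbindd.pid`/status" once an hour, so the values can be compared over time.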
I wanted to be able to show the developers and everyone else what I was seeing so I wrote a script and tossed in a cronjob to run every hour 10 minutes after the hour. The script runs the following commands and spits the output to a text file. This isn't the entire script but it is the meat of it.
LOG_FILE=`date +%F_%H.%M%P_winbind_info.log`
PREFIX=/var/log/winbind/
ps aux | grep PID >> $PREFIX$LOG_FILE
ps aux | grep winb >> $PREFIX$LOG_FILE
ps aux | grep mbd >> $PREFIX$LOG_FILE
cat "/proc/`cat /var/run/samba/winbindd.pid`/status" >> $PREFIX$LOG_FILE
netstat -antupo >> $PREFIX$LOG_FILE
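The full netstat dumps are long, so a single count per run may be easier to compare. The following is a rough sketch, not something from the original script: count_close_wait is a hypothetical name, and it assumes the usual Linux netstat -antupo column layout, where field 6 is the state and field 7 is "PID/Program name".

```sh
#!/bin/sh
# Hypothetical helper: read "netstat -antupo" output on stdin and count
# the sockets in CLOSE_WAIT that belong to the program named in $1.
count_close_wait() {
    awk -v prog="$1" '
        # Field 6 is the TCP state, field 7 is "PID/Program name".
        $6 == "CLOSE_WAIT" && $7 ~ ("/" prog "$") { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

For example: netstat -antupo | count_close_wait winbindd >> $PREFIX$LOG_FILE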
I put all the logs, from the minute I started winbindd up until now, on a webpage for people to see. They are in order by date and time, so you can see how things progress, including memory usage and the CLOSE_WAIT problem. Hopefully the developers can use this information. If not, it would be great if anyone has any idea why I have all these CLOSE_WAITs. I am replying to a previous post that I created, but back then I was just going to upgrade to see if I still had the same problems. And I did, as you can see. Any insight would be great. I would be glad to answer any questions or run any tests that people would like me to try. I have a test server and a production server and this problem happens on both.
Go to http://www.analoglove.com/winbind
Below is how the message ended the last time I posted about this.
Thank you very much for your time, Majeed Qulbain
Majeed wrote:
I'm going to install the new version, and report back in a week or so. Thanks for the reply!
Majeed
Tim Jordan wrote:
I saw there is a fix for winbind crashing in the latest release notes.
http://download.samba.org/samba/ftp/pre/
TJ
On Mon, 2004-04-05 at 10:25, Majeed wrote:
/I have also been seeing this over the last few weeks. For me it also happens randomly, as you stated. I am trying to pinpoint when it started, and I believe it started right after I upgraded the kernel from 2.4.24 to 2.4.25 (vanilla sources on Gentoo 1.4) (mremap problems), but I can't be too sure. Samba 3.0.2 was compiled with the following options:
./configure --prefix=/usr --sysconfdir=/etc/samba --localstatedir=/var --libdir=/usr/lib/samba --with-privatedir=/etc/samba/private --with-lockdir=/var/cache/samba --with-piddir=/var/run/samba --with-swatdir=/usr/share/swat --with-configdir=/etc/samba --with-logfilebase=/var/log/samba --enable-static --enable-shared --with-manpages-langs=en --without-spinlocks --with-libsmbclient --with-automount --with-smbmount --with-winbind --with-syslog --with-idmap --with-ldap --with-ads --with-krb5 --with-pam
Here are some symptoms I am seeing when the problem occurs.
Symptom 1) I cannot log in through ssh. It's weird because I can connect and enter my username and password, and it authenticates, but then the connection gets reset. There is even a line in the ssh log file that says access was granted. I then go to the console and log in.
Symptom 2) While logged into the console I run a "netstat -antu" and get some interesting results:
tcp 0 0 sambaserv_ip:44134 win2000dc_ip:139 CLOSE_WAIT
tcp 0 0 sambaserv_ip:44072 win2000dc_ip:139 CLOSE_WAIT
tcp 0 0 sambaserv_ip:44075 win2000dc_ip:139 CLOSE_WAIT
tcp 0 0 sambaserv_ip:44076 win2000dc_ip:139 CLOSE_WAIT
tcp 0 0 sambaserv_ip:44078 win2000dc_ip:139 CLOSE_WAIT
tcp 0 0 sambaserv_ip:44079 win2000dc_ip:139 CLOSE_WAIT
There are HUNDREDS of these CLOSE_WAIT lines, all with different ascending port numbers.
After restarting samba and winbind, netstat looked normal and everything worked as it should have.
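To see whether the stuck sockets all point at one DC or are spread across several, the netstat output can be grouped by remote address. This is a minimal sketch under the assumption that the output has the usual column layout (field 5 is the foreign address, field 6 is the state); the function name close_wait_by_peer is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: read "netstat -antu" output on stdin and print
# "<count> <remote address:port>" for CLOSE_WAIT sockets, busiest first.
close_wait_by_peer() {
    awk '
        $6 == "CLOSE_WAIT" { count[$5]++ }
        END { for (peer in count) print count[peer], peer }
    ' | sort -rn
}
```

For example: netstat -antu | close_wait_by_peer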
Symptom 3) While logged into the console I checked the samba log files, and log.winbind showed the following problems.
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
[2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
open_socket_in(): socket() call failed: Too many open files
Again there were HUNDREDS of these lines.
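"Too many open files" means the process has hit its file descriptor limit, and every socket stuck in CLOSE_WAIT still holds one descriptor. A rough way to watch this is to count the entries under /proc/<pid>/fd. The helper name count_open_fds is hypothetical, and reading another user's fd directory normally requires root.

```sh
#!/bin/sh
# Hypothetical helper: count the entries in a /proc/<pid>/fd directory
# given as $1. Each entry is one open file descriptor of that process.
count_open_fds() {
    # tr strips the padding that some wc implementations add.
    ls "$1" | wc -l | tr -d ' '
}
```

For example, count_open_fds "/proc/`cat /var/run/samba/winbindd.pid`/fd" can be compared with the output of ulimit -n in the shell that starts winbindd. On many Linux systems the default soft limit is 1024, which would fit failures starting near 1000 leaked sockets, but that is an assumption about these servers and not something confirmed here.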
So I think winbind might be the cause of the problems. This happens on both my production and my test server. Test server is mirrored to production for testing.
Today I am going to download the newest version of samba 3 and see if that helps. If it doesn't, then I might try a different kernel version. As mentioned before, all I do is restart samba and winbind and things will work perfectly for a random amount of time, usually 3 or more days, before it happens again.
Does anyone have any suggestions? Maybe some different things I could look for? Maybe different compile options?
Thanks Majeed Qulbain
Hoskinson, David P wrote:
We have a windows 2003 dc here at the university and I have successfully
set up samba-3.0.2-6.3E on a RHEL WS3 machine. The problem is that after
several hours, or several days, winbind stops running and connections
fail. I have seen instances of this on other sites, but no firm
answers. I can provide files and logs if helpful.
/
