Hello,
We are using Samba 3.4.6 (packaged by opencsw.org) against Active Directory
2003 on our primary University filestore. The operating system is Solaris 10
Update 10. We have a number of domain controllers. For the past two days our
main filestore has been failing connections from a number of clients.
When using smbclient (or indeed any client) to connect to the Samba server we
see logs similar to the following at log level 3:
[2012/07/18 20:00:33.762539, 3] libsmb/namequery.c:2461(get_dc_list)
get_dc_list: preferred server list: ", *"
[2012/07/18 20:00:43.756966, 1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2
(Interrupted system call)
[2012/07/18 20:00:43.757104, 0]
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
tdb_chainlock_with_timeout_internal: alarm (10) timed out for key
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:43.757214, 1] lib/server_mutex.c:74(grab_named_mutex)
Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:00:53.756881, 1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2
(Interrupted system call)
[2012/07/18 20:00:53.757009, 0]
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
tdb_chainlock_with_timeout_internal: alarm (10) timed out for key
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:53.757130, 1] lib/server_mutex.c:74(grab_named_mutex)
Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.756905, 1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2
(Interrupted system call)
[2012/07/18 20:01:03.757102, 0]
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
tdb_chainlock_with_timeout_internal: alarm (10) timed out for key
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:01:03.757260, 1] lib/server_mutex.c:74(grab_named_mutex)
Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.757420, 0] auth/auth_domain.c:292(domain_client_validate)
domain_client_validate: Domain password server not available.
[2012/07/18 20:01:03.757527, 2] auth/auth.c:319(check_ntlm_password)
check_ntlm_password: Authentication for user [db2z07] -> [db2z07] FAILED
with error NT_STATUS_NO_LOGON_SERVERS
After reading through the Samba source code it looks like whenever a new
session setup happens it tries to authenticate the user, but to do this it must
first lock a key in the mutex.tdb file. It tries to lock the key but fails
(three times) before giving up (presumably because another process has it
locked). Sadly, when unable to lock the key in the mutex TDB file, the code
throws a "NT_STATUS_NO_LOGON_SERVERS" (despite the fact it didn't try to
connect to a logon server) giving the message "Domain password server not
available".
When using ONE of our domain controllers - UOS-ADS00003-SI - no problems occur.
When Samba switches to using another domain controller (such as UOS-ADS00001-SI
or UOS-ADS00002-SI) then the errors (as shown in the above logs) occur again.
My current working theory is that there is a problem talking to some of our
domain controllers and one smbd locks the key in the mutex - preventing the
other smbd processes from getting a lock (and thus resulting in the above logs).
Sadly we can't find what is holding the lock, and with 1800 smbd processes
running most of the time it is very difficult to spot any
other errors talking to the domain controller. In the Samba source code there
is an ironic comment in the mutex locking code:
From source3/lib/util_tdb.c:
/* TODO: If we time out waiting for a lock, it might
* be nice to use F_GETLK to get the pid of the
* process currently holding the lock and print that
* as part of the debugging message. -- mbp */
Right now we've worked around the problem by forcing samba to use a particular
domain controller (password server = uos-ads00003-).
My questions are:
1. Can somebody implement the idea above, using F_GETLK to log the PID of the
process which has the mutex key locked?
2. Why does samba switch between domain controllers every so often?
3. Can anybody think of a way to determine what is holding the lock and why it
is holding the lock?
Sadly I cannot replicate the problem on other Solaris or Linux systems running
Samba.
I'd greatly appreciate any help anybody can offer!
Cheers,
David Bell
UNIX Systems Administrator
University of Southampton