Hello,

We are using Samba 3.4.6 (packaged by opencsw.org) against Active Directory 
2003 on our primary University filestore. The operating system is Solaris 10 
Update 10. We have a number of domain controllers. For the past two days our 
main filestore has been failing connections from a number of clients.

When using smbclient (or indeed any client) to connect to the Samba server we 
see logs similar to the following at log level 3:

[2012/07/18 20:00:33.762539,  3] libsmb/namequery.c:2461(get_dc_list)
  get_dc_list: preferred server list: ", *"
[2012/07/18 20:00:43.756966,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 
(Interrupted system call)
[2012/07/18 20:00:43.757104,  0] 
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key 
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:43.757214,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:00:53.756881,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 
(Interrupted system call)
[2012/07/18 20:00:53.757009,  0] 
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key 
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:53.757130,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.756905,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 
(Interrupted system call)
[2012/07/18 20:01:03.757102,  0] 
lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key 
UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:01:03.757260,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.757420,  0] auth/auth_domain.c:292(domain_client_validate)
  domain_client_validate: Domain password server not available.
[2012/07/18 20:01:03.757527,  2] auth/auth.c:319(check_ntlm_password)
  check_ntlm_password:  Authentication for user [db2z07] -> [db2z07] FAILED 
with error NT_STATUS_NO_LOGON_SERVERS

After reading through the Samba source code it looks like whenever a new 
session setup happens it tries to authenticate the user, but to do this it must 
first lock a key in the mutex.tdb file. It tries to lock the key but fails 
(three times) before giving up (presumably because another process has it 
locked). Unfortunately, when it cannot lock the key in the mutex TDB file, the 
code returns NT_STATUS_NO_LOGON_SERVERS (despite never having tried to 
connect to a logon server), giving the message "Domain password server not 
available".

When using ONE of our domain controllers - UOS-ADS00003-SI - no problems occur. 
When Samba switches to using another domain controller (such as UOS-ADS00001-SI 
or UOS-ADS00002-SI) then the errors (as shown in the above logs) occur again. 
My current working theory is that there is a problem talking to some of our 
domain controllers and one smbd locks the key in the mutex, preventing the 
other smbd processes from getting a lock (and thus resulting in the above logs).

Sadly we can't find what is holding the lock, and with 1800 smbd processes 
running most of the time it is very difficult to find any other errors from 
talking to the domain controllers. In the Samba source code there is an 
ironic comment in the mutex locking code:

From source3/lib/util_tdb.c:

/* TODO: If we time out waiting for a lock, it might
 * be nice to use F_GETLK to get the pid of the
 * process currently holding the lock and print that
 * as part of the debugging message. -- mbp */
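
A minimal sketch of what that TODO describes, assuming the lock in question 
is an ordinary fcntl byte-range lock. The function names lock_holder_pid and 
log_lock_holder are hypothetical, and working out which byte offset 
corresponds to a given key in mutex.tdb is not attempted here; the caller 
would have to supply it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Samba or tdb.
 * Returns the PID of a process holding a lock that would block a write
 * lock on the byte at `offset` in fd, 0 if no such lock exists, or
 * (pid_t)-1 on error with errno set.
 */
pid_t lock_holder_pid(int fd, off_t offset)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;        /* ask: "could I take a write lock here?" */
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = 1;

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return (pid_t)-1;

    /* F_GETLK sets l_type to F_UNLCK when nothing conflicts. */
    if (fl.l_type == F_UNLCK)
        return 0;

    /* Otherwise fl describes one conflicting lock, including its owner. */
    return fl.l_pid;
}

/* Print the holder in the style of a debug message (format is made up). */
void log_lock_holder(int fd, off_t offset)
{
    pid_t pid = lock_holder_pid(fd, offset);

    if (pid > 0)
        fprintf(stderr, "lock at offset %lld is held by pid %ld\n",
                (long long)offset, (long)pid);
    else if (pid == 0)
        fprintf(stderr, "no conflicting lock at offset %lld\n",
                (long long)offset);
    else
        perror("fcntl(F_GETLK)");
}
```

Calling something like log_lock_holder() right after the timeout would be the 
natural place for it. Two caveats: F_GETLK reports only one conflicting lock 
even if there are several, and the holder may have released the lock between 
the timeout and the query, so the result is a hint rather than proof.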

Right now we've worked around the problem by forcing Samba to use a particular 
domain controller (password server = uos-ads00003-). 

My questions are:

1. Can somebody implement the idea above, logging the PID of the process 
that has the mutex key locked using F_GETLK?
2. Why does Samba switch between domain controllers every so often?
3. Can anybody think of a way to determine what is holding the lock and why it 
is holding the lock?

Sadly I cannot replicate the problem on other Solaris or Linux systems running 
Samba. 

I'd greatly appreciate any help anybody can offer!

Cheers,

David Bell
UNIX Systems Administrator
University of Southampton