I've been fighting with a samba install on top of a amd64 Ubuntu 10.4 system for a couple months now. I have a small flotilla of Windows systems running anything from Win 2000 up through Windows Server 2008. All save for the 2008 systems are setup with nightly NTBackup jobs with my samba server as the destination. The 2008 systems are running MS SQL and use the samba server as the destination for their scheduled DB dumps.
Samba is setup with a single share, and a single role login/password stored locally. 99% of the time this system has worked well. Randomly at night though anywhere from one to all systems will fail their backup jobs at the same time. The failure window rolls around between 11pm and 2am. There isn't a pattern to the machines that fail, and more annoyingly, most of the time only a few fail while others backing up at the same time show no issues. I can't find any cron jobs, system processes or network spikes to blame the problem on. So far I've done the following trying to stabilize this system: - Upgraded to a mainline kernel due to a rather big bug in Ubuntu 10.4 that would cause samba to flake out over the course of a week, requiring a reboot to resolve. (Launchpad bugid 474089) - Turned up samba's logging (lvl 2) to see what it reports - Locked samba to port 139, used iptables to drop all traffic to port 445 - Verified there are no limits being hit (nofiles was at 1024 for awhile, took a bit to figure out upstart's syntax for raising that limit...) - Upgraded from 3.4.2 to 3.5.4 using Maverick's packages - Tuned kernel settings for gigabit ethernet - Tuned samba's settings based on various online discussions The problems still persist. Looking at last night's failure I found the following log messages: log.smbd [2010/08/25 00:37:15.551263, 1] smbd/server.c:267(remove_child_pid) Scheduled cleanup of brl and lock database after unclean shutdown [2010/08/25 00:37:35.551964, 1] smbd/server.c:240(cleanup_timeout_fn) Cleaning up brl and lock database after unclean shutdown Those are the only messages logged for last night, and line up exactly with the backup failures. The smbd process was last restarted on 8/23 at 13:51. The night before I had a couple systems fail also and had the following log entries: [2010/08/24 00:49:59.265666, 1] smbd/server.c:267(remove_child_pid) Scheduled cleanup of brl and lock database after unclean shutdown [2010/08/24 00:50:19.281235, 1] smbd/server.c:240(cleanup_timeout_fn) Cleaning up brl and lock database after unclean shutdown [2010/08/24 00:50:19.281454, 2] lib/messages_local.c:289(message_notify) message to process 1828 failed - No such process [2010/08/24 00:50:19.281475, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1828 doesn't exist - deleting messages record [2010/08/24 00:50:19.281488, 2] lib/messages.c:127(traverse_fn) pid 1828 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411331, 2] lib/messages_local.c:289(message_notify) message to process 1838 failed - No such process [2010/08/24 00:50:19.411373, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1838 doesn't exist - deleting messages record [2010/08/24 00:50:19.411394, 2] lib/messages.c:127(traverse_fn) pid 1838 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411431, 2] lib/messages_local.c:289(message_notify) message to process 1712 failed - No such process [2010/08/24 00:50:19.411443, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1712 doesn't exist - deleting messages record [2010/08/24 00:50:19.411467, 2] lib/messages.c:127(traverse_fn) pid 1712 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411507, 2] lib/messages_local.c:289(message_notify) message to process 999 failed - No such process [2010/08/24 00:50:19.411519, 2] lib/messages_local.c:379(messaging_tdb_send) pid 999 doesn't exist - deleting messages record [2010/08/24 00:50:19.411531, 2] lib/messages.c:127(traverse_fn) pid 999 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411583, 2] lib/messages_local.c:289(message_notify) message to process 1848 failed - No such process [2010/08/24 00:50:19.411598, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1848 doesn't exist - deleting messages record [2010/08/24 00:50:19.411626, 2] lib/messages.c:127(traverse_fn) pid 1848 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411707, 2] lib/messages_local.c:289(message_notify) message to process 1713 failed - No such process [2010/08/24 00:50:19.411721, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1713 doesn't exist - deleting messages record [2010/08/24 00:50:19.411743, 2] lib/messages.c:127(traverse_fn) pid 1713 doesn't exist - deleting connections -1 [] [2010/08/24 00:50:19.411794, 2] lib/messages_local.c:289(message_notify) message to process 1892 failed - No such process [2010/08/24 00:50:19.411823, 2] lib/messages_local.c:379(messaging_tdb_send) pid 1892 doesn't exist - deleting messages record [2010/08/24 00:50:19.411837, 2] lib/messages.c:127(traverse_fn) pid 1892 doesn't exist - deleting connections -1 [] Looking at one of the machine specific logs for that window I saw: [2010/08/24 00:49:35.090722, 2] smbd/sesssetup.c:1390(setup_new_vc_session) setup_new_vc_session: New VC == 0, if NT4.x compatible we would close all old resources. [2010/08/24 00:49:35.098893, 2] smbd/sesssetup.c:1390(setup_new_vc_session) setup_new_vc_session: New VC == 0, if NT4.x compatible we would close all old resources. [2010/08/24 00:49:35.273094, 2] auth/auth.c:304(check_ntlm_password) check_ntlm_password: authentication for user [backup] -> [backup] -> [backup] succeeded [2010/08/24 00:49:35.294169, 1] smbd/service.c:1070(make_connection_snum) ravenclaw (192.168.1.8) connect to service backups initially as user backup (u id=34, gid=34) (pid 17741) [2010/08/24 00:49:35.332337, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=1) [2010/08/24 00:49:35.333801, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:35.334524, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=1) [2010/08/24 00:49:35.335791, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:35.336507, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=1) [2010/08/24 00:49:35.337803, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:35.338367, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=1) [2010/08/24 00:49:35.339624, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:35.768280, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=1) [2010/08/24 00:49:35.769585, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:35.770204, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=1) [2010/08/24 00:49:35.771495, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=0) NT_STATUS_OK [2010/08/24 00:49:59.104971, 0] lib/util_sock.c:675(write_data) [2010/08/24 00:49:59.105025, 0] lib/util_sock.c:1432(get_peer_addr_internal) getpeername failed. Error was Transport endpoint is not connected write_data: write failure in writing to client 0.0.0.0. Error Connection reset by peer [2010/08/24 00:49:59.167351, 0] smbd/process.c:79(srv_send_smb) Error writing 51 bytes to client. -1. (Transport endpoint is not connected) [2010/08/24 00:49:59.167981, 0] lib/util_sock.c:675(write_data) [2010/08/24 00:49:59.167995, 0] lib/util_sock.c:1432(get_peer_addr_internal) getpeername failed. Error was Transport endpoint is not connected write_data: write failure in writing to client 0.0.0.0. Error Broken pipe [2010/08/24 00:49:59.168006, 0] smbd/process.c:79(srv_send_smb) Error writing 55 bytes to client. -1. (Transport endpoint is not connected) [2010/08/24 00:49:59.168172, 0] lib/util_sock.c:675(write_data) [2010/08/24 00:49:59.168184, 0] lib/util_sock.c:1432(get_peer_addr_internal) getpeername failed. Error was Transport endpoint is not connected write_data: write failure in writing to client 0.0.0.0. Error Broken pipe [2010/08/24 00:49:59.168195, 0] smbd/process.c:79(srv_send_smb) Error writing 51 bytes to client. -1. (Transport endpoint is not connected) [2010/08/24 00:49:59.257943, 2] smbd/close.c:656(close_normal_file) backup closed file Windows Servers/Ravenclaw/RAVENCLAW.bkf (numopen=0) NT_STAT US_OK [2010/08/24 00:49:59.257981, 1] smbd/service.c:1251(close_cnum) ravenclaw (192.168.1.8) closed connection to service backups [2010/08/24 00:49:59.258013, 2] smbd/open.c:631(open_file) backup opened file Windows Servers/Ravenclaw/RAVENCLAW.bkf read=Yes write=Yes (numopen=1) [2010/08/24 00:49:59.260011, 2] smbd/open.c:631(open_file) backup opened file Windows Servers/Ravenclaw/RAVENCLAW.bkf read=No write=No (n umopen=2) [2010/08/24 00:49:59.261185, 2] smbd/close.c:656(close_normal_file) backup closed file Windows Servers/Ravenclaw/RAVENCLAW.bkf (numopen=1) NT_STAT US_OK [2010/08/24 00:49:59.261888, 2] smbd/open.c:631(open_file) backup opened file Windows Servers/Ravenclaw/RAVENCLAW.bkf read=No write=No (n umopen=2) [2010/08/24 00:49:59.263221, 2] smbd/close.c:656(close_normal_file) backup closed file Windows Servers/Ravenclaw/RAVENCLAW.bkf (numopen=1) NT_STAT US_OK [2010/08/24 00:49:59.894074, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=2) [2010/08/24 00:49:59.895333, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:49:59.895893, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=2) [2010/08/24 00:49:59.897157, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:49:59.897748, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=2) [2010/08/24 00:49:59.899061, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:49:59.899632, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=2) [2010/08/24 00:49:59.900945, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:50:00.092755, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=2) [2010/08/24 00:50:00.094545, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:50:00.095506, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=2) [2010/08/24 00:50:00.097609, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:50:00.098641, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=Yes write=Yes (numopen=2) [2010/08/24 00:50:00.100695, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:50:00.101654, 2] smbd/open.c:631(open_file) backup opened file A7B4G5J1.BJ3 read=No write=No (numopen=2) [2010/08/24 00:50:00.103585, 2] smbd/close.c:656(close_normal_file) backup closed file A7B4G5J1.BJ3 (numopen=1) NT_STATUS_OK [2010/08/24 00:50:01.508152, 2] smbd/close.c:656(close_normal_file) backup closed file Windows Servers/Ravenclaw/RAVENCLAW.bkf (numopen=0) NT_STAT US_OK [2010/08/24 00:50:03.032069, 1] smbd/service.c:1251(close_cnum) ravenclaw (192.168.1.8) closed connection to service backups That system started it's backup run at 8pm. Based on the above I found a few conversations online of others with the same symptom set and the suggestion to use iptables to drop port 445 in addition to disabling it in samba's conf. I've got the same logs for last night's run so obviously the iptables tweak didn't help. I'm at my wits end trying to sort this, anyone have any suggestions? My conf for good measure: [global] domain master = no local master = no lock directory = /var/run/samba smb ports = 139 workgroup = WORKGROUP server string = %h server (Samba, Ubuntu) wins support = no dns proxy = no log file = /var/log/samba/log.%m log level = 2 max log size = 1000 syslog = 0 panic action = /usr/share/samba/panic-action %d security = user encrypt passwords = true passdb backend = tdbsam obey pam restrictions = yes unix password sync = yes passwd program = /usr/bin/passwd %u passwd chat = *Enter\snew\s*\spassword:* %n\n *Retype\snew\s*\spassword:* %n\ n *password\supdated\ssuccessfully* . pam password change = yes map to guest = bad user socket options = TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=65536 SO_SNDBUF=65536 strict sync = no sync always = no load printers = no use sendfile = yes write cache size = 262144 max xmit = 65535 fake oplocks = no oplocks = yes # max protocol = SMB2 [backups] comment = Windows Backups path = /mnt/Storage/Disk2/Biddeford/Tier-2 browsable = yes guest ok = no read only = no create mask = 0755 valid users = backup user = backup Joshua Coombs -- To unsubscribe from this list go to the following URL and read the instructions: https://lists.samba.org/mailman/options/samba
