Recently the backup of a particular DLE has started hanging consistently. I don't see any recent changes on the server except possibly a change from kernel 3.10.0-1160.62.1.el7 to 3.10.0-1160.66.1.el7 on Jun 2nd, but the problem didn't start until June 6th.
The tar command is stuck trying to write out the name of the directory to amgtar: root 11321 11315 33 Jun14 ? 03:55:46 /usr/bin/tar --create --verbose --verbose --block-number --file - --directory /export/backup/rufous --no-check-device --listed-incremental /var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new --ignore-failed-read --totals --exclude-from /var/log/amanda/amgtar._export_backup_rufous.20220614211711.exclude . root 11432 11427 42 Jun14 ? 03:57:48 /usr/bin/tar --create --verbose --verbose --block-number --file - --directory /export/backup/rufous --no-check-device --listed-incremental /var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new --ignore-failed-read --totals --exclude-from /var/log/amanda/amgtar._export_backup_rufous.20220614234714.exclude . # strace -f -s 500 -p 11321 strace: Process 11321 attached write(2, " ./export/home/orion-admin/ansible-boulder/.git/objects/db/", 59 # strace -f -s 500 -p 11432 strace: Process 11432 attached write(2, " ./export/home/orion-admin/ansible-boulder/.git/objects/db/", 59 amgtar is stuck writing output to amandad: # strace -s 200 -p 11427 strace: Process 11427 attached write(4, "e_pylibssh.libs/\ndrwxr-x--- orion-admin/orion-admin 0 2022-05-23 17:31 ./export/home/orion-admin/.local/lib/python3.8/site-packages/bcrypt/\ndrwxr-x--- orion-admin/orion-admin 0 2022-05-23 17:3"..., 4096 amgtar debug has: Tue Jun 14 23:47:14.104627682 2022: pid 11427: ??-amgtar : pid 11427 ruid 0 euid 27120 ppid 11418 version 3.6.0: start at Tue Jun 14 23:47:14 2022 Tue Jun 14 23:47:14.104731978 2022: pid 11427: ??-amgtar : version 3.6.0 Tue Jun 14 23:47:14.104957990 2022: pid 11427: ??-amgtar : state_stream: 156 Tue Jun 14 23:47:14.110961812 2022: pid 11427: ??-amgtar : pid 11427 ruid 0 euid 27120 ppid 11418 version 3.6.0: rename at Tue Jun 14 23:47:14 2022 Tue Jun 14 23:47:14.110991922 2022: pid 11427: ??-amgtar : GNUTAR-PATH /bin/tar Tue Jun 14 23:47:14.111004028 2022: pid 11427: ??-amgtar : GNUTAR-LISTDIR /var/lib/amanda/gnutar-lists Tue Jun 14 23:47:14.111013955 2022: pid 11427: ??-amgtar : ONE-FILE-SYSTEM no Tue Jun 14 23:47:14.111023583 2022: pid 11427: ??-amgtar : SPARSE yes Tue Jun 14 23:47:14.111033039 2022: pid 11427: ??-amgtar : NO-UNQUOTE no Tue Jun 14 23:47:14.111042417 2022: pid 11427: ??-amgtar : ATIME-PRESERVE no Tue Jun 14 23:47:14.111051865 2022: pid 11427: ??-amgtar : ACLS no Tue Jun 14 23:47:14.111061481 2022: pid 11427: ??-amgtar : SELINUX no Tue Jun 14 23:47:14.111121121 2022: pid 11427: ??-amgtar : XATTRS no Tue Jun 14 23:47:14.111130900 2022: pid 11427: ??-amgtar : CHECK-DEVICE no Tue Jun 14 23:47:14.111140474 2022: pid 11427: ??-amgtar : SIZE ^ *Total bytes written: [0-9][0-9]* Tue Jun 14 23:47:14.111150269 2022: pid 11427: ??-amgtar : IGNORE : Directory is new$ Tue Jun 14 23:47:14.111159886 2022: pid 11427: ??-amgtar : IGNORE : Directory has been renamed Tue Jun 14 23:47:14.111169511 2022: pid 11427: ??-amgtar : NORMAL ^could not open conf file Tue Jun 14 23:47:14.111179212 2022: pid 11427: ??-amgtar : NORMAL ^Elapsed time: Tue Jun 14 23:47:14.111309287 2022: pid 11427: ??-amgtar : NORMAL ^Throughput Tue Jun 14 23:47:14.111321160 2022: pid 11427: ??-amgtar : NORMAL : directory is on a different filesystem; not dumped Tue Jun 14 23:47:14.111331690 2022: pid 11427: ??-amgtar : NORMAL : socket ignored$ Tue Jun 14 23:47:14.111342121 2022: pid 11427: ??-amgtar : NORMAL : File .* shrunk by [0-9][0-9]* bytes, padding with zeros Tue Jun 14 23:47:14.111397543 2022: pid 11427: ??-amgtar : NORMAL : Cannot add file .*: No such file or directory$ Tue Jun 14 23:47:14.111407572 2022: pid 11427: ??-amgtar : NORMAL : Error exit delayed from previous errors Tue Jun 14 23:47:14.111417962 2022: pid 11427: ??-amgtar : ERROR amgtar: error Tue Jun 14 23:47:14.114473250 2022: pid 11427: ??-amgtar : Spawning "/bin/tar /bin/tar -x --no-check-device -f -" in pipeline Tue Jun 14 23:47:14.124455374 2022: pid 11431: ??-amgtar : Executing: /bin/tar '--version' Tue Jun 14 23:47:14.126144343 2022: pid 11427: ??-amgtar : Spawning "/usr/bin/tar /usr/bin/tar --create --verbose --verbose --block-number --file - --directory /export/backup/rufous --no-check-device --listed-incremental /var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new --ignore-failed-read --totals --exclude-from /var/log/amanda/amgtar._export_backup_rufous.20220614234714.exclude ." in pipeline amandad debug has: Tue Jun 14 23:47:14.049156625 2022: pid 11417: ??-amandad : sending REP pkt: <<<<< CONNECT DATA 499999 MESG 499998 INDEX 499997 STATE 499996 OPTIONS features=ffffffff9efefbfffffffffffffff3fffbf79f00; >>>>> Tue Jun 14 23:47:14.049312862 2022: pid 11417: ??-amandad : received ACK pkt: <<<<< >>>>> Tue Jun 14 23:47:14.049335184 2022: pid 11417: ??-amandad : opening security stream (0x5567229f24f8) for fd 150 Tue Jun 14 23:47:14.049351969 2022: pid 11417: ??-amandad : opening security stream (0x5567229f2528) for fd 151 Tue Jun 14 23:47:14.049369455 2022: pid 11417: ??-amandad : opening security stream (0x5567229f2558) for fd 152 Tue Jun 14 23:47:14.049385426 2022: pid 11417: ??-amandad : opening security stream (0x5567229f2588) for fd 153 Tue Jun 14 23:47:14.049401254 2022: pid 11417: ??-amandad : security_close(handle=0x5567229e9780, driver=0x7f40c660f540 (BSDTCP)) Tue Jun 14 23:47:14.049414836 2022: pid 11417: ??-amandad : security_stream_close(0x5567229e98d0) Tue Jun 14 23:47:14.049631091 2022: pid 11417: ??-amandad : ignoring close stream 2 Tue Jun 14 23:47:14.096232682 2022: pid 11417: ??-amandad +9 : shm_ring_consumer_set_size Tue Jun 14 23:47:14.096611706 2022: pid 11417: ??-amandad +9 : shm_ring_to_security_stream Wed Jun 15 02:17:15.145660919 2022: pid 11417: ??-amandad : stream_read_callback: data is still flowing Wed Jun 15 02:17:15.154334223 2022: pid 11417: ??-amandad : security_stream_seterr(0x556722a0bc10, EOF) Wed Jun 15 02:17:15.154365691 2022: pid 11417: ??-amandad : process_writenetfd 13: 0 amandad seems to stuck in futex: # strace -f -p 11417 strace: Process 11417 attached with 2 threads [pid 11417] futex(0x7f40c0d2a9d0, FUTEX_WAIT, 11426, NULL <unfinished ...> [pid 11426] restart_syscall(<... resuming interrupted read ...> So a I guess a resource lock race of some sort. Interesting that it is so consistent though. gdb reports: Thread 2 (Thread 0x7f40c0d2a700 (LWP 11426)): #0 0x00007f40c433fd2f in do_futex_wait () from /lib64/libpthread.so.0 #1 0x00007f40c433fe07 in __new_sem_wait_slow () from /lib64/libpthread.so.0 #2 0x00007f40c433fea5 in sem_timedwait () from /lib64/libpthread.so.0 #3 0x00007f40c63ca785 in shm_ring_sem_wait (shm_ring=shm_ring@entry=0x5567229fae60, sem=0x7f40c6a85000) at shm-ring.c:231 #4 0x00007f40c63cace6 in shm_ring_to_security_stream (shm_ring=0x5567229fae60, netfd=0x556722a0bc10, crc=0x0) at shm-ring.c:410 #5 0x0000556720fdad0b in shm_ring_thread (cookie=0x5567229f24f8) at amandad.c:1577 #6 0x00007f40c5a7d5c5 in g_thread_proxy () from /lib64/libglib-2.0.so.0 #7 0x00007f40c4339ea5 in start_thread () from /lib64/libpthread.so.0 #8 0x00007f40c4062b0d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7f40c6a69840 (LWP 11417)): #0 0x00007f40c433b017 in pthread_join () from /lib64/libpthread.so.0 #1 0x00007f40c5a9afc7 in g_system_thread_wait () from /lib64/libglib-2.0.so.0 #2 0x00007f40c5a7da25 in g_thread_join () from /lib64/libglib-2.0.so.0 #3 0x0000556720fdbfd5 in process_writenetfd (cookie=0x5567229f24f8, buf=<optimized out>, size=<optimized out>) at amandad.c:1609 #4 0x00007f40c63c6fc0 in sec_tcp_conn_read_callback (cookie=0x5567229e79b0) at security-util.c:2474 #5 0x00007f40c63b80de in event_handle_callback (user_ptr=0x5567229e7f50) at event.c:118 #6 0x00007f40c5a56119 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0 #7 0x00007f40c5a56478 in g_main_context_iterate.isra.19 () from /lib64/libglib-2.0.so.0 #8 0x00007f40c5a5652c in g_main_context_iteration () from /lib64/libglib-2.0.so.0 #9 0x00007f40c63b7fdd in event_loop_wait (wait_eh=0x0, nonblock=0, return_when_empty=1) at event.c:427 #10 0x00007f40c63b841e in event_loop (nonblock=<optimized out>) at event.c:328 #11 0x0000556720fd9d76 in main (argc=5, argv=0x7ffcbd7f2508) at amandad.c:535 (gdb) list 1604 full_write(dh->fd_write, buf, (size_t)size); 1605 } else { 1606 g_debug("process_writenetfd %d: %zd", dh->fd_write, size); 1607 aclose(dh->fd_write); 1608 if (as->thread && dh->shm_ring) { 1609 g_thread_join(as->thread); 1610 close_consumer_shm_ring(dh->shm_ring); 1611 dh->shm_ring = NULL; 1612 as->thread = NULL; 1613 } So it looks like a write to fd 13 failed and now it's trying to join the other thread but the other thread is waiting for something else? Not sure what fd 13 was. amdump: driver: hdisk-state time 13625.015 hdisk 0: free 897061131 dumpers 2 driver: result time 13625.015 from chunker1: FAILED 01-00002 "Got empty header" driver: send-cmd time 13625.015 to chunker1: QUIT dumper: kill index command driver: state time 13625.017 free kps: 7480 space: 897061131 taper: down idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle: no-bandwidth driver: interface-state time 13625.017 if default: free 7480 driver: hdisk-state time 13625.017 hdisk 0: free 897061131 dumpers 2 driver: result time 13625.017 from dumper1: FAILED 01-00002 "[data timeout]" driver: requeue dump time 13625.024 saga /export/backup/rufous 20220614200007 driver: started chunker1 pid 11409 driver: send-cmd time 13625.043 to chunker1: START 20220614200007 driver: send-cmd time 13625.043 to chunker1: SHM-WRITE 01-00025 /export/amanda/Data/data/holding/20220614200007/saga._export_backup_rufous.0 saga ffffffff9efefbfffffffffffffff3fffbf79f00 /export/backup/rufous 0 20220614200007 9007199254740896 APPLICATION 91707744 |;auth=BSDTCP;comp-cust=/usr/local/bin/zstd-amanda;index;exclude-list=/usr/local/lib/amanda/exclude.gtar; driver: state time 13625.043 free kps: 7480 space: 897061131 taper: down idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle: no-bandwidth driver: interface-state time 13625.043 if default: free 7480 driver: hdisk-state time 13625.043 hdisk 0: free 897061131 dumpers 2 driver: state time 13625.525 free kps: 7480 space: 897061131 taper: down idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle: no-bandwidth driver: interface-state time 13625.525 if default: free 7480 driver: hdisk-state time 13625.525 hdisk 0: free 897061131 dumpers 2 driver: result time 13625.525 from chunker1: SHM-NAME 01-00025 45615 /amanda_shm_control-11409-0 driver: send-cmd time 13625.525 to dumper1: SHM-DUMP 01-00025 45615 NULL 2 saga ffffffff9efefbfffffffffffffff3fffbf79f00 /export/backup/rufous NODEVICE 0 1970:1:1:0:0:0 amgtar "" "" "" "" "" "" "" 1 "" "" BSDTCP AMANDA /amanda_shm_control-11409-0 20 |" <auth>BSDTCP</auth>\n <compress>CUSTOM<custom-compress-program>/usr/local/bin/zstd-amanda</custom-compress-program>\n </compress>\n <record>YES</record>\n <index>YES</index>\n <datapath>AMANDA</datapath>\n <exclude>\n <list>/usr/local/lib/amanda/exclude.gtar</list>\n </exclude>\n"""" <backup-program>\n <plugin>amgtar</plugin>\n <property>\n <name>atime-preserve</name>\n <value>no</value>\n </property>\n <property>\n <name>one-file-system</name>\n <value>no</value>\n </property>\n <property>\n <name>check-device</name>\n <value>NO</value>\n </property>\n </backup-program>\n" driver: state time 14304.799 free kps: 7480 space: 897061131 taper: down idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle: no-bandwidth driver: hdisk-state time 22627.693 hdisk 0: free 889377437 dumpers 1 driver: result time 22627.694 from chunker1: FAILED 01-00025 "Got empty header" driver: send-cmd time 22627.694 to chunker1: QUIT driver: state time 22627.694 free kps: 54377 space: 889377437 taper: down idle-dumpers: 5 qlen runq: 24 directq: 0 roomq: 0 wakeup: 0 driver-idle: no-bandwidth driver: interface-state time 22627.694 if default: free 54377 driver: hdisk-state time 22627.694 hdisk 0: free 889377437 dumpers 1 driver: result time 22627.694 from dumper1: FAILED 01-00025 "[data timeout]" -- Orion Poplawski IT Systems Manager 720-772-5637 NWRA, Boulder/CoRA Office FAX: 303-415-9702 3380 Mitchell Lane or...@nwra.com Boulder, CO 80301 https://www.nwra.com/
smime.p7s
Description: S/MIME Cryptographic Signature