Recently the backup of a particular DLE has started hanging consistently.  I
don't see any recent changes on the server except possibly a change from
kernel 3.10.0-1160.62.1.el7 to 3.10.0-1160.66.1.el7 on Jun 2nd, but the
problem didn't start until June 6th.

The tar command is stuck trying to write out the name of the directory to 
amgtar:

root     11321 11315 33 Jun14 ?        03:55:46 /usr/bin/tar --create
--verbose --verbose --block-number --file - --directory /export/backup/rufous
--no-check-device --listed-incremental
/var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new
--ignore-failed-read --totals --exclude-from
/var/log/amanda/amgtar._export_backup_rufous.20220614211711.exclude .
root     11432 11427 42 Jun14 ?        03:57:48 /usr/bin/tar --create
--verbose --verbose --block-number --file - --directory /export/backup/rufous
--no-check-device --listed-incremental
/var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new
--ignore-failed-read --totals --exclude-from
/var/log/amanda/amgtar._export_backup_rufous.20220614234714.exclude .

# strace -f -s 500 -p 11321
strace: Process 11321 attached
write(2, " ./export/home/orion-admin/ansible-boulder/.git/objects/db/", 59

# strace -f -s 500 -p 11432
strace: Process 11432 attached
write(2, " ./export/home/orion-admin/ansible-boulder/.git/objects/db/", 59

amgtar is stuck writing output to amandad:

# strace -s 200 -p 11427
strace: Process 11427 attached
write(4, "e_pylibssh.libs/\ndrwxr-x--- orion-admin/orion-admin     0
2022-05-23 17:31
./export/home/orion-admin/.local/lib/python3.8/site-packages/bcrypt/\ndrwxr-x---
orion-admin/orion-admin     0 2022-05-23 17:3"..., 4096


amgtar debug has:

Tue Jun 14 23:47:14.104627682 2022: pid 11427: ??-amgtar   : pid 11427 ruid 0
euid 27120 ppid 11418 version 3.6.0: start at Tue Jun 14 23:47:14 2022
Tue Jun 14 23:47:14.104731978 2022: pid 11427: ??-amgtar   : version 3.6.0
Tue Jun 14 23:47:14.104957990 2022: pid 11427: ??-amgtar   : state_stream: 156
Tue Jun 14 23:47:14.110961812 2022: pid 11427: ??-amgtar   : pid 11427 ruid 0
euid 27120 ppid 11418 version 3.6.0: rename at Tue Jun 14 23:47:14 2022
Tue Jun 14 23:47:14.110991922 2022: pid 11427: ??-amgtar   : GNUTAR-PATH 
/bin/tar
Tue Jun 14 23:47:14.111004028 2022: pid 11427: ??-amgtar   : GNUTAR-LISTDIR
/var/lib/amanda/gnutar-lists
Tue Jun 14 23:47:14.111013955 2022: pid 11427: ??-amgtar   : ONE-FILE-SYSTEM no
Tue Jun 14 23:47:14.111023583 2022: pid 11427: ??-amgtar   : SPARSE yes
Tue Jun 14 23:47:14.111033039 2022: pid 11427: ??-amgtar   : NO-UNQUOTE no
Tue Jun 14 23:47:14.111042417 2022: pid 11427: ??-amgtar   : ATIME-PRESERVE no
Tue Jun 14 23:47:14.111051865 2022: pid 11427: ??-amgtar   : ACLS no
Tue Jun 14 23:47:14.111061481 2022: pid 11427: ??-amgtar   : SELINUX no
Tue Jun 14 23:47:14.111121121 2022: pid 11427: ??-amgtar   : XATTRS no
Tue Jun 14 23:47:14.111130900 2022: pid 11427: ??-amgtar   : CHECK-DEVICE no
Tue Jun 14 23:47:14.111140474 2022: pid 11427: ??-amgtar   : SIZE ^ *Total
bytes written: [0-9][0-9]*
Tue Jun 14 23:47:14.111150269 2022: pid 11427: ??-amgtar   : IGNORE :
Directory is new$
Tue Jun 14 23:47:14.111159886 2022: pid 11427: ??-amgtar   : IGNORE :
Directory has been renamed
Tue Jun 14 23:47:14.111169511 2022: pid 11427: ??-amgtar   : NORMAL ^could not
open conf file
Tue Jun 14 23:47:14.111179212 2022: pid 11427: ??-amgtar   : NORMAL ^Elapsed 
time:
Tue Jun 14 23:47:14.111309287 2022: pid 11427: ??-amgtar   : NORMAL ^Throughput
Tue Jun 14 23:47:14.111321160 2022: pid 11427: ??-amgtar   : NORMAL :
directory is on a different filesystem; not dumped
Tue Jun 14 23:47:14.111331690 2022: pid 11427: ??-amgtar   : NORMAL : socket
ignored$
Tue Jun 14 23:47:14.111342121 2022: pid 11427: ??-amgtar   : NORMAL : File .*
shrunk by [0-9][0-9]* bytes, padding with zeros
Tue Jun 14 23:47:14.111397543 2022: pid 11427: ??-amgtar   : NORMAL : Cannot
add file .*: No such file or directory$
Tue Jun 14 23:47:14.111407572 2022: pid 11427: ??-amgtar   : NORMAL : Error
exit delayed from previous errors
Tue Jun 14 23:47:14.111417962 2022: pid 11427: ??-amgtar   : ERROR amgtar: error
Tue Jun 14 23:47:14.114473250 2022: pid 11427: ??-amgtar   : Spawning
"/bin/tar /bin/tar -x --no-check-device -f -" in pipeline
Tue Jun 14 23:47:14.124455374 2022: pid 11431: ??-amgtar   : Executing:
/bin/tar '--version'
Tue Jun 14 23:47:14.126144343 2022: pid 11427: ??-amgtar   : Spawning
"/usr/bin/tar /usr/bin/tar --create --verbose --verbose --block-number --file
- --directory /export/backup/rufous --no-check-device --listed-incremental
/var/lib/amanda/gnutar-lists/saga_export_backup_rufous_0.new
--ignore-failed-read --totals --exclude-from
/var/log/amanda/amgtar._export_backup_rufous.20220614234714.exclude ." in 
pipeline

amandad debug has:

Tue Jun 14 23:47:14.049156625 2022: pid 11417: ??-amandad   : sending REP pkt:
<<<<<
CONNECT DATA 499999 MESG 499998 INDEX 499997 STATE 499996
OPTIONS features=ffffffff9efefbfffffffffffffff3fffbf79f00;
>>>>>
Tue Jun 14 23:47:14.049312862 2022: pid 11417: ??-amandad   : received ACK pkt:
<<<<<
>>>>>
Tue Jun 14 23:47:14.049335184 2022: pid 11417: ??-amandad   : opening security
stream (0x5567229f24f8) for fd 150
Tue Jun 14 23:47:14.049351969 2022: pid 11417: ??-amandad   : opening security
stream (0x5567229f2528) for fd 151
Tue Jun 14 23:47:14.049369455 2022: pid 11417: ??-amandad   : opening security
stream (0x5567229f2558) for fd 152
Tue Jun 14 23:47:14.049385426 2022: pid 11417: ??-amandad   : opening security
stream (0x5567229f2588) for fd 153
Tue Jun 14 23:47:14.049401254 2022: pid 11417: ??-amandad   :
security_close(handle=0x5567229e9780, driver=0x7f40c660f540 (BSDTCP))
Tue Jun 14 23:47:14.049414836 2022: pid 11417: ??-amandad   :
security_stream_close(0x5567229e98d0)
Tue Jun 14 23:47:14.049631091 2022: pid 11417: ??-amandad   : ignoring close
stream 2
Tue Jun 14 23:47:14.096232682 2022: pid 11417: ??-amandad +9   :
shm_ring_consumer_set_size
Tue Jun 14 23:47:14.096611706 2022: pid 11417: ??-amandad +9   :
shm_ring_to_security_stream
Wed Jun 15 02:17:15.145660919 2022: pid 11417: ??-amandad   :
stream_read_callback: data is still flowing
Wed Jun 15 02:17:15.154334223 2022: pid 11417: ??-amandad   :
security_stream_seterr(0x556722a0bc10, EOF)
Wed Jun 15 02:17:15.154365691 2022: pid 11417: ??-amandad   :
process_writenetfd 13: 0

amandad seems to stuck in futex:

# strace -f -p 11417
strace: Process 11417 attached with 2 threads
[pid 11417] futex(0x7f40c0d2a9d0, FUTEX_WAIT, 11426, NULL <unfinished ...>
[pid 11426] restart_syscall(<... resuming interrupted read ...>

So a I guess a resource lock race of some sort.  Interesting that it is so
consistent though.

gdb reports:

Thread 2 (Thread 0x7f40c0d2a700 (LWP 11426)):
#0  0x00007f40c433fd2f in do_futex_wait () from /lib64/libpthread.so.0
#1  0x00007f40c433fe07 in __new_sem_wait_slow () from /lib64/libpthread.so.0
#2  0x00007f40c433fea5 in sem_timedwait () from /lib64/libpthread.so.0
#3  0x00007f40c63ca785 in shm_ring_sem_wait
(shm_ring=shm_ring@entry=0x5567229fae60, sem=0x7f40c6a85000) at shm-ring.c:231
#4  0x00007f40c63cace6 in shm_ring_to_security_stream
(shm_ring=0x5567229fae60, netfd=0x556722a0bc10, crc=0x0) at shm-ring.c:410
#5  0x0000556720fdad0b in shm_ring_thread (cookie=0x5567229f24f8) at
amandad.c:1577
#6  0x00007f40c5a7d5c5 in g_thread_proxy () from /lib64/libglib-2.0.so.0
#7  0x00007f40c4339ea5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f40c4062b0d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f40c6a69840 (LWP 11417)):
#0  0x00007f40c433b017 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f40c5a9afc7 in g_system_thread_wait () from /lib64/libglib-2.0.so.0
#2  0x00007f40c5a7da25 in g_thread_join () from /lib64/libglib-2.0.so.0
#3  0x0000556720fdbfd5 in process_writenetfd (cookie=0x5567229f24f8,
buf=<optimized out>, size=<optimized out>) at amandad.c:1609
#4  0x00007f40c63c6fc0 in sec_tcp_conn_read_callback (cookie=0x5567229e79b0)
at security-util.c:2474
#5  0x00007f40c63b80de in event_handle_callback (user_ptr=0x5567229e7f50) at
event.c:118
#6  0x00007f40c5a56119 in g_main_context_dispatch () from 
/lib64/libglib-2.0.so.0
#7  0x00007f40c5a56478 in g_main_context_iterate.isra.19 () from
/lib64/libglib-2.0.so.0
#8  0x00007f40c5a5652c in g_main_context_iteration () from 
/lib64/libglib-2.0.so.0
#9  0x00007f40c63b7fdd in event_loop_wait (wait_eh=0x0, nonblock=0,
return_when_empty=1) at event.c:427
#10 0x00007f40c63b841e in event_loop (nonblock=<optimized out>) at event.c:328
#11 0x0000556720fd9d76 in main (argc=5, argv=0x7ffcbd7f2508) at amandad.c:535

(gdb) list
1604            full_write(dh->fd_write, buf, (size_t)size);
1605        } else {
1606            g_debug("process_writenetfd %d: %zd", dh->fd_write, size);
1607            aclose(dh->fd_write);
1608            if (as->thread && dh->shm_ring) {
1609                g_thread_join(as->thread);
1610                close_consumer_shm_ring(dh->shm_ring);
1611                dh->shm_ring = NULL;
1612                as->thread = NULL;
1613            }

So it looks like a write to fd 13 failed and now it's trying to join the other
thread but the other thread is waiting for something else?

Not sure what fd 13 was.

amdump:

driver: hdisk-state time 13625.015 hdisk 0: free 897061131 dumpers 2
driver: result time 13625.015 from chunker1: FAILED 01-00002 "Got empty header"
driver: send-cmd time 13625.015 to chunker1: QUIT
dumper: kill index command
driver: state time 13625.017 free kps: 7480 space: 897061131 taper: down
idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle:
no-bandwidth
driver: interface-state time 13625.017 if default: free 7480
driver: hdisk-state time 13625.017 hdisk 0: free 897061131 dumpers 2
driver: result time 13625.017 from dumper1: FAILED 01-00002 "[data timeout]"


driver: requeue dump time 13625.024 saga /export/backup/rufous 20220614200007
driver: started chunker1 pid 11409
driver: send-cmd time 13625.043 to chunker1: START 20220614200007
driver: send-cmd time 13625.043 to chunker1: SHM-WRITE 01-00025
/export/amanda/Data/data/holding/20220614200007/saga._export_backup_rufous.0
saga ffffffff9efefbfffffffffffffff3fffbf79f00 /export/backup/rufous 0
20220614200007 9007199254740896 APPLICATION 91707744
|;auth=BSDTCP;comp-cust=/usr/local/bin/zstd-amanda;index;exclude-list=/usr/local/lib/amanda/exclude.gtar;
driver: state time 13625.043 free kps: 7480 space: 897061131 taper: down
idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle:
no-bandwidth
driver: interface-state time 13625.043 if default: free 7480
driver: hdisk-state time 13625.043 hdisk 0: free 897061131 dumpers 2
driver: state time 13625.525 free kps: 7480 space: 897061131 taper: down
idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle:
no-bandwidth
driver: interface-state time 13625.525 if default: free 7480
driver: hdisk-state time 13625.525 hdisk 0: free 897061131 dumpers 2
driver: result time 13625.525 from chunker1: SHM-NAME 01-00025 45615
/amanda_shm_control-11409-0
driver: send-cmd time 13625.525 to dumper1: SHM-DUMP 01-00025 45615 NULL 2
saga ffffffff9efefbfffffffffffffff3fffbf79f00 /export/backup/rufous NODEVICE 0
1970:1:1:0:0:0 amgtar "" "" "" "" "" "" "" 1 "" "" BSDTCP AMANDA
/amanda_shm_control-11409-0 20 |"  <auth>BSDTCP</auth>\n
<compress>CUSTOM<custom-compress-program>/usr/local/bin/zstd-amanda</custom-compress-program>\n
 </compress>\n  <record>YES</record>\n  <index>YES</index>\n
<datapath>AMANDA</datapath>\n  <exclude>\n
<list>/usr/local/lib/amanda/exclude.gtar</list>\n  </exclude>\n""""
<backup-program>\n    <plugin>amgtar</plugin>\n    <property>\n
<name>atime-preserve</name>\n      <value>no</value>\n    </property>\n
<property>\n      <name>one-file-system</name>\n      <value>no</value>\n
</property>\n    <property>\n      <name>check-device</name>\n
<value>NO</value>\n    </property>\n  </backup-program>\n"
driver: state time 14304.799 free kps: 7480 space: 897061131 taper: down
idle-dumpers: 4 qlen runq: 31 directq: 0 roomq: 0 wakeup: 0 driver-idle:
no-bandwidth


driver: hdisk-state time 22627.693 hdisk 0: free 889377437 dumpers 1
driver: result time 22627.694 from chunker1: FAILED 01-00025 "Got empty header"
driver: send-cmd time 22627.694 to chunker1: QUIT
driver: state time 22627.694 free kps: 54377 space: 889377437 taper: down
idle-dumpers: 5 qlen runq: 24 directq: 0 roomq: 0 wakeup: 0 driver-idle:
no-bandwidth
driver: interface-state time 22627.694 if default: free 54377
driver: hdisk-state time 22627.694 hdisk 0: free 889377437 dumpers 1
driver: result time 22627.694 from dumper1: FAILED 01-00025 "[data timeout]"



-- 
Orion Poplawski
IT Systems Manager                         720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                 https://www.nwra.com/

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature

Reply via email to