Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and
most of it shows itself in the planner.
Issue #1:
Quick test, when ANY DLE entry is deleted, the planner will error out
with the message:
planner: assert: ep->estimate[0].level != -1 is false: file planner.c,
line 1313
Aborted
Just remove any entries from a healthy set and run this command to test:
/usr/libexec/amanda/planner $SET --starttime now --log-filename
/tmp/planner.log
Issue #2
In other, less well-defined situations, the planner will seemingly get
to the end of the estimates, then get stuck in some
read/close/fstat/futex/mmap loop:
[pid 2463] close(120) = -1 EBADF (Bad file descriptor)
[pid 2461] read(4, <unfinished ...>
[pid 2463] close(121 <unfinished ...>
[pid 2461] <... read resumed> " 636/udp "...,
4096) = 4096
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(122 <unfinished ...>
[pid 2461] close(4 <unfinished ...>
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... close resumed> ) = 0
[pid 2463] close(123 <unfinished ...>
[pid 2461] munmap(0x7fd5e73d5000, 4096 <unfinished ...>
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(124 <unfinished ...>
[pid 2461] <... munmap resumed> ) = 0
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 2463] close(125 <unfinished ...>
[pid 2461] <... futex resumed> ) = 1
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(126 <unfinished ...>
[pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120
<unfinished ...>
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... write resumed> ) = 120
[pid 2463] close(127 <unfinished ...>
[pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... open resumed> ) = 4
[pid 2463] close(128 <unfinished ...>
[pid 2461] fstat(4, <unfinished ...>
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(129 <unfinished ...>
[pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330,
...}) = 0
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 2463] close(130 <unfinished ...>
While this is going, the planner grows to consume over 100GBytes(!!!) of
memory before the OOM killer stops it. The problem I'm able to
reproduce right now is only logging a bit about ambind, which drew my
attention but seems nonsensical as this was working until days ago with
no server changes, and another set on the same system is still working
just fine:
Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner:
security_stream_close(0x7fd5cc0017e0)
Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner:
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner:
stream_client: Could not bind to port in range 512-1023.
Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner:
security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680
(BSDTCP) error=ambind: bind failed A: Permission denied)
Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner:
security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner:
security_stream_close(0x7fd5dc001eb0)
Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner:
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner:
stream_client: Could not bind to port in range 512-1023.
By doing nothing else but calling the planner from 3.5 instead of 3.5.1
(which needs a handful of 3.5 libs as well), both these issues are
resolved. I tracked it down even more, and making the old planner.c use
newer libs gives it the same problems. The single problematic lib is
libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the
issue, for whatever that's worth to anyone here.
I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...