Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and most of it shows itself in the planner.

Issue #1:

Quick test, when ANY DLE entry is deleted, the planner will error out with the message:

planner: assert: ep->estimate[0].level != -1 is false: file planner.c, line 1313
Aborted

Just remove any entries from a healthy set and run this command to test:

/usr/libexec/amanda/planner $SET --starttime now --log-filename /tmp/planner.log


Issue #2
In other, less well-defined situations, the planner will seemingly get to the end of the estimates, then get stuck in some read/close/fstat/futex/mmap loop:

[pid  2463] close(120)                  = -1 EBADF (Bad file descriptor)
[pid  2461] read(4,  <unfinished ...>
[pid  2463] close(121 <unfinished ...>
[pid  2461] <... read resumed> " 636/udp                     "..., 4096) = 4096
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2463] close(122 <unfinished ...>
[pid  2461] close(4 <unfinished ...>
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2461] <... close resumed> )       = 0
[pid  2463] close(123 <unfinished ...>
[pid  2461] munmap(0x7fd5e73d5000, 4096 <unfinished ...>
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2463] close(124 <unfinished ...>
[pid  2461] <... munmap resumed> )      = 0
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  2463] close(125 <unfinished ...>
[pid  2461] <... futex resumed> )       = 1
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2463] close(126 <unfinished ...>
[pid  2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120 <unfinished ...>
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2461] <... write resumed> )       = 120
[pid  2463] close(127 <unfinished ...>
[pid  2461] open("/etc/services", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2461] <... open resumed> )        = 4
[pid  2463] close(128 <unfinished ...>
[pid  2461] fstat(4,  <unfinished ...>
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2463] close(129 <unfinished ...>
[pid  2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330, ...}) = 0
[pid  2463] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid  2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid  2463] close(130 <unfinished ...>

While this is going, the planner grows to consume over 100GBytes(!!!) of memory before the OOM killer stops it.  The problem I'm able to reproduce right now is only logging a bit about ambind, which drew my attention but seems nonsensical as this was working until days ago with no server changes, and another set on the same system is still working just fine:

Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner: security_stream_close(0x7fd5cc0017e0) Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner: ambind failed: ambind: bind failed A: Permission denied Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner: stream_client: Could not bind to port in range 512-1023. Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner: security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP) error=ambind: bind failed A: Permission denied) Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner: security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP)) Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner: security_stream_close(0x7fd5dc001eb0) Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner: ambind failed: ambind: bind failed A: Permission denied Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner: stream_client: Could not bind to port in range 512-1023.


By doing nothing else but calling the planner from 3.5 instead of 3.5.1 (which needs a handful of 3.5 libs as well), both these issues are resolved.  I tracked it down even more, and making the old planner.c use newer libs gives it the same problems.  The single problematic lib is libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the issue, for whatever that's worth to anyone here.

I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...


Reply via email to