On 3/18/19 12:20 PM, Ryan wrote: > Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and > most of it shows itself in the planner. > > Issue #1: > > Quick test, when ANY DLE entry is deleted, the planner will error out > with the message: > >> planner: assert: ep->estimate[0].level != -1 is false: file planner.c, >> line 1313 >> Aborted > > Just remove any entries from a healthy set and run this command to test: > >> /usr/libexec/amanda/planner $SET --starttime now --log-filename >> /tmp/planner.log
FWIW, I cannot reproduce this. Are there any other error messages?
>
> Issue #2
> In other, less well-defined situations, the planner will seemingly get
> to the end of the estimates, then get stuck in some
> read/close/fstat/futex/mmap loop:
>
>> [pid 2463] close(120)                 = -1 EBADF (Bad
>> file descriptor)
>> [pid 2461] read(4, <unfinished ...>
>> [pid 2463] close(121 <unfinished ...>
>> [pid 2461] <... read resumed> " 636/udp                 Â
>> Â Â "...,
>> 4096) = 4096
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(122 <unfinished ...>
>> [pid 2461] close(4 <unfinished ...>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... close resumed> )      = 0
>> [pid 2463] close(123 <unfinished ...>
>> [pid 2461] munmap(0x7fd5e73d5000, 4096 <unfinished ...>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(124 <unfinished ...>
>> [pid 2461] <... munmap resumed> )     = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
>> [pid 2463] close(125 <unfinished ...>
>> [pid 2461] <... futex resumed> )      = 1
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(126 <unfinished ...>
>> [pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120
>> <unfinished ...>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... write resumed> )      = 120
>> [pid 2463] close(127 <unfinished ...>
>> [pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC <unfinished ...>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... open resumed> )       = 4
>> [pid 2463] close(128 <unfinished ...>
>> [pid 2461] fstat(4, <unfinished ...>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(129 <unfinished ...>
>> [pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330,
>> ...}) = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
>> [pid 2463] close(130 <unfinished ...>
>
> While this is going, the planner grows to consume over 100GBytes(!!!) of
> memory before the OOM killer stops it. The problem I'm able to
> reproduce right now is only logging a bit about ambind, which drew my
> attention but seems nonsensical as this was working until days ago with
> no server changes, and another set on the same system is still working
> just fine:
>
>> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5cc0017e0)
>> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner:
>> stream_client: Could not bind to port in range 512-1023.
>> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner:
>> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680
>> (BSDTCP) error=ambind: bind failed A: Permission denied)
>> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner:
>> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
>> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5dc001eb0)
>> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner:
>> stream_client: Could not bind to port in range 512-1023.
>
>
> By doing nothing else but calling the planner from 3.5 instead of 3.5.1
> (which needs a handful of 3.5 libs as well), both these issues are
> resolved. I tracked it down even more, and making the old planner.c use
> newer libs gives it the same problems. The single problematic lib is
> libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the
> issue, for whatever that's worth to anyone here.
>
> I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
>
I haven't seen the memory issue either.
--
Orion Poplawski
Manager of NWRA Technical Systems 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 https://www.nwra.com/
smime.p7s
Description: S/MIME Cryptographic Signature
