On Thu, Sep 05, 2019 at 14:12:29 -0400, Chris Hoogendyk wrote:
> From various pieces of information, I decided there were two runs
> from August 31 and September 1st that were hung and their tapers
> were holding the drives. amcleanup -k said:
>
> amcleanup: no unprocessed logfile to clean up
> amcleanup: /usr/local/sbin/amcleanupdisk stderr: amcleanupdisk: Can't kill a non-numeric process ID at /usr/local/share/perl/5.22.1/Amanda/Holding.pm line 244.
>
> amcleanup: /usr/local/sbin/amcleanupdisk stderr:
>
On Thu, Sep 05, 2019 at 17:17:51 -0400, Chris Hoogendyk wrote:
> amanda? (or amcleanup being able to deal with multiple instances for
> that matter?) Is that a bug? Or just development that was never
> completed? And how difficult would it be to revise the code to do this?
Unfortunately I think only Jean-Louis really knew the answer to that,
but looking at the code for amcleanup it doesn't appear to make any
attempt to deal with multiple instances.
More generally, amcleanup simply looks for a "log" symlink in the
"logdir" directory, and processes the log.<DATESTAMP> pointed to by
that. As far as I understand, that symlink is created each time amdump
starts, pointing to that instance's log file.
So, as soon as a new parallel instance starts, there's no longer any
"log" symlink pointing to the log files of the earlier instances. If
that latest instance then terminates cleanly (as, for example, was
probably the case for the instance at your site which gave up when it
couldn't find any available tape drives), then the "log" symlink will
continue to point to a "completed" log... even though earlier instances
are still out there running (or died without a clean shutdown).
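To see which log the symlink currently points at, and which other log.*
files are sitting next to it, something like the following rough Python
sketch might help. It is not Amanda code: the function name is made up,
and it simply assumes the layout described above ("log" is a symlink in
the logdir, and per-run logs are named log.<DATESTAMP>). It cannot tell
you which of the other logs belong to live instances; that still needs a
human look.

```python
import os

def describe_log_symlink(logdir):
    """Return (current_target, other_logs) for an Amanda-style logdir.

    Hypothetical helper: assumes "log" is a symlink and that per-run
    logs are named log.<DATESTAMP>, as described in the message.
    """
    link = os.path.join(logdir, "log")
    current = None
    if os.path.islink(link):
        # readlink may return a path; keep only the file name part
        current = os.path.basename(os.readlink(link))
    # Every other log.* file is a candidate for an earlier instance
    others = sorted(
        name for name in os.listdir(logdir)
        if name.startswith("log.") and name != current
    )
    return current, others
```

Any log.* file in the second list that belongs to a run you know was hung
is one that amcleanup will not find on its own.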
I haven't tried it myself, but based on what I am reading it looks like
the next time you run into this situation, you should be able to
manually update the "log" symlink to point to the log.* file for a
still-running instance before you run "amcleanup", thus allowing that
particular instance to get cleaned up. If you did this once for each
still-running instance, theoretically you'd end up with everything
properly killed and Amanda email reports for each one, etc....
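The symlink update itself could be sketched like this in Python. Again
this is untested against a real Amanda installation; the function name
and the ".tmp" scratch name are my own inventions, and it refuses to touch
"log" if that turns out to be a regular file rather than a symlink.

```python
import os

def repoint_log_symlink(logdir, log_name):
    """Make logdir/log a symlink to log_name (a file inside logdir)."""
    target = os.path.join(logdir, log_name)
    if not os.path.isfile(target):
        raise FileNotFoundError(target)
    link = os.path.join(logdir, "log")
    # Safety check: never clobber a real file that happens to be named "log"
    if os.path.lexists(link) and not os.path.islink(link):
        raise RuntimeError("%s exists and is not a symlink" % link)
    tmp = link + ".tmp"        # hypothetical scratch name
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(log_name, tmp)  # relative target, like "ln -s log.X log"
    os.replace(tmp, link)      # swap the new symlink in over the old one
    return link
```

You would call this once per still-running instance, running amcleanup
after each call (as the Amanda user, so the ownership of the symlink
stays the same).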
(But note that you would need to make sure there was at least enough
free space on the holding disk that the "pid" files could be created
successfully, or you run into that "can't kill a non-numeric process ID"
bug....)
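A quick pre-flight check for that condition might look like the sketch
below. It assumes the pid files are literally named "pid" somewhere under
the holding directory, which is my reading of the error rather than
something I have verified; the function name and the 1 MiB default
threshold are arbitrary.

```python
import os
import shutil

def check_holding_disk(holding_dir, min_free_bytes=1024 * 1024):
    """Return (enough_space, bad_pid_files) for a holding directory.

    Hypothetical helper: a "pid" file counts as bad when its contents
    are not a plain decimal number (for example, an empty file left
    behind when the disk was full).
    """
    free = shutil.disk_usage(holding_dir).free
    bad = []
    for root, _dirs, files in os.walk(holding_dir):
        if "pid" in files:
            path = os.path.join(root, "pid")
            with open(path, "r", errors="replace") as fh:
                text = fh.read().strip()
            if not (text.isascii() and text.isdigit()):
                bad.append(path)
    return free >= min_free_bytes, sorted(bad)
```

If the second value is non-empty, those are the files I would expect to
trigger the "non-numeric process ID" error.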
I suspect a "real" fix for this situation would involve some
re-architecting of the whole parallel-instances situation....
For example, in addition to the simple "log", "amdump", "amdump.1",
"amflush", and "amflush.1" symlinks currently used, perhaps there should
also be a "<prefix>.<DATESTAMP>.running" symlink created at the start of
the run, and then removed as part of the end-of-run cleanup. That way,
both amstatus and amcleanup could just search for *.running symlinks as
a way to detect still-running (or uncleanly shut down) instances.
But obviously that involves changing all the places where these files
and symlinks are initially created and where they are cleaned up... so
I'm not sure how hard that would be.
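Just to illustrate the *.running idea (this is purely a sketch of the
proposal in Python, not anything Amanda does today, and all three
function names are hypothetical):

```python
import glob
import os

def mark_running(logdir, prefix, datestamp):
    """Start of run: create <prefix>.<DATESTAMP>.running as a symlink."""
    target = "%s.%s" % (prefix, datestamp)
    marker = os.path.join(logdir, target + ".running")
    os.symlink(target, marker)  # points at <prefix>.<DATESTAMP>
    return marker

def clear_running(marker):
    """End-of-run cleanup: remove the marker symlink."""
    os.remove(marker)

def find_running(logdir):
    """List markers left by running or uncleanly stopped instances."""
    pattern = os.path.join(glob.escape(logdir), "*.running")
    return sorted(p for p in glob.glob(pattern) if os.path.islink(p))
```

With something like this in place, amstatus and amcleanup would only
need find_running() to enumerate the instances that need attention,
no matter how many were started in parallel.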
Nathan
----------------------------------------------------------------------------
Nathan Stratton Treadway - [email protected] - Mid-Atlantic region
Ray Ontko & Co. - Software consulting services - http://www.ontko.com/
GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239
Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239