Re: can't kill a non-numeric process ID

Debra S Baddorf Mon, 06 Jan 2020 13:36:18 -0800

> On Jan 6, 2020, at 3:13 PM, Chris Hoogendyk <[email protected]> wrote:
> 
> 
> On 9/5/19 6:48 PM, Nathan Stratton Treadway wrote:
>> On Thu, Sep 05, 2019 at 14:12:29 -0400, Chris Hoogendyk wrote:
>>> From various pieces of information, I decided there were two runs
>>> from August 31 and September 1st that were hung and their tapers
>>> were holding the drives. amcleanup -k said:
>>> 
>>>    amcleanup: no unprocessed logfile to clean up
>>>    amcleanup: /usr/local/sbin/amcleanupdisk stderr: amcleanupdisk: Can't kill a non-numeric process ID at /usr/local/share/perl/5.22.1/Amanda/Holding.pm line 244.
>>> 
>>>    amcleanup: /usr/local/sbin/amcleanupdisk stderr:
>>> 
>> On Thu, Sep 05, 2019 at 17:17:51 -0400, Chris Hoogendyk wrote:
>>> amanda? (or amcleanup being able to deal with multiple instances for
>>> that matter?) Is that a bug? Or just development that was never
>>> completed? And how difficult would it be to revise the code to do this?
>> Unfortunately I think only Jean-Louis really knew the answer to that,
>> but looking at the code for amcleanup it doesn't appear to make any
>> attempt to deal with multiple instances.
>> 
>> More generally, amcleanup simply looks for a "log" symlink in the
>> "logdir" directory, and processes the log.<DATESTAMP> pointed to by
>> that.  As far as I understand, that symlink is created each time amdump
>> starts, pointing to that instance's log file.
>> 
>> So, as soon as some new parallel instance starts, there's no longer any
>> "log" symlink pointing to the earlier instance(s)'s log file(s).  If
>> that latest instance then terminates cleanly (as, for example, was
>> probably the case for the instance at your site which gave up when it
>> couldn't find any available tape drives), then the "log" symlink will
>> continue to point to a "completed" log... even though earlier instances
>> are still out there running (or died without a clean shutdown).
>> 
>> I haven't tried it myself, but based on what I am reading it looks like
>> the next time you run into this situation, you should be able to
>> manually update the "log" symlink to point to the log.* file for a
>> still-running instance before you run "amcleanup", thus allowing that
>> particular instance to get cleaned up.  If you did this once for each
>> still-running instance, theoretically you'd end up with everything
>> properly killed and Amanda email reports for each one, etc....
>> 
>> (But note that you would need to make sure there was at least enough
>> free space on the holding disk that the "pid" files could be created
>> successfully, or you run into that "can't kill a non-numeric process ID"
>> bug....)
>> 
>> I suspect a "real" fix for this situation would involve some
>> re-architecting of the whole parallel-instances situation....
>> 
>> For example, in addition to the simple "log", "amdump", "amdump.1",
>> "amflush", and "amflush.1" symlinks currently used, perhaps there should
>> also be "<prefix>.<DATESTAMP>.running" symlink created at the start of
>> the run, and then removed as part of the end-of-run cleanup.  That way,
>> both amstatus and amcleanup could just search for *.running symlinks as
>> a way to detect still-running (or uncleanly shut down) instances.
>> 
>> But obviously that involves changing all the places where these files
>> and symlinks are initially created and where they are cleaned up... so
>> I'm not sure how hard that would be.
>> 
>>                                              Nathan
>> 
> 
> hmm.
> 
> Well, this situation came up again, and that didn't actually work.
> 
> I had two jobs running, one started Saturday evening and one started Sunday 
> evening. Both holding disks were 100% full. I ran amstatus and found that the 
> "current" run was flushing a reasonably large DLE and that there should be 
> more than sufficient space left on the tape. Half an hour later, I checked 
> again, and the numbers were identical. No progress. The web interface for the 
> tape library showed both tape drives idle, but with appropriate tapes loaded. 
> So, I issued an `amcleanup -k daily`. That, of course, worked fine. I got a 
> report for the Sunday night run and all its processes were gone.
> 
> I tried switching the log symlink to point to the Saturday night log file and 
> then running amcheck. That didn't work. So, I tried also changing the amdump 
> symlink to the Saturday night amdump file. The two together gave me the 
> output for the Saturday night run. It showed that the tape was full; and, 
> with the holding disks also full, 5 DLEs were waiting for dumping, one had 
> failed, and Amanda was simply hung waiting. Assuming that the symlinks were 
> doing the job, I issued an `amcleanup -k daily`. That seemed to work. The 
> processes were killed and a report was sent. However, the report was 
> virtually empty. So something else is missing in terms of symlinks or coding 
> or something.
> 
> Report from the two day old run is below.
> 
> Then I ran amcheck and everything looked fine, but amflush ran into the 
> "can't kill a non-numeric process ID" error. I found several empty pid files on the 
> holding disks and removed them. Then amflush launched alright. However, it 
> only offered me the last two runs to flush files from. The holding disks show 
> directories for a couple of other dates. How does that happen? How does one 
> clean that up?
> 
> Running Amanda version 3.5.1 on Ubuntu 16.04 Server LTS.
> 
> 
> -- 
> ---------------
> 
> Chris Hoogendyk
> 
> -
>   O__  ---- Systems Administrator
>  c/ /'_ --- Biology & Geosciences Departments
> (*) \(*) -- 315 Morrill Science Center
> ~~~~~~~~~~ - University of Massachusetts, Amherst
> 
> <[email protected]>
> 
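
For the empty pid files described above, a small helper can list them before anything is removed by hand. This is only a sketch: it assumes the files are literally named "pid", as the quoted wording suggests, and the function name and holding-disk paths are placeholders, not anything taken from Amanda itself.

```shell
#!/bin/sh
# Sketch only: report zero-length files named "pid" under the given directories.
# The file name "pid" is taken from the wording in the quoted messages; the
# holding-disk paths are placeholders you must supply yourself.

find_empty_pid_files() {
    # -size 0 matches zero-length files; nothing is deleted here.
    find "$@" -type f -name pid -size 0 -print
}
```

Something like `find_empty_pid_files /path/to/holding1 /path/to/holding2` would print the candidates, which can then be reviewed and removed manually once no Amanda process is still running.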


This is too simplistic, but when amflush shows nothing it is willing to flush,
I frequently delete things from the holding areas, especially when their date
is significantly old.
     rm -fr   <specifics>/amanda/daily/20191002*
does the trick.  Amanda never asks me about them or any such.
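
A slightly more cautious variant, as a sketch only: list the dated directories older than some cutoff first, look them over, and then delete by hand. The function name is made up for illustration, and it assumes the holding directories are named with a leading YYYYMMDD datestamp like the 20191002* example above.

```shell
#!/bin/sh
# Sketch only: list holding-disk datestamp directories older than a cutoff.
# The holding path and cutoff date are placeholders, not values from this thread.

list_old_holding_dirs() {
    holding=$1    # directory that contains the 20191002... subdirectories
    cutoff=$2     # YYYYMMDD; directories dated strictly before this are listed
    for dir in "$holding"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*; do
        [ -d "$dir" ] || continue              # skip non-directories and an unmatched glob
        name=${dir##*/}                        # strip the leading path
        stamp=$(printf '%s' "$name" | cut -c1-8)   # keep only the YYYYMMDD part
        if [ "$stamp" -lt "$cutoff" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Running `list_old_holding_dirs <specifics>/amanda/daily 20191101` would only print names; the rm is still a separate, deliberate step.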

Deb Baddorf
Fermilab