On 10/7/20 4:58 PM, Nathan Stratton Treadway wrote:
On Wed, Oct 07, 2020 at 14:54:30 -0400, Steve Ryan wrote:
I'm trying to debug an issue we've been having in our amanda
3.5.1 setup. Currently backups are failing every night due to (I
believe) the driver faulting. Relevant logs:
amdump mail report:
FAILURE DUMP SUMMARY:
chunker: FATAL Broken pipe at
/usr/lib64/perl5/vendor_perl/Amanda/IPC/LineProtocol.pm line 429.
chunker: FATAL Connection reset by peer at
/usr/lib64/perl5/vendor_perl/Amanda/IPC/LineProtocol.pm line 579.
dmesg:
2020-10-07T01:06:08.770127-04:00 vacuum.cs.umd.edu kernel: traps:
driver[25995] general protection ip:7f2a9ffe50ec sp:7ffc61f8b040
error:0 in libamanda-3.5.1.so[7f2a9ffaa000+81000]
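One thing that can be read straight off the dmesg line above is where inside libamanda the fault happened: subtracting the mapping base in the brackets from the `ip` value gives an offset into the library. The following is a minimal sketch of that arithmetic. The function name `fault_offset` and the regular expression are illustrative, not part of Amanda, and it assumes the "traps: ... general protection" line format shown in this report.

```python
import re

# Matches kernel "traps:" lines of the form shown above, e.g.
#   driver[25995] general protection ip:... sp:... error:0 in lib.so[base+size]
# The pattern is an assumption based on the single log line in this thread.
TRAP_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\] general protection "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) "
    r"error:(?P<err>[0-9a-f]+) in "
    r"(?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (process, library, offset) for a traps line, or None.

    offset is the faulting instruction pointer minus the mapping base,
    i.e. how far into the mapped library the fault occurred.
    """
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base
    # Sanity check: the instruction pointer should lie inside the mapping.
    if not 0 <= offset < size:
        return None
    return m["proc"], m["lib"], offset


if __name__ == "__main__":
    line = (
        "kernel: traps: driver[25995] general protection "
        "ip:7f2a9ffe50ec sp:7ffc61f8b040 error:0 in "
        "libamanda-3.5.1.so[7f2a9ffaa000+81000]"
    )
    proc, lib, offset = fault_offset(line)
    print(f"{proc} faulted at {lib}+{offset:#x}")
```

For the line above this works out to 0x7f2a9ffe50ec - 0x7f2a9ffaa000 = 0x3b0ec. If debug symbols for the library are installed, an offset like this can usually be passed to a tool such as `addr2line -e <path to the library> 0x3b0ec` to get a function name, assuming the mapping shown in the brackets starts at the beginning of the library file.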
The environment is roughly 80 nodes total, running mostly RHEL7
with some RHEL8 and ~3-5 Ubuntu/Debian machines. Everything is
running 3.5.1, straight from the official sources. I don't think
it's being caused by a client machine anyway, and some machines get
backed up each night.
I don't remember seeing this particular problem reported here before and
don't have any silver bullet...
Which distribution is the Amanda server running on?
Was this setup of Amanda-server-and-~80ish-clients ever working
properly at some point before this crashing started?
The Amanda server is running on RHEL7.7. It was working in the past;
according to the logs the issue first began on the March 1st backup. As
far as I can tell, no changes were made to the machine around that time.
Has anyone seen this issue before/know what debug info I should be
looking for in the logs?
If the driver process is indeed core dumping, you should see evidence
of that in /var/log/amanda/server/<CONFIG>/driver.<DATESTAMP>.debug for
that run. At the very least the log should end abruptly; if you are
lucky you might find a stack trace there, or something else giving a
clue as to what is happening just before the crash.
Looks like mostly just abrupt file ends. I don't see anything that
indicates a "Finished run"/etc and it doesn't end with any particular
clients. Still, I do see machines near the bottom more than others so
I'll try setting those to dontdump and see if it helps.
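The "which machines show up near the bottom" check can be made less of an eyeball exercise with a small script. This is only a sketch: `hosts_near_end` is a hypothetical helper, not an Amanda tool, and it assumes nothing about the driver debug format beyond client hostnames appearing as plain text in the log lines. The caller supplies the list of hostnames to look for.

```python
import re
from collections import Counter, deque


def hosts_near_end(log_paths, hostnames, tail_lines=50):
    """Count, per hostname, how many logs mention it in their last lines.

    log_paths  -- iterable of debug log file paths
    hostnames  -- hostnames to search for (supplied by the caller)
    tail_lines -- how many trailing lines of each log to inspect
    """
    # Build one pattern per host; the lookarounds avoid counting
    # "node1" when the log actually says "node10" or "xnode1".
    patterns = {
        host: re.compile(r"(?<![\w.-])" + re.escape(host) + r"(?![\w-])")
        for host in hostnames
    }
    counts = Counter()
    for path in log_paths:
        with open(path, errors="replace") as fh:
            # deque with maxlen keeps only the final tail_lines lines.
            tail = "".join(deque(fh, maxlen=tail_lines))
        for host, pattern in patterns.items():
            if pattern.search(tail):
                counts[host] += 1
    return counts
```

Run over the driver.<DATESTAMP>.debug files from several failed nights, `counts.most_common()` then lists the hosts that most often appear just before the log cuts off. A host that tops the list is only a candidate for dontdump testing, not proof that it triggers the crash.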
[...]
You can also look at the chunker.<DATESTAMP>.debug files in that same
directory to see if they give any additional hints, but offhand I'd
guess that they are just going to report that the chunker processes are
aborting due to the fact that the far side of the socket/pipe
disappeared, which presumably is caused by the driver process
crashing....
Interestingly, the only chunker logs I see don't *seem* to show any
errors. I'm also not 100% sure what they should look like though; I'll
research more.
--
-Steve Ryan
IT Analyst