Amanda users,
Sorry I took this offline with Jean-Louis. I was sending him multi-megabyte debug files and didn't
want to spam the list. Stefan Weichinger in particular indicated an interest in the outcome, so I'm
forwarding a tail of the exchange.
Jean-Louis sent me a patch for a possible race condition. That got rid of the dump failures with
"shm_ring cancelled." After applying the patch, the Amanda email report showed no errors except the
usual "strange" messages gnutar spits out when a log file changes while being backed up. I should be
able to get rid of those as well, since I now have everything using application app-amgtar.
I've attached the patch for those interested. It requires a rebuild of Amanda. Since I had kept the
source directory, and had already done the configure, all I needed to do was apply the patch, make,
and make install. amcheck daily looked good, and Amanda ran without a problem.
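For anyone repeating the rebuild, the steps amount to roughly the sketch below. The function name and its two arguments are placeholders, not anything Amanda ships; it assumes the source tree has already been configured and that the patch file is given as an absolute path. The -p1 strips the leading a/ and b/ from the paths in the diff, so it has to run from the top of the source directory.

```shell
# Apply a patch to an already-configured Amanda source tree and rebuild.
# Usage: rebuild_amanda SRC_DIR PATCH_FILE
# (hypothetical helper; SRC_DIR and PATCH_FILE are placeholders, and
# PATCH_FILE should be an absolute path because of the cd below)
rebuild_amanda() {
    cd "$1" || return 1
    # Check that the patch applies cleanly before touching any files.
    patch -p1 --dry-run < "$2" || return 1
    patch -p1 < "$2" || return 1
    make || return 1
    # Installing may need root, depending on the configured prefix.
    make install
}
```

Running amcheck against the configuration afterwards, as described above, is a reasonable sanity check before the next scheduled run.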
---------------
Chris Hoogendyk
-
O__ ---- Systems Administrator
c/ /'_ --- Biology & Geosciences Departments
(*) \(*) -- 315 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst
<[email protected]>
---------------
Erdös 4
-------- Forwarded Message --------
Subject: Re: Amanda 3.4.5 – failure dump summary – "shm_ring cancelled"
Date: Tue, 12 Sep 2017 11:19:14 -0400
From: Chris Hoogendyk <[email protected]>
To: Jean-Louis Martineau <[email protected]>
Not quite done yet (still has ~1.8TB left to write to tape), but
amanda@marlin:~/daily$ amstatus daily | grep '\/var'
back-auth.bio.mor.nsm:/var 20170911233001 0 196622k dump done (1+ 1:48:21), (daily) written (1+ 2:02:24)
localhost:/var/log 20170911233001 2 86537k dump done (1+ 0:27:05), (daily) written (1+ 0:27:15)
localhost:/var/mail 20170911233001 0 81811374k dump done (1+ 3:01:18), (daily) written (1+ 4:57:58)
morrill-auth.bio.mor.nsm:/var 20170911233001 0 789331k dump done (1+ 1:50:11), (daily) written (1+ 3:42:06)
snapper.bio.mor.nsm:/var 20170911233001 1 1266877k dump done (1+ 0:32:25), (daily) written (1+ 0:32:38)
amanda@marlin:~/daily$
indicates that it worked. I'll send an update when the report comes in.
On 9/11/17 5:29 PM, Jean-Louis Martineau wrote:
> On 11/09/17 05:08 PM, Chris Hoogendyk wrote:
>> Thank you.
>>
>> I believe I got that patched, built and installed. It was a one line change in
>> amanda-3.4.5/server-src/dumper.c, right?
>
> Yes, a single line.
>
>> It will run tonight (amcheck thought everything was alright), and I will let
>> you know tomorrow.
>
> I'm waiting for the result.
>
> Jean-Louis
On 9/11/17 4:33 PM, Jean-Louis Martineau wrote:
> Chris,
>
> I can't reproduce the issue but looking at the code I see a possible race.
> Can you try the attached patch? It requires a recompilation of Amanda.
>
> Jean-Louis
>
> On 11/09/17 03:54 PM, Chris Hoogendyk wrote:
>> One mistake in my message – both instances of /var are on other servers in
>> the server room, not localhost. Note that the Amanda server is running
>> Amanda 3.4.5, and the other servers are running Amanda 3.3.6.
>>
>>
>> On 9/11/17 1:29 PM, Chris Hoogendyk wrote:
>> > Seems to be the same DLEs. Over the weekend (Fri, Sa, Su), the same DLEs
>> > failed on every backup – /var on both localhost and on one other server
>> > in the server room.
>> >
>> > I've attached the appropriate log and debug files from Saturday night,
>> > focusing on snapper.bio.mor.nsm /var.
>> >
>> >
>> > On 9/8/17 10:46 AM, Jean-Louis Martineau wrote:
>> >> On 07/09/17 07:05 PM, Chris Hoogendyk wrote:
>> >> > I'm sorry. We just ran aptitude update/upgrade on the server for a
>> >> > kernel security patch and
>> >> > rebooted. /tmp is empty.
>> >> >
>> >> > I'll have to start over tomorrow if the same error occurs.
>> >> >
>> >> > How do I identify the chunker file if there are dozens of them for the
>> >> > time period? Do I grep
>> >> > 'snapper.bio.mor.nsm_var'?
>> >>
>> >> grep is the best method.
>> >> You grep for the DLE that will fail on that run.
>> >> Is it always the same DLE that fails that way?
>> >>
>> >> Jean-Louis
>> >>
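The grep approach suggested in the exchange above can be wrapped in a small helper. This is only a sketch: the function name is made up, the chunker.*.debug file naming is an assumption about how the debug files are named, and the exact string that identifies a DLE inside a chunker debug file is not confirmed anywhere in this thread, so the pattern may need adjusting.

```shell
# List chunker debug files in DIR whose contents mention PATTERN.
# Usage: find_dle_debug DIR PATTERN
# (hypothetical helper; the chunker.*.debug naming is an assumption)
find_dle_debug() {
    # -l prints only the names of matching files; -F treats PATTERN as a
    # fixed string, so the dots in a host name are not regex wildcards.
    grep -lF -- "$2" "$1"/chunker.*.debug 2>/dev/null
}

# Example call (the directory is an assumption; use wherever your build
# writes its server-side debug files):
#   find_dle_debug /tmp/amanda/server/daily 'snapper.bio.mor.nsm'
```

Since only the file names are printed, the result can be fed straight to an editor or pager to look at the run that failed.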
diff --git a/server-src/dumper.c b/server-src/dumper.c
index 78545a8..4dba018 100644
--- a/server-src/dumper.c
+++ b/server-src/dumper.c
@@ -3013,7 +3013,7 @@ stop_dump(void)
}
}
- if (dump_result > 0) {
+ if (dump_result > 1) {
if (g_databuf->shm_ring_producer) {
g_debug("stop_dump: cancelling shm-ring-producer");
g_databuf->shm_ring_producer->mc->cancelled = TRUE;