Amanda users,

Sorry I took this offline with Jean-Louis. I was sending him multi-megabyte debug files and didn't want to spam the list. Stefan Weichinger in particular indicated an interest in the outcome, so I'm forwarding a tail of the exchange.

Jean-Louis sent me a patch for a possible race condition. Applying it got rid of the dump failures reported as "shm_ring cancelled." With the patch in place, the Amanda email report showed no errors except the usual "strange" messages that gnutar emits when a log file changes while being backed up. I should be able to get rid of those as well, since I now have everything using application app-amgtar.

I've attached the patch for those interested. It requires a rebuild of Amanda. Since I had kept the source directory, and had already done the configure, all I needed to do was apply the patch, make, and make install. amcheck daily looked good, and Amanda ran without a problem.
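
For anyone repeating this, the steps above look roughly like the sketch below. The source directory, the patch file name, and the helper names are assumptions for illustration, not what was on my server; it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: paths and names below are assumptions, adjust to your setup.
SRC_DIR="${SRC_DIR:-$HOME/amanda-3.4.5}"            # assumed: already-configured source tree
PATCH_FILE="${PATCH_FILE:-$HOME/dumper-race.diff}"  # hypothetical name for the saved patch
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Apply the patch from the top of the source tree, then rebuild and reinstall.
rebuild_amanda() {
    run cd "$SRC_DIR" || return 1
    # -p1 strips the leading a/ and b/ from the paths in the diff
    run patch -p1 -i "$PATCH_FILE" || return 1
    run make || return 1
    # installing may need root, depending on the configured prefix
    run make install || return 1
}

rebuild_amanda
```

Set DRY_RUN=0 to actually run the steps, then run amcheck daily as the Amanda user before the next backup.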


---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geosciences Departments
 (*) \(*) -- 315 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

<[email protected]>

---------------

Erdös 4



-------- Forwarded Message --------
Subject:        Re: Amanda 3.4.5 – failure dump summary – "shm_ring cancelled"
Date:   Tue, 12 Sep 2017 11:19:14 -0400
From:   Chris Hoogendyk <[email protected]>
To:     Jean-Louis Martineau <[email protected]>



Not quite done yet (still has ~1.8TB left to write to tape), but

   amanda@marlin:~/daily$ amstatus daily | grep '\/var'
   back-auth.bio.mor.nsm:/var          20170911233001 0 196622k dump done (1+ 1:48:21), (daily)
   written (1+ 2:02:24)
   localhost:/var/log                  20170911233001 2 86537k dump done (1+ 0:27:05), (daily)
   written (1+ 0:27:15)
   localhost:/var/mail                 20170911233001 0 81811374k dump done (1+ 3:01:18), (daily)
   written (1+ 4:57:58)
   morrill-auth.bio.mor.nsm:/var       20170911233001 0 789331k dump done (1+ 1:50:11), (daily)
   written (1+ 3:42:06)
   snapper.bio.mor.nsm:/var            20170911233001 1 1266877k dump done (1+ 0:32:25), (daily)
   written (1+ 0:32:38)
   amanda@marlin:~/daily$

indicates that it worked. I'll send an update when the report comes in.


On 9/11/17 5:29 PM, Jean-Louis Martineau wrote:
> On 11/09/17 05:08 PM, Chris Hoogendyk wrote:
>> Thank you.
>>
>> I believe I got that patched, built and installed. It was a one line change in
>> amanda-3.4.5/server-src/dumper.c, right?
>
> Yes, a single line.
>
>> It will run tonight (amcheck thought everything was alright), and I will let you know tomorrow.
>
> I'm waiting for the result.
>
> Jean-Louis


On 9/11/17 4:33 PM, Jean-Louis Martineau wrote:
> Chris,
>
> I can't reproduce the issue but looking at the code I see a possible race.
> Can you try the attached patch? It requires a recompilation of Amanda.
>
> Jean-Louis
>
> On 11/09/17 03:54 PM, Chris Hoogendyk wrote:
>> One mistake in my message – both instances of /var are on other servers in the server room, not
>> localhost. Note that the Amanda server is running Amanda 3.4.5, and the other servers are running
>> Amanda 3.3.6.
>>
>>
>> On 9/11/17 1:29 PM, Chris Hoogendyk wrote:
>> > Seems to be the same DLEs. Over the weekend (Fri, Sa, Su), the same DLEs failed on every backup –
>> > /var on both localhost and on one other server in the server room.
>> >
>> > I've attached the appropriate log and debug files from Saturday night, focusing on
>> > snapper.bio.mor.nsm /var.
>> >
>> >
>> > On 9/8/17 10:46 AM, Jean-Louis Martineau wrote:
>> >> On 07/09/17 07:05 PM, Chris Hoogendyk wrote:
>> >> > I'm sorry. We just ran aptitude update/upgrade on the server for a
>> >> > kernel security patch and
>> >> > rebooted. /tmp is empty.
>> >> >
>> >> > I'll have to start over tomorrow if the same error occurs.
>> >> >
>> >> > How do I identify the chunker file if there are dozens of them for the
>> >> > time period? Do I grep
>> >> > 'snapper.bio.mor.nsm_var'?
>> >>
>> >> grep is the best method.
>> >> You grep for the DLE that will fail on that run.
>> >> Is it always the same DLE that fails that way?
>> >>
>> >> Jean-Louis
>> >>
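
For anyone doing the same search on their own server, the grep step discussed above can be sketched as a small shell function. The default debug directory, the chunker.* file-name pattern, and the function name are assumptions for illustration; check where your Amanda build actually writes its debug files.

```shell
#!/bin/sh
# Sketch: list chunker debug files that mention a given DLE string.
# find_chunker_debug is a hypothetical helper name; the default directory
# below is an assumption, substitute your build's debug directory.
find_chunker_debug() {
    pattern=$1                     # e.g. the host name of the failing DLE
    debug_dir=${2:-/tmp/amanda}    # assumed debug directory
    # -F: match the pattern as a fixed string; -l: print only matching file names
    find "$debug_dir" -type f -name 'chunker.*' -exec grep -l -F -e "$pattern" {} +
}

# Example call (uncomment to use):
# find_chunker_debug snapper.bio.mor.nsm /tmp/amanda
```

Matching on the host name as a fixed string avoids guessing how the disk name is encoded inside the debug file.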
diff --git a/server-src/dumper.c b/server-src/dumper.c
index 78545a8..4dba018 100644
--- a/server-src/dumper.c
+++ b/server-src/dumper.c
@@ -3013,7 +3013,7 @@ stop_dump(void)
        }
     }
 
-    if (dump_result > 0) {
+    if (dump_result > 1) {
        if (g_databuf->shm_ring_producer) {
            g_debug("stop_dump: cancelling shm-ring-producer");
            g_databuf->shm_ring_producer->mc->cancelled = TRUE;
