Re: Upgrade woes and eternal hanging of dumps

Debra S Baddorf Mon, 21 Sep 2015 09:39:47 -0700

YES!  I agree with the first and third of these tidbits.  I just couldn’t 
remember them.  I’ve had issues with both of them, including the tricky 
firewall timeout part in Idea Three.


Here’s hoping you have a network person who can add some skills or ideas at 
that level.  Or just don’t do client estimates, as in the first suggested 
fix.

I think we had to allow trusted clients to initiate their OWN connections back 
to the server (via a firewall rule), so that they could still talk to the 
server even after that server-created conversation had timed out.  That might 
count as fix #3, but it takes firewall skills.  It might be a slightly 
different problem/situation (it sounds a little different), but I think it’s 
in this same category somewhere.  Network-savvy people, can you translate my 
“generic English” description into what we actually did?

Deb Baddorf, Fermilab


On Sep 21, 2015, at 10:25 AM, Joi L. Ellis <[email protected]> wrote:

> I've just read through the long thread prompted by this particular post.  I'd 
> like to offer a few points I didn't see mentioned before...
> 
> Idea one: You upgraded from 2.5 to 3.3.  2.5 amdump only spoke UDP with a 
> 'bsd' auth protocol, so that was the only action available.  Thus, inetd.conf 
> didn't specify an -auth=bsd parameter.  3.3 defaults to -auth=bsdtcp if you 
> don't provide it.  Does your new configuration specify that those clients 
> must be reached with -auth=bsd from the new server, rather than the server's 
> new default of -auth=bsdtcp? 
> 
> Idea two: If any of the involved machines are running iptables or ufw 
> firewalls, verify the new configuration is still loading the correct kernel 
> modules. At one point the /etc/default/ufw.conf file named kernel modules 
> incorrectly after an upgrade, and/or the nf_conntrack_amanda module itself 
> went missing.  (Some kernels change the name of this module; usually it's the 
> first two characters that differ.)  The symptom here is that amcheck thinks 
> everything is fine, yet the actual amdump process fails because the UDP control 
> conversation between the server and the client is allowed, but the TCP data 
> stream amdump uses with -auth=bsdtcp is blocked.
> 
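
One way to check the module part of Idea two on a Linux host is to look for the Amanda connection-tracking helper in /proc/modules. The sketch below is only an illustration: the function name is made up here, and it assumes the helper is called nf_conntrack_amanda or, on older kernels, ip_conntrack_amanda. A helper built into the kernel rather than loaded as a module will not appear in /proc/modules, so "none found" is a hint, not proof.

```python
from pathlib import Path

# Assumed helper names: nf_conntrack_amanda on current kernels,
# ip_conntrack_amanda on older ones (only the prefix differs).
HELPER_SUFFIX = "_conntrack_amanda"


def amanda_helpers(proc_modules_text):
    """Return loaded module names that look like the Amanda conntrack helper."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0].endswith(HELPER_SUFFIX):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    found = amanda_helpers(text)
    print("loaded:", ", ".join(found) if found else "none found")
```

Running this on the server and on each client with a host firewall, before and after an upgrade, would show whether the helper has disappeared or changed its name.
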
> Idea Three: I run an Amanda 3.3.3 server, and I have experienced a similar 
> problem to your own.  I've tried posting about it here in the past and got 
> no response, so I gave up asking for help and figured out my own 
> workarounds.
> 
> My amanda server is behind a corporate firewall, and some of the clients are 
> in the DMZ, outside the firewall... and they are running amanda 2.5 due to 
> the age of the client hosts.   I've had repeated issues with the corporate 
> firewall interfering with the planner.  
> 
> The issue I see is that the amanda server planner fires off a UDP 
> "connection" to the client, asking the client to provide estimates.  The 
> client does so... BUT.  That blasted firewall has created a dynamic NAT rule 
> that will allow the client to send back its UDP response.  IF the client's 
> response doesn't appear before the NAT rule expires, the planner falls into a 
> permanent wait state, waiting for a UDP response that will never arrive 
> because the firewall has blocked it.  The client has no idea it failed, and 
> its logs look entirely normal.
> 
> If you dig into the server's logs, you will probably find TIMEOUT errors in 
> the logs from the planner.  I don't have any recent logs that illustrate this 
> error, so I can't quote an example.
> 
> I worked around this in two ways (which one depends on the client situation):
> 
>  *) tell amanda to not use the client to create the estimate at all
>  *) adjust the NAT timeout rules on the firewall to extend the timeout.  As I 
> recall, it was initially set to 120 seconds.  We moved it up to 300 seconds 
> at one point, but then began to experience issues with the firewall filling 
> memory tables because rules weren't timing out fast enough.
> 
> As I see it, the planner makes the (unsafe) assumption that IF its initial 
> request-an-estimate packets traveled properly, the response will always do 
> so.  If there is a firewall involved, the response might get lost, yet the 
> planner will sit there forever, twiddling its thumbs and not backing up 
> anything, until it receives the missing estimates packet back from the 
> client.
> 
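
The failure mode described above, a receiver waiting indefinitely for a UDP reply that a firewall has dropped, can be reproduced in a few lines. This is not Amanda's code. It is a minimal sketch with made-up names (wait_for_reply and the loopback sockets are assumptions for the demo) showing how a bounded wait turns a silent hang into a visible timeout.

```python
import socket


def wait_for_reply(sock, timeout_seconds):
    """Wait for one UDP datagram; return its payload, or None on timeout."""
    sock.settimeout(timeout_seconds)
    try:
        data, _addr = sock.recvfrom(65535)
        return data
    except socket.timeout:
        # No datagram arrived in time, e.g. because a firewall dropped it.
        return None


if __name__ == "__main__":
    # Hypothetical loopback demo: "server" waits, "client" replies once.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    client.sendto(b"estimate", server.getsockname())
    print(wait_for_reply(server, 1.0))  # the reply was delivered
    print(wait_for_reply(server, 0.2))  # nothing sent: stands in for a dropped reply

    server.close()
    client.close()
```

Without the timeout, the second call would block forever, which matches the planner symptom: the sender believes it answered, and the receiver never learns that the answer was lost.
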
> To summarize, I suspect that the move from 2.5's UDP-only communication style 
> to 3.3's default TCP-only style has broken something in your environment that 
> you've overlooked.  Either the server, the clients (or both), or a firewall 
> (either an external network firewall, or a kernel firewall on one of the 
> hosts involved) is breaking your planner.  I've experienced very similar 
> symptoms after version upgrades.
> 
> (And yes, I've seen my issues disappear when jobs are run manually, yet still 
> fail when run over night.  Manual tests don't trigger the firewall issues 
> because the windows I have open to the client and server keep the darn 
> firewall from timing out the dynamic NAT rules.)
> 
> 
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> On Behalf Of Seann
>> Sent: Monday, August 17, 2015 02:34 PM
>> To: [email protected]
>> Subject: Upgrade woes and eternal hanging of dumps
>> 
>> All,
>> 
>> I am looking for a little direction on a problem that has cropped up for
>> me recently.
>> 
>> I have a backup set, that was created using Amanda 2.5 (default on CentOS
>> 5.11) and ran very well, both manually and from the cron job I had set for
>> it.
>> It has approximately 13 hosts to backup, from as simple as backing up a
>> single directory, to backing up the full system, and it ran with no issues
>> on CentOS 5.11.
>> The basic setup is using hard drives as the backup media, compressing the
>> backups to save space, using server compression; these also use GNU-TAR as
>> the archive format.
>> 
>> Fast forward to today, I have the system upgraded to CentOS 7, which also
>> upgraded to Amanda 3.3.3-13, and after some configuration file re-writing,
>> I got most of the backups to work.
>> Two systems, one backing up the web directory, the other backing up the
>> full disk, fail constantly.
>> When these two disklist statements are removed, the backup runs, and takes
>> approximately 2 and a half hours to run on the 8 other hosts (the other 3
>> hosts are currently offline and not in scope).
>> 
>> When the CRON job kicks off at midnight, it runs for over 12 hours (I have
>> the etimeout set to one day, as the planner kept dying saying it timed
>> out).
>> This is the same basic error that I get with the two above mentioned
>> failing backups.
>> 
>> When the hung backup job is running, I see the dumpers and main dump
>> process running on the backup server, but nothing in the logs outside of
>> the "We started the backup job" type of log messages.
>> On all of the hosts, I don't see the client running, nor do I see any TAR
>> processes running.
>> There are also no clues in the logs on which host the server is waiting
>> on, and checking all the hosts in scope shows they are all in the same
>> state, that is they have sent the estimate to the backup server and are
>> waiting on the next phase.
>> 
>> 
>> Any help on this would be appreciated, and also is there a better way of
>> making sense of the logs (such as using something like Graylog2?), and on
>> reporting for issues with Amanda 3.3?
>> 
>> 
>> Regards,
>> Seann
>
