Re: Upgrade woes and eternal hanging of dumps

Seann Sun, 27 Sep 2015 01:25:32 -0700

On 9/25/2015 1:34 PM, Seann wrote:

On 9/25/2015 10:51 AM, Debra S Baddorf wrote:
see comment embedded way down below
On Sep 24, 2015, at 6:35 PM, Seann <nombran...@tsukinokage.net> wrote:
On 9/21/2015 11:36 AM, Debra S Baddorf wrote:
YES! I agree with the first and third of these tidbits. I just couldn’t remember them. I’ve had issues with both of them, including the tricky firewall timeout part in Idea Three.
Here’s hoping you have a network person who can add some skills or ideas at that level. Or, just don’t do client estimates, as in the first suggested fix.
I think we had to allow trusted clients to initiate their OWN connections back to the server (via a firewall rule), so that they could still talk to the server even after that server-created conversation had timed out. That might count as fix #3, but it takes firewall skills. That might be a slightly different problem/situation (it sounds a little different) but I think it’s in this same category, somewhere. Network savvy people, can you translate my “generic English” description into what we actually did?
Deb Baddorf,  Fermilab
On Sep 21, 2015, at 10:25 AM, Joi L. Ellis <jlel...@pavlovmedia.com> wrote:
I've just read through the long thread prompted by this particular post. I'd like to offer a few points I didn't see mentioned before...
Idea one: You upgraded from 2.5 to 3.3. 2.5 amdump only spoke UDP with a 'bsd' auth protocol, so that was the only option available. Thus, inetd.conf didn't specify an -auth=bsd parameter. 3.3 defaults to -auth=bsdtcp if you don't provide it. Does your new configuration specify that those clients must be reached with -auth=bsd from the new server, rather than the server's new default of -auth=bsdtcp?
Idea two: If any of the involved machines are running iptables or ufw firewalls, verify the new configuration is still loading the correct kernel modules. At one point the /etc/default/ufw.conf file named kernel modules incorrectly after an upgrade, and/or the nf_conntrack_amanda module itself went missing. (Some kernels change the name of this module, usually it's the first two characters.) The symptom here is that amcheck thinks everything is fine, yet the actual amdump process fails because the UDP control conversation between the server and the client is allowed, but the TCP data stream amdump uses with -auth=bsdtcp is blocked.
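One way to check the kernel-module point in Idea two is to look at the list of loaded modules. The Python sketch below is only an illustration under stated assumptions: the function name is made up for this example, and it assumes the helper's name ends in "conntrack_amanda" (nf_conntrack_amanda here, with a different prefix on kernels that rename it).

```python
def find_amanda_conntrack_modules(proc_modules_text):
    """Return loaded module names that look like the Amanda conntrack helper.

    proc_modules_text is the content of /proc/modules, where the first
    whitespace-separated field on each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Match on the suffix only, since the prefix varies between kernels.
        if fields and fields[0].endswith("conntrack_amanda"):
            names.append(fields[0])
    return names


# Example use on a Linux host (hypothetical, not run here):
#   with open("/proc/modules") as handle:
#       print(find_amanda_conntrack_modules(handle.read()))
```

An empty result on a host that runs iptables or ufw would be a hint that the helper is not loaded, which matches the symptom described above.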
Idea Three: I run an Amanda 3.3.3 server, and I have experienced a similar problem to your own. I've tried posting about it here in the past and got null response, so I gave up asking for help and figured out my own workarounds.
My amanda server is behind a corporate firewall, and some of the clients are in the DMZ, outside the firewall... and they are running amanda 2.5 due to the age of the client hosts. I've had repeated issues with the corporate firewall interfering with the planner.
The issue I see is that the amanda server planner fires off a UDP "connection" to the client, asking the client to provide estimates. The client does so... BUT. That blasted firewall has created a dynamic NAT rule that will allow the client to send back its UDP response. IF the client's response doesn't appear before the NAT rule expires, the planner falls into a permanent wait state, waiting for a UDP response that will never arrive because the firewall has blocked it. The client has no idea it failed, and its logs look entirely normal.
If you dig into the server's logs, you will probably find TIMEOUT errors in the logs from the planner. I don't have any recent logs that illustrate this error, so I can't quote an example.
I worked around this in two ways (it varies with the client situation):

  *) tell amanda to not use the client to create the estimate at all
  *) adjust the NAT timeout rules on the firewall to extend the timeout. As I recall, it was initially set to 120 seconds. We moved it up to 300 seconds at one point, but then began to experience issues with the firewall filling memory tables because rules weren't timing out fast enough.
As I see it, the planner makes the (unsafe) assumption that IF its initial request-an-estimate packets traveled properly, the response will always do so. If there is a firewall involved, the response might get lost, yet the planner will sit there forever, twiddling its thumbs and not backing up anything, until it receives the missing estimates packet back from the client.
To summarize, I suspect that the move from 2.5's UDP-only communication style to 3.3's default TCP-only style has broken something in your environment that you've overlooked. Either the server, the clients (or both) or a firewall (either an external network firewall, or a kernel firewall on one of the hosts involved) are breaking your planner. I've experienced very similar symptoms after version upgrades.
(And yes, I've seen my issues disappear when jobs are run manually, yet still fail when run overnight. Manual tests don't trigger the firewall issues because the windows I have open to the client and server keep the darn firewall from timing out the dynamic NAT rules.)
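The failure mode described in Idea Three (a UDP request goes out, the reply is dropped, and the requester waits indefinitely) can be illustrated with a small Python sketch. This is not Amanda's planner code. The function name, the timeout and the retry count are all assumptions chosen for the example, and it only shows the general technique of bounding the wait for a datagram reply.

```python
import socket


def request_with_deadline(sock, payload, address, timeout_s, attempts):
    """Send a datagram and wait a bounded time for the reply.

    Returns the reply bytes, or None if every attempt timed out. Without
    the timeout, recvfrom() would block forever when a firewall silently
    drops the response.
    """
    sock.settimeout(timeout_s)
    for _ in range(attempts):
        sock.sendto(payload, address)
        try:
            reply, _peer = sock.recvfrom(65535)
            return reply
        except socket.timeout:
            # No reply inside the window: resend, or give up after the loop.
            continue
    return None
```

A caller that gets None back can log which host never answered, which is exactly the clue that is missing when the wait is unbounded.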
-----Original Message-----
From: owner-amanda-us...@amanda.org [mailto:owner-amanda-us...@amanda.org]
On Behalf Of Seann
Sent: Monday, August 17, 2015 02:34 PM
To: amanda-users@amanda.org
Subject: Upgrade woes and eternal hanging of dumps

All,
I am looking for a little direction on a problem that has cropped up for me recently.
I have a backup set that was created using Amanda 2.5 (default on CentOS 5.11) and ran very well, both manually and from the cron job I had set for it.
It has approximately 13 hosts to back up, from as simple as backing up a single directory to backing up the full system, and it ran with no issues on CentOS 5.11.
The basic setup uses hard drives as the backup media, compresses the backups on the server to save space, and uses GNU tar as the archive format.
Fast forward to today: I have the system upgraded to CentOS 7, which also upgraded Amanda to 3.3.3-13, and after some configuration file rewriting, I got most of the backups to work.
Two systems, one backing up the web directory, the other backing up the full disk, fail constantly.
When these two disklist statements are removed, the backup runs, and takes approximately two and a half hours to run on the 8 other hosts (the other 3 hosts are currently offline and not in scope).
When the cron job kicks off at midnight, it runs for over 12 hours (I have the etimeout set to one day, as the planner kept dying, saying it timed out).
This is the same basic error that I get with the two above-mentioned failing backups.

When the hung backup job is running, I see the dumpers and main dump process running on the backup server, but nothing in the logs outside of the "We started the backup job" type of log messages.
On all of the hosts, I don't see the client running, nor do I see any tar processes running.
There are also no clues in the logs as to which host the server is waiting on, and checking all the hosts in scope shows they are all in the same state, that is, they have sent the estimate to the backup server and are waiting on the next phase.
Any help on this would be appreciated. Also, is there a better way of making sense of the logs (such as using something like Graylog2?), and of reporting on issues with Amanda 3.3?


Regards,
Seann
It has been a while since I updated the few folk who have been following this thread.
I haven't had much time to tinker with this, as it is on my home network, and my day job got in the way.
To answer the questions from Joi:
1. Yes, I swapped the auth in the global section to use bsd for all of my hosts, unless overridden in the disk definition.
2. 70% of the clients don't run a firewall, and those that do allow the UDP port for Amanda (10080/udp) by default.
3. All my clients are behind the same firewall as the Amanda server, so no firewalls other than the host firewalls are in play here.
Reading a few of the other threads on the list, there was the mention of xinetd being deprecated, and since my setup wasn't working in the first place, I flipped everything over to ssh.
I used puppet to push SSH keys to the proper users on the client machines, went through and ssh'd to each host by FQDN and accepted the host keys, and ensured the users worked on the client machines.
When running amcheck on them now, what I get is this:
WARNING: server1.tsukinokage.net: selfcheck request failed: EOF on read from server1.tsukinokage.net.
Now that you have switched to SSH, did you switch the disklist to request a different type of dump from the client? I just got bit by that yesterday, and “selfcheck failed” is what shows up. If the client is now set up to do SSH [kerberos, in my case] and the server is still asking for the old kind of dump EVEN ONCE, the initial amcheck won’t even pass the selfcheck. As above.
That message doesn’t remind you to change the dump type, but WHEN you do change the dumptype to match what the client is planning to do, it works much better (in fact, totally fixed, for me).

Just a thought.
Deb
To save some reading space, I have the log output near the end of the thread.
Searching for anything on this has been nearly impossible, namely as to what the hell the 'EOF on read' is related to, and troubleshooting steps.
Ultimately I am stuck in a chicken-and-egg situation, where I need to back up directories on some servers to the backup server prior to upgrading their operating systems from CentOS 5 to CentOS 7 and redeploying their configurations, but I can't right now because the 2.5 clients aren't working.
This weekend, now that I am not on-call, I might tinker with compiling a 3.3.x client from scratch on some of those servers, to test whether Amanda 3.3.3 really is, though nobody has said so outright, not backwards compatible.
The client's log shows:
amandad: debug 1 pid 12389 ruid 33 euid 33: start at Thu Sep 24 18:08:33 2015
amandad: version 2.5.0p2
amandad: build: VERSION="Amanda-2.5.0p2"
amandad:        BUILT_DATE="Thu Feb 23 08:03:44 EST 2012"
amandad:        BUILT_MACH="Linux builder10.centos.org 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux"
amandad:        CC="gcc"
amandad:        CONFIGURE_COMMAND="'./configure' '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/lib64/amanda' '--localstatedir=/var/lib' '--sharedstatedir=/usr/com' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--enable-shared' '--disable-static' '--disable-dependency-tracking' '--with-index-server=amandahost' '--with-tape-server=amandahost' '--with-config=DailySet1' '--with-gnutar-listdir=/var/lib/amanda/gnutar-lists' '--with-smbclient=/usr/bin/smbclient' '--with-dumperdir=/usr/lib64/amanda/dumperdir' '--with-amandahosts' '--with-user=amanda' '--with-group=disk' '--with-tmpdir=/var/log/amanda' '--with-gnutar=/bin/tar' '--with-ssh-security'"
amandad: paths: bindir="/usr/bin" sbindir="/usr/sbin"
amandad:        libexecdir="/usr/lib64/amanda" mandir="/usr/share/man"
amandad:        AMANDA_TMPDIR="/var/log/amanda"
amandad:        AMANDA_DBGDIR="/var/log/amanda" CONFIG_DIR="/etc/amanda"
amandad:        DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/r"
amandad:        DUMP="/sbin/dump" RESTORE="/sbin/restore" VDUMP=UNDEF
amandad:        VRESTORE=UNDEF XFSDUMP=UNDEF XFSRESTORE=UNDEF VXDUMP=UNDEF
amandad:        VXRESTORE=UNDEF SAMBA_CLIENT="/usr/bin/smbclient"
amandad:        GNUTAR="/bin/tar" COMPRESS_PATH="/bin/gzip"
amandad:        UNCOMPRESS_PATH="/bin/gzip" LPRCMD="/usr/bin/lpr"
amandad:        MAILER="/usr/bin/Mail"
amandad:        listed_incr_dir="/var/lib/amanda/gnutar-lists"
amandad: defs:  DEFAULT_SERVER="amandahost" DEFAULT_CONFIG="DailySet1"
amandad:        DEFAULT_TAPE_SERVER="amandahost"
amandad:        DEFAULT_TAPE_DEVICE="null:" HAVE_MMAP HAVE_SYSVSHM
amandad:        LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE
amandad:        AMANDA_DEBUG_DAYS=4 BSD_SECURITY RSH_SECURITY USE_AMANDAHOSTS
amandad:        CLIENT_LOGIN="amanda" FORCE_USERID HAVE_GZIP
amandad:        COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
amandad:        COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
amandad: time 0.000: accept recv REQ pkt:
<<<<<
SERVICE noop
OPTIONS features=ffffffff9efefbffffffffff1f;
amandad: time 0.000: creating new service: /usr/lib64/amanda/noop
OPTIONS features=ffffffff9efefbffffffffff1f;

amandad: time 0.000: sending ACK pkt:
<<<<<
amandad: time 0.001: sending REP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffff07;
amandad: time 0.040: received ACK pkt:
<<<<<
amandad: time 30.038: pid 12389 finish time Thu Sep 24 18:09:03 2015

The server will check itself, and come back with no problems.
The amandad log file on the server, for the client check to localhost, shows:
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: pid 27610 ruid 33 euid 33 version 3.3.3: start at Thu Sep 24 18:14:58 2015
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: security_getdriver(name=ssh) returns 0x7f00865ac660
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: version 3.3.3
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: build: VERSION="Amanda-3.3.3"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: BUILT_DATE="Tue Jun 10 01:33:40 UTC 2014" BUILT_MACH=""
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: BUILT_REV="5099" BUILT_BRANCH="community_3_3_3" CC="gcc"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: paths: bindir="/usr/bin" sbindir="/usr/sbin"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: libexecdir="/usr/lib64" amlibexecdir="/usr/lib64/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: mandir="/usr/share/man" AMANDA_TMPDIR="/var/log/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: AMANDA_DBGDIR="/var/log/amanda" CONFIG_DIR="/etc/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/" DUMP=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: RESTORE=UNDEF VDUMP=UNDEF VRESTORE=UNDEF XFSDUMP=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: SAMBA_CLIENT="/usr/bin/smbclient" GNUTAR="/bin/tar"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: COMPRESS_PATH="/usr/bin/gzip"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: UNCOMPRESS_PATH="/usr/bin/gzip" LPRCMD=UNDEF MAILER=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: listed_incr_dir="/var/lib/amanda/gnutar-lists"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: defs: DEFAULT_SERVER="amandahost" DEFAULT_CONFIG="DailySet1"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: DEFAULT_TAPE_SERVER="amandahost" DEFAULT_TAPE_DEVICE=""
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: NEED_STRSTR AMFLOCK_POSIX AMFLOCK_FLOCK AMFLOCK_LOCKF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: AMFLOCK_LNLOCK SETPGRP_VOID AMANDA_DEBUG_DAYS=4 BSD_SECURITY
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: KRB5_SECURITY RSH_SECURITY USE_AMANDAHOSTS
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: CLIENT_LOGIN="amandabackup" CHECK_USERID HAVE_GZIP
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: parsing 192.168.10.19
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: security_handleinit(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: security_streaminit(stream=0x7f008761e0e0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: authenticated peer name is 'amanda.tsukinokage.net'
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: accept recv REQ pkt:
<<<<<
SERVICE noop
OPTIONS features=ffffffff9efefbffffffffff1f;
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: creating new service: noop
OPTIONS features=ffffffff9efefbffffffffff1f;

Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: sending ACK pkt:
<<<<<
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: tcpm_send_token: data is still flowing
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending REP pkt:
<<<<<
OPTIONS features=ffffffff9efefbffffffffff1f;
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: received ACK pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_close(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_stream_close(0x7f008761e0e0)
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_handleinit(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_streaminit(stream=0x7f00876263e0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: authenticated peer name is 'amanda.tsukinokage.net'
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: accept recv REQ pkt:
<<<<<
SERVICE selfcheck
OPTIONS features=ffffffff9efefbffffffffff1f;maxdumps=1;hostname=amanda.tsukinokage.net;config=Tsukinokage-daily;
<dle>
  <program>GNUTAR</program>
  <estimate>CLIENT </estimate>
  <disk>/var/database</disk>
  <auth>ssh</auth>
  <compress>SERVER-BEST</compress>
  <record>YES</record>
  <index>YES</index>
  <datapath>AMANDA</datapath>
</dle>
<dle>
  <program>GNUTAR</program>
  <estimate>CLIENT </estimate>
  <disk>/etc</disk>
  <auth>ssh</auth>
  <compress>SERVER-BEST</compress>
  <record>YES</record>
  <index>YES</index>
  <datapath>AMANDA</datapath>
</dle>
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: creating new service: selfcheck
OPTIONS features=ffffffff9efefbffffffffff1f;maxdumps=1;hostname=amanda.tsukinokage.net;config=Tsukinokage-daily;
<dle>
  <program>GNUTAR</program>
  <estimate>CLIENT </estimate>
  <disk>/var/database</disk>
  <auth>ssh</auth>
  <compress>SERVER-BEST</compress>
  <record>YES</record>
  <index>YES</index>
  <datapath>AMANDA</datapath>
</dle>
<dle>
  <program>GNUTAR</program>
  <estimate>CLIENT </estimate>
  <disk>/etc</disk>
  <auth>ssh</auth>
  <compress>SERVER-BEST</compress>
  <record>YES</record>
  <index>YES</index>
  <datapath>AMANDA</datapath>
</dle>

Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending ACK pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending REP pkt:
<<<<<
OK version 3.3.3
OK distro RPM
OK platform CentOS Linux release 7.1.1503 (Core)
OPTIONS features=ffffffff9efefbffffffffff1f;
OK /var/database
OK /var/database
OK /var/database
OK /etc
OK /etc
OK /etc
OK /usr/lib64/amanda/runtar executable
OK /bin/tar executable
OK /var/lib/amanda/gnutar-lists/. read/writable
OK /dev/null read/writable
OK /var/log/amanda has more than 64KB available.
OK /var/log/amanda has more than 64KB available.
OK /etc has more than 64KB available.
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: received ACK pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_close(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: security_stream_close(0x7f00876263e0)
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: pid 27610 finish time Thu Sep 24 18:14:59 2015
--

Regards,
Seann
Deb,
I have updated the disklist and dumptypes to match SSH. Tweaking that didn't fix anything for me.
I spent a few hours removing all Amanda clients from the 'broken' servers and manually building a 3.3.3 client on each of them. After each host was built, I ran amcheck against the Daily configuration for that specific host, and cleaned up any ownership or path issues.
Currently I am testing a full site dump, now that each client is updated. If this works, I will conclude that the 2.5 client is not stable with the 3.3.3 server. The mix is supported and you can get away with running it, but it will yield mixed results, and it is not something I would consider production stable anymore.
Since it is a full major version different, I am not surprised by this, but it is rather troublesome to be stuck migrating servers and not see those notes in any of the release notes.
While the Wiki model for documentation is nice, having a fully searchable Knowledge Base would be handy. I don't spend the money on a commercial version, so I am not sure if it is offered there.
--
Regards,
Seann

Just a quick final update on this thread: after upgrading all the clients to the 3.3.3 client, the backups have worked without issue for the past 2 runs, plus a manual run.

With some thought to the path issue that I had seen noted in one of the replies, I slapped together a quick and dirty bash script to handle the check and the dump, both of which I run from cron. This resolved the cron issues that I was having, and allowed me to run the commands from a controlled path, set locally in the script.
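The script itself was not posted, so the following is only a rough Python sketch of the same idea, not the actual bash script. It assumes amcheck and amdump are found via /usr/sbin (the sbindir shown in the logs above), that the dump should be skipped when the check fails, and that a configuration name such as Tsukinokage-daily is passed in. The names CONTROLLED_PATH and run_backup are made up for the example.

```python
import os
import subprocess

# Assumed search path; adjust to wherever amcheck and amdump are installed.
CONTROLLED_PATH = "/usr/sbin:/usr/bin:/sbin:/bin"


def run_backup(config, runner=subprocess.run):
    """Run amcheck, then amdump, with PATH fixed inside the script.

    Returns the exit status of the first command that fails, or of amdump
    if the check passed. cron's own minimal PATH is never relied upon.
    """
    env = dict(os.environ, PATH=CONTROLLED_PATH)
    check = runner(["amcheck", config], env=env)
    if check.returncode != 0:
        # Assumption: a failed check means the dump is not attempted.
        return check.returncode
    return runner(["amdump", config], env=env).returncode


# Example use from a cron-driven script (hypothetical):
#   raise SystemExit(run_backup("Tsukinokage-daily"))
```

The runner parameter exists only so the logic can be exercised without Amanda installed; a real wrapper could call subprocess.run directly.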

While having a 3.3.x server with 2.x clients is technically feasible, there are some major challenges to running them, regardless of whether you are running them from xinetd or from ssh. My configuration for xinetd had all the clients set to use bsd auth, and the 3.3.x clients were set to use bsdtcp, but the checks and dumps would only work sporadically, even to the 3.x clients, which was not optimal.

As everything is up and running now, and looks good in terms of backing up the 11 hosts I have configured, I will most likely expand that out a little more to the rest of the environment (I have a max of about 20 or so servers to back up) and see how that runs.


Thank you to everyone who assisted me on this.

--

Regards,
Seann
