On 9/21/2015 11:36 AM, Debra S Baddorf wrote:
YES! I agree with the first and third of these tidbits. I just
couldn’t remember them. I’ve had issues with both of them.
Including the tricky firewall timeout part, in Idea Three.
Here’s hoping you have a network person who can add some skills or
ideas at that level. Or, just don’t do client estimates, as in
the first suggested fix.
I think we had to allow trusted clients to initiate their OWN
connections back to the server (via a firewall rule), so that they
could still talk to the server even after that server-created
conversation had timed out. That might count as fix #3, but it
takes firewall skills. That might be a slightly different
problem/situation (it sounds a little different) but I think it’s
in this same category, somewhere. Network savvy people, can you
translate my “generic English” description into what we actually did?
Deb Baddorf, Fermilab
On Sep 21, 2015, at 10:25 AM, Joi L. Ellis
<jlel...@pavlovmedia.com> wrote:
I've just read through the long thread prompted by this particular
post. I'd like to offer a few points I didn't see mentioned
before...
Idea one: You upgraded from 2.5 to 3.3. 2.5 amdump only spoke UDP
with a 'bsd' auth protocol, so that was the only action
available. Thus, inetd.conf didn't specify an -auth=bsd
parameter. 3.3 defaults to -auth=bsdtcp if you don't provide it.
Does your new configuration specify that those clients must be
reached with -auth=bsd from the new server, rather than the
server's new default of -auth=bsdctp?
Idea two: If any of the involved machines are running iptables or
ufw firewalls, verify the new configuration is still loading the
correct kernel modules. At one point the /etc/default/ufw.conf
file named kernel modules incorrectly after an upgrade, and/or the
nf_conntrack_amanda module itself went missing. (Some kernels
change the name of this module, usually it's the first two
characters.) The symptom here is that amcheck thinks everything
is fine, yet the actual amdump process fails because the UDP
control conversation between the server and the client is allowed,
but the TCP data stream amdump uses with -auth=bsdtcp is blocked.
Idea Three: I run an Amanda 3.3.3 server, and I have experienced a
similar problem to your own. I've tried posting about it here in
the past and got null response, so I gave up asking for help and
figured out my own workarounds.
My amanda server is behind a corporate firewall, and some of the
clients are in the DMZ, outside the firewall... and they are
running amanda 2.5 due to the age of the client hosts. I've had
repeated issues with the corporate firewall interfering with the
planner.
The issue I see is that the amanda server planner fires off a UDP
"connection" to the client, asking the client to provide
estimates. The client does so... BUT. That blasted firewall has
created a dynamic NAT rule that will allow the client to send back
its UDP response. IF the client's response doesn't appear before
the NAT rule expires, the planner falls into a permanent wait
state, waiting for a UDP response that will never arrive because
the firewall has blocked it. The client has no idea it failed,
and its logs look entirely normal.
If you dig into the server's logs, you will probably find TIMEOUT
errors in the logs from the planner. I don't have any recent logs
that illustrate this error, so I can't quote an example.
I worked around this in two ways (varies with the client situation:)
*) tell amanda to not use the client to create the estimate at all
*) adjust the NAT timeout rules on the firewall to extend the
timeout. As I recall, it was initially set to 120 seconds. We
moved it up to 300 seconds at one point, but then began to
experience issues with the firewall filling memory tables because
rules weren't timing out fast enough.
As I see it, the planner makes the (unsafe) assumption that IF its
initial request-an-estimate packets traveled properly, the
response will always do so. If there is a firewall involved, the
response might get lost, yet the planner will sit there forever,
twiddling its thumbs and not backing up anything, until it
receives the missing estimates package back from the client.
To summarize, I suspect that the move from 2.5's UDP-only
communication style to 3.3's default TCP-only style has broken
something in your environment that you've overlooked. Either the
server, the clients (or both) or a firewall (either an external
network firewall, or a kernel firewall on one of the hosts
involved) are breaking your planner. I've experienced very
similar symptoms after version upgades.
(And yes, I've seen my issues disappear when jobs are run
manually, yet still fail when run over night. Manual tests don't
trigger the firewall issues because the windows I have open the to
the client and server keep the darn firewall from timing out the
dynamic NAT rules.)
-----Original Message-----
From: owner-amanda-us...@amanda.org
[mailto:owner-amanda-us...@amanda.org]
On Behalf Of Seann
Sent: Monday, August 17, 2015 02:34 PM
To: amanda-users@amanda.org
Subject: Upgrade woes and eternal hanging of dumps
All,
I am looking for a little direction on a problem that has cropped
up for
me recently.
I have a backup set, that was created using Amanda 2.5 (default
on CentOS
5.11) and ran very well, both manually and from the cron job I
had set for
it.
It has approximately 13 hosts to backup, from as simple as
backing up a
single directory, to backing up the full system, and it ran with
no issues
on CentOS 5.11.
The basic setup is using hard drives as the backup media,
compressing the
backups to save space, using server compression, these also use
GNU-TAR as
the archive format.
Fast forward to today, I have the system upgraded to CentOS 7,
which also
upgraded to Amanda 3.3.3-13, and after some configuration file
re-writing,
I got most of the backups to work.
Two systems, one backing up the web directory, the other backing
up the
full disk, fail constantly.
When these two disklist statements are removed, the backup runs,
and takes
approximately 2 and a half hours to run on the 8 other hosts (the
other 3
hosts are currently offline and not in scope).
When the CRON job kicks off at midnight, it runs for over 12
hours (I have
the etimeout set to one day, as the planner kept dying saying to
timed
out).
This is the same basic error that I get with the two above mentioned
failing backups.
When the hung backup job is running, I see the dumpers and main dump
process running on the backup server, but nothing in the logs
outside of
the "We started the backup job" type of log messages.
On all of the hosts, I don't see the client running, nor to I see
any TAR
processes running.
There are also no clues in the logs on which host the server is
waiting
on, and checking all the hosts in scope show they are all in the
same
state, that is they have sent the estimate to the backup server
and are
waiting on the next phase.
Any help on this would be appreciated, and also is there a better
way of
making sense of the logs (such as using something like
Graylog2?), and on
reporting for issues with Amanda 3.3?
Regards,
Seann
It has been a while since I updated the few folk who have been
following this thread.
I haven't had much time to tinker with this, as it is on my home
network, and my day job got in the way.
To answer the questions from Joi:
1. Yes, I swapped the auth in the global section to use bsd for all
of my hosts, unless overridden in the disk definition.
2. 70% of the clients don't run a firewall, and those that do, allow
the UDP ports for Amanda 10080/udp by default.
3. All my clients are behind the same firewall as the Amanda server,
so no firewall outside of the host firewalls are in play with this.
Reading a few of the other threads on the list, there was the
mention of xinetd being depreciated, and since my setup wasn't
working in the first place, I flipped everything over to ssh.
I used puppet to push SSH keys to the proper user's on the client
machines, went through and ssh'd to each host by FQDN and accepted
the host keys, and ensured the users worked on the client machines.
When running amcheck on them now, what I get is this:
WARNING: server1.tsukinokage.net: selfcheck request failed: EOF on
read from server1.tsukinokage.net.
To save some reading space, I have the log output near the end of
the thread.
Searching for anything on this has been nearly impossible, namely as
to what the hel the 'EOF on read" is related to, and troubleshooting
steps
Ultimately I am stuck in a chicken and the egg situation, where I
need to back up directories on some servers, to the backup server,
prior to upgrading their operating systems from CentOs 5, to CentOs
7, and redeploying their configurations, but I can't right now
because the 2.5 clients aren't working.
This weekend, now that I am not on-call, I might tinker with
compiling, from scratch, a 3.3x client on some of those servers, in
order to test if it really is that, while not said outright, Amanda
3.3.3 is not backwards compatible.
The clients log shows:
amandad: debug 1 pid 12389 ruid 33 euid 33: start at Thu Sep 24
18:08:33 2015
amandad: version 2.5.0p2
amandad: build: VERSION="Amanda-2.5.0p2"
amandad: BUILT_DATE="Thu Feb 23 08:03:44 EST 2012"
amandad: BUILT_MACH="Linux builder10.centos.org 2.6.18-53.el5
#1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux"
amandad: CC="gcc"
amandad: CONFIGURE_COMMAND="'./configure'
'--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu'
'--target=x86_64-redhat-linux-gnu' '--program-prefix='
'--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin'
'--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share'
'--includedir=/usr/include' '--libdir=/usr/lib64'
'--libexecdir=/usr/lib64/amanda' '--localstatedir=/var/lib'
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--enable-shared' '--disable-static'
'--disable-dependency-tracking' '--with-index-server=amandahost'
'--with-tape-server=amandahost' '--with-config=DailySet1'
'--with-gnutar-listdir=/var/lib/amanda/gnutar-lists'
'--with-smbclient=/usr/bin/smbclient'
'--with-dumperdir=/usr/lib64/amanda/dumperdir' '--with-amandahosts'
'--with-user=amanda' '--with-group=disk'
'--with-tmpdir=/var/log/amanda' '--with-gnutar=/bin/tar'
'--with-ssh-security'"
amandad: paths: bindir="/usr/bin" sbindir="/usr/sbin"
amandad: libexecdir="/usr/lib64/amanda" mandir="/usr/share/man"
amandad: AMANDA_TMPDIR="/var/log/amanda"
amandad: AMANDA_DBGDIR="/var/log/amanda"
CONFIG_DIR="/etc/amanda"
amandad: DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/r"
amandad: DUMP="/sbin/dump" RESTORE="/sbin/restore" VDUMP=UNDEF
amandad: VRESTORE=UNDEF XFSDUMP=UNDEF XFSRESTORE=UNDEF
VXDUMP=UNDEF
amandad: VXRESTORE=UNDEF SAMBA_CLIENT="/usr/bin/smbclient"
amandad: GNUTAR="/bin/tar" COMPRESS_PATH="/bin/gzip"
amandad: UNCOMPRESS_PATH="/bin/gzip" LPRCMD="/usr/bin/lpr"
amandad: MAILER="/usr/bin/Mail"
amandad: listed_incr_dir="/var/lib/amanda/gnutar-lists"
amandad: defs: DEFAULT_SERVER="amandahost" DEFAULT_CONFIG="DailySet1"
amandad: DEFAULT_TAPE_SERVER="amandahost"
amandad: DEFAULT_TAPE_DEVICE="null:" HAVE_MMAP HAVE_SYSVSHM
amandad: LOCKING=POSIX_FCNTL SETPGRP_VOID DEBUG_CODE
amandad: AMANDA_DEBUG_DAYS=4 BSD_SECURITY RSH_SECURITY
USE_AMANDAHOSTS
amandad: CLIENT_LOGIN="amanda" FORCE_USERID HAVE_GZIP
amandad: COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
amandad: COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
amandad: time 0.000: accept recv REQ pkt:
<<<<<
SERVICE noop
OPTIONS features=ffffffff9efefbffffffffff1f;
amandad: time 0.000: creating new service: /usr/lib64/amanda/noop
OPTIONS features=ffffffff9efefbffffffffff1f;
amandad: time 0.000: sending ACK pkt:
<<<<<
amandad: time 0.001: sending REP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffff07;
amandad: time 0.040: received ACK pkt:
<<<<<
amandad: time 30.038: pid 12389 finish time Thu Sep 24 18:09:03 2015
The server will check itself, and come back with no problems.
The amandad log file on the server, for the client check to
localhost shows:
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: pid 27610
ruid 33 euid 33 version 3.3.3: start at Thu Sep 24 18:14:58 2015
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
security_getdriver(name=ssh) returns 0x7f00865ac660
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: version 3.3.3
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: build:
VERSION="Amanda-3.3.3"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
BUILT_DATE="Tue Jun 10 01:33:40 UTC 2014" BUILT_MACH=""
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
BUILT_REV="5099" BUILT_BRANCH="community_3_3_3" CC="gcc"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: paths:
bindir="/usr/bin" sbindir="/usr/sbin"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
libexecdir="/usr/lib64" amlibexecdir="/usr/lib64/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
mandir="/usr/share/man" AMANDA_TMPDIR="/var/log/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
AMANDA_DBGDIR="/var/log/amanda" CONFIG_DIR="/etc/amanda"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
DEV_PREFIX="/dev/" RDEV_PREFIX="/dev/" DUMP=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: RESTORE=UNDEF
VDUMP=UNDEF VRESTORE=UNDEF XFSDUMP=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
SAMBA_CLIENT="/usr/bin/smbclient" GNUTAR="/bin/tar"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
COMPRESS_PATH="/usr/bin/gzip"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
UNCOMPRESS_PATH="/usr/bin/gzip" LPRCMD=UNDEF MAILER=UNDEF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
listed_incr_dir="/var/lib/amanda/gnutar-lists"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: defs:
DEFAULT_SERVER="amandahost" DEFAULT_CONFIG="DailySet1"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
DEFAULT_TAPE_SERVER="amandahost" DEFAULT_TAPE_DEVICE=""
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: NEED_STRSTR
AMFLOCK_POSIX AMFLOCK_FLOCK AMFLOCK_LOCKF
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
AMFLOCK_LNLOCK SETPGRP_VOID AMANDA_DEBUG_DAYS=4 BSD_SECURITY
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: KRB5_SECURITY
RSH_SECURITY USE_AMANDAHOSTS
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
CLIENT_LOGIN="amandabackup" CHECK_USERID HAVE_GZIP
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
COMPRESS_SUFFIX=".gz" COMPRESS_FAST_OPT="--fast"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
COMPRESS_BEST_OPT="--best" UNCOMPRESS_OPT="-dc"
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: parsing
192.168.10.19
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
security_handleinit(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
security_streaminit(stream=0x7f008761e0e0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: authenticated
peer name is 'amanda.tsukinokage.net'
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: accept recv
REQ pkt:
<<<<<
SERVICE noop
OPTIONS features=ffffffff9efefbffffffffff1f;
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: creating new
service: noop
OPTIONS features=ffffffff9efefbffffffffff1f;
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad: sending ACK pkt:
<<<<<
Thu Sep 24 18:14:58 2015: thd-0x7f0087613400: amandad:
tcpm_send_token: data is still flowing
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending REP pkt:
<<<<<
OPTIONS features=ffffffff9efefbffffffffff1f;
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: received ACK
pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_close(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_stream_close(0x7f008761e0e0)
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_handleinit(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_streaminit(stream=0x7f00876263e0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: authenticated
peer name is 'amanda.tsukinokage.net'
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: accept recv
REQ pkt:
<<<<<
SERVICE selfcheck
OPTIONS
features=ffffffff9efefbffffffffff1f;maxdumps=1;hostname=amanda.tsukinokage.net;config=Tsukinokage-daily;
<dle>
<program>GNUTAR</program>
<estimate>CLIENT </estimate>
<disk>/var/database</disk>
<auth>ssh</auth>
<compress>SERVER-BEST</compress>
<record>YES</record>
<index>YES</index>
<datapath>AMANDA</datapath>
</dle>
<dle>
<program>GNUTAR</program>
<estimate>CLIENT </estimate>
<disk>/etc</disk>
<auth>ssh</auth>
<compress>SERVER-BEST</compress>
<record>YES</record>
<index>YES</index>
<datapath>AMANDA</datapath>
</dle>
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: creating new
service: selfcheck
OPTIONS
features=ffffffff9efefbffffffffff1f;maxdumps=1;hostname=amanda.tsukinokage.net;config=Tsukinokage-daily;
<dle>
<program>GNUTAR</program>
<estimate>CLIENT </estimate>
<disk>/var/database</disk>
<auth>ssh</auth>
<compress>SERVER-BEST</compress>
<record>YES</record>
<index>YES</index>
<datapath>AMANDA</datapath>
</dle>
<dle>
<program>GNUTAR</program>
<estimate>CLIENT </estimate>
<disk>/etc</disk>
<auth>ssh</auth>
<compress>SERVER-BEST</compress>
<record>YES</record>
<index>YES</index>
<datapath>AMANDA</datapath>
</dle>
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending ACK pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: sending REP pkt:
<<<<<
OK version 3.3.3
OK distro RPM
OK platform CentOS Linux release 7.1.1503 (Core)
OPTIONS features=ffffffff9efefbffffffffff1f;
OK /var/database
OK /var/database
OK /var/database
OK /etc
OK /etc
OK /etc
OK /usr/lib64/amanda/runtar executable
OK /bin/tar executable
OK /var/lib/amanda/gnutar-lists/. read/writable
OK /dev/null read/writable
OK /var/log/amanda has more than 64KB available.
OK /var/log/amanda has more than 64KB available.
OK /etc has more than 64KB available.
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: received ACK
pkt:
<<<<<
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_close(handle=0x7f008761ded0, driver=0x7f00865ac660 (SSH))
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad:
security_stream_close(0x7f00876263e0)
Thu Sep 24 18:14:59 2015: thd-0x7f0087613400: amandad: pid 27610
finish time Thu Sep 24 18:14:59 2015
--
Regards,
Seann