Re: amflush run while amdump underway tries to flush .tmp files

2017-11-24 Thread Jose M Calhariz
I am interested in this fix.  There is a final patch?

Or is better to use the code from git?

Kind regards
JOse M Calhariz


On Tue, Nov 21, 2017 at 12:57:00PM -0500, Nathan Stratton Treadway wrote:
> On Tue, Nov 21, 2017 at 11:45:52 -0500, Jean-Louis Martineau wrote:
> > There is no patch attached on your email.
> 
> Ah, sorry, attached here.
> 
> > 
> > I do not see how Amanda::Util::setup_application call code in holding.c
> 
> To be precise what I actually saw was activity involving the "pid" files
> in the holding directory showing up in an strace log.  I am not sure how
> it got called, but assume it was coming from holding.c because it had
> the arguments for the kill syscall in the correct order (and thus I
> believe the activity couldn't have been from the then-in-place
> Holding.pm code).  I will investigate further and let you know what I
> find.  
> 
> (In any case, I am sure that the pid read from the holding-directory pid
> file during the get_all_datestamps() call was the parent pid of the
> current process [and of course the file had not contained that value
> before amflush was invoked].)
> 
> 
>   Nathan
> 
> 
> Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
> Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
>  GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
>  Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239

> --- Holding.pm.orig   2017-11-20 16:51:32.0 -0500
> +++ Holding.pm2017-11-21 02:10:47.992749021 -0500
> @@ -25,6 +25,7 @@
>  use File::stat;
>  use IO::Dir;
>  use POSIX qw( :fcntl_h );
> +use Errno;
>  use Math::BigInt;
>  use strict;
>  use warnings;
> @@ -198,6 +199,36 @@
>  return 1;
>  }
>  
> +# check to see if the given pid identifies a process that could
> +# validily be holding the Amanda directory lock (thus preventing 
> +# us from taking it).  (See also holding.c:can_take_holding() for
> +# related logic..)
> +sub _valid_locking_pid {
> +my $pid_to_check = shift;
> +
> +if ( $pid_to_check == $$ || $pid_to_check == getppid() ) {
> +# this process or the parent already hold the lock, so we
> +# can go ahead and use the directory.
> +return 0;
> +}
> +
> +if (kill(0, $pid_to_check) == 0 ) {
> +if ($! == Errno::EPERM) {
> + # the process exists  (but we don't have permission to kill it) 
> + return 1;
> +}
> +else {
> + # (if we reach here, presumably the errno was ESRCH)
> + # the process does not exist; we can use the directory
> + return 0;
> +} 
> +}
> +else {
> + # kill succeeded, so process does exist
> + return 1;
> +}
> +}
> +
>  sub _walk {
>  my ($file_fn, $verbose, $take_pid) = @_;
>  
> @@ -233,8 +264,10 @@
>   if (open(my $pidh,  $pidfn)) {
>   my $pid = <$pidh>;
>   close($pidh);
> - if (kill($pid, 0) == 0) {
> - # pid is alive, skip this directory
> +
> + if ( _valid_locking_pid($pid) ) {
> + # $pid seems valid, so we skip over this directory
> + print $verbose "skipping directory '$dirfn' due to lock by 
> pid $pid. \n" if $verbose;
>   next;
>   }
>   }


-- 
--

O dinheiro é simplesmente aquilo que o Estado, de tempos em tempos, declara ser 
um bom instrumento legal para saldar contratos em dinheiro

--John Maynard Keynes


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-22 Thread Jean-Louis Martineau
Nathan,

I committed the attached patch, It fix all the issues I saw.

Can you try to use the SVN or GIT repository? they have other fixes.

SVN:

 1. svn checkout svn://svn.code.sf.net/p/amanda/code/amanda/branches/3_5
amanda-3-5
 2. cd amanda-3-5
 3. ./autogen
 4. ./configure 
 5. make

GIT:

 1. git init git-amanda-3.5
 2. cd git-amanda-3.5
 3. git remote add zmanda https://github.com/zmanda/amanda.git
 4. git fetch zmanda
 5. git branch --track 3_5 zmanda/3_5
 6. git checkout 3_5
 7. ./autogen
 8. ./configure ...
 9. make

Jean-Louis

On 21/11/17 08:13 PM, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 15:09:25 -0500, Jean-Louis Martineau wrote:
> > perl automatically close it when it come out of scope.
> > The close should before the 'if (kill' line.
>
> Okay, I moved the close() line up to that spot as I applied the patch.
>
> I can confirm that with your patch applied and my followup patch
> applied, if I kick off amflush while amdump is running in the
> background, it (amflush) prints out
> Could not find any Amanda directories to flush.
> and exits without attempting to flush anything but if I try again
> after amdump finished, I'm able to complete the flush successfully.
>
>
> A few followup points that came up during my testing:
>
> * when amflush completes now, the pid file is left out there in the
> holding directory, thus preventing the directory from getting deleted:
>
> =
> # ls -lR /amanda/TestBackup-holding/*
> /amanda/TestBackup-holding/20171120172656:
> total 4
> -rw--- 1 backup backup 5 Nov 21 18:50 pid
>
> /amanda/TestBackup-holding/20171121145009:
> total 4
> -rw--- 1 backup backup 5 Nov 21 18:50 pid
> =
>
> Off hand I don't see code in amflush or Amflush.pm 
>  
> trying to clean up
> the holding disk directories so I don't know if these "zombie" pid
> files are related to the third issue (i.e. holding_cleanup_dir()),
> but...
>
> * ...as a more general question I was wondering if it made sense
> for programs that take a lock on a holding directory to have a way to
> explicitly release that lock when they are done? (Probably this
> would be by deleting the pid file, thought it could instead be
> truncating it to zero length, or something.)
>
> The main reason I was wondering about this is that with the current
> behavior of leaving a particular pid in the pid file forever, it
> seems like there's a good chance of false positives when the lock is
> checked by later jobs (i.e. some other, unrelated process using the
> same pid when the lock-check happens). If there were some way for a
> process to release the lock, false positives could only happen after a
> process crashed.
>
> (I don't know how often these false positives would happen in
> practice, but I can imagine it would cause a headache if the pid
> locking a directory containing dumps needing to be flushed to vault
> storage ended up getting reused by some long-runing process...)
>
> * I noticed that holding.c:holding_cleanup_dir() doesn't include parent
> pid in the pid-file check. I haven't tracked back to figure out which
> applications call the holding_cleanup* family of functions so I don't
> know if that will cause actual problems, but I was wondering if that
> function should instead just call call_take_holding() instead, so the
> logic is all found in one place?
>
>
> Nathan
>
>
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail
diff --git a/installcheck/Amanda_Holding.pl b/installcheck/Amanda_Holding.pl
index e8f206c..974ba95 100644
--- a/installcheck/Amanda_Holding.pl
+++ b/installcheck/Amanda_Holding.pl
@@ -127,7 +127,7 @@ is_deeply([ sort(+Amanda::Holding::disks()) ],
 [ sort($holding1, $holding2) ],
 "all active holding disks, but not inactive (defined but not used) disks");
 
-is_deeply([ sort(+Amanda::Holding::files()) ],
+is_deeply([ sort(+Amanda::Holding::files(0)) ],
 [ sort(
 	"$holding1/2007030300/videoserver._video_a",
 	"$holding1/20070306123456/videoserver._video_a",
@@ -163,17 +163,17 @@ is_deeply([ Amanda::Holding::get_all_datestamps() ],
 	  [ sort("2007030300", "20070306123456") ],
 	  "get_all_datestamps");
 
-is_deeply([ sort(+Amanda::Holding::get_files_for_flush("023985")) ],
+is_deeply([ sort(+Amanda::Holding::get_files_for_flush(0, "023985")) ],
 	  [ sort() ],
 	  

Re: amflush run while amdump underway tries to flush .tmp files

2017-11-22 Thread Jean-Louis Martineau
Nathan,

search_holding_disk should not take the pid file.

Jean-Louis

On 21/11/17 06:49 PM, Nathan Stratton Treadway wrote:
> On Tue, Nov 21, 2017 at 12:57:00 -0500, Nathan Stratton Treadway wrote:
> > > I do not see how Amanda::Util::setup_application call code in 
> holding.c
> >
> > To be precise what I actually saw was activity involving the "pid" files
> > in the holding directory showing up in an strace log. I am not sure how
> > it got called, but assume it was coming from holding.c because it had
> > the arguments for the kill syscall in the correct order (and thus I
> > believe the activity couldn't have been from the then-in-place
> > Holding.pm 
>  
> code). I will investigate further and let you know what I
> > find.
> >
>
> Okay, sorry for the confusion -- now that I have paid careful attention
> to the part of amflush prior to the Holding::get_all_datestamps() call,
> I see that it's not setup_application() that updates the pid files, but
> rather Disklist:add_holding_to_disklist().
>
> From there the program flow appears to be:
> Amanda::Logfile::search_holding_disk -> find.c:search_holding_disk()
> -> holding.c:holding_get_files() -> holding.c:holding_walk() ->
> holding_dir_stop_if_pid_fn() -> take_holding_pid()
>
> (Obviously the above flow only applies after the holding-pid-3.5.diff
> patch has been applied.)
>
> In any case, the pid file is updated by this call in the early part of
> the amflush run (see below), which explains why the patch for the
> Holding.pm 
>  
> logic needed to include support for the pid file pointing to
> current/parent pid
>
>
> (In order to confirm the origin of the pid updates, I added
> system("find /amanda/TestBackup-holding -name pid -ls");
> and a print-message line immediately before and after the call to
> add_holding_to_disklist(), and here is the resulting output:
>
> =
> # su backup -lc "amflush TestBackup"
> 5636101 4 -rw--- 1 backup backup 5 Nov 21 18:29 
> /amanda/TestBackup-holding/20171121145009/pid
> 5636099 4 -rw--- 1 backup backup 5 Nov 21 18:29 
> /amanda/TestBackup-holding/20171120172656/pid
>  calling Disklist:add_holding_to_disklist 
>  returned from Disklist:add_holding_to_disklist 
> 5636101 4 -rw--- 1 backup backup 5 Nov 21 18:32 
> /amanda/TestBackup-holding/20171121145009/pid
> 5636099 4 -rw--- 1 backup backup 5 Nov 21 18:32 
> /amanda/TestBackup-holding/20171120172656/pid
> Flushing dumps from 20171121145009 using storage "TestBackup", 
> tapechanger "".
> To volume TESTBACKUP-06 or a new volume, (The last dumps were to 
> volume TESTBACKUP-03
> Flushing dumps from 20171121145009 using storage "TestOffsite", tape 
> changer "doxpop_offsite_vtapes".
> To volume TESTBACKUP-109 or a new volume, (The last dumps were to 
> volume TESTBACKUP-03
>
> Are you sure you want to do this [yN]?
> =
>
> [I know from my earlier debugging that the value put into the updated
> pid file is actually the parent pid.])
> Nathan
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-22 Thread Jean-Louis Martineau
For amdump, it is the planner that create the pid file, it must put its 
ppid in it (amdump process) (done in holding.c)
For amflush, it is amflush that create the pid file, it must put its pid 
in it (done in Holding.pm)
In both case, the driver call cleanup_holding to remove the pid, you are 
right, it should check for ppid

Jean-Louis

On 21/11/17 08:13 PM, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 15:09:25 -0500, Jean-Louis Martineau wrote:
> > perl automatically close it when it come out of scope.
> > The close should before the 'if (kill' line.
>
> Okay, I moved the close() line up to that spot as I applied the patch.
>
> I can confirm that with your patch applied and my followup patch
> applied, if I kick off amflush while amdump is running in the
> background, it (amflush) prints out
> Could not find any Amanda directories to flush.
> and exits without attempting to flush anything but if I try again
> after amdump finished, I'm able to complete the flush successfully.
>
>
> A few followup points that came up during my testing:
>
> * when amflush completes now, the pid file is left out there in the
> holding directory, thus preventing the directory from getting deleted:
>
> =
> # ls -lR /amanda/TestBackup-holding/*
> /amanda/TestBackup-holding/20171120172656:
> total 4
> -rw--- 1 backup backup 5 Nov 21 18:50 pid
>
> /amanda/TestBackup-holding/20171121145009:
> total 4
> -rw--- 1 backup backup 5 Nov 21 18:50 pid
> =
>
> Off hand I don't see code in amflush or Amflush.pm 
>  
> trying to clean up
> the holding disk directories so I don't know if these "zombie" pid
> files are related to the third issue (i.e. holding_cleanup_dir()),
> but...
>
> * ...as a more general question I was wondering if it made sense
> for programs that take a lock on a holding directory to have a way to
> explicitly release that lock when they are done? (Probably this
> would be by deleting the pid file, thought it could instead be
> truncating it to zero length, or something.)
>
> The main reason I was wondering about this is that with the current
> behavior of leaving a particular pid in the pid file forever, it
> seems like there's a good chance of false positives when the lock is
> checked by later jobs (i.e. some other, unrelated process using the
> same pid when the lock-check happens). If there were some way for a
> process to release the lock, false positives could only happen after a
> process crashed.
>
> (I don't know how often these false positives would happen in
> practice, but I can imagine it would cause a headache if the pid
> locking a directory containing dumps needing to be flushed to vault
> storage ended up getting reused by some long-runing process...)
>
> * I noticed that holding.c:holding_cleanup_dir() doesn't include parent
> pid in the pid-file check. I haven't tracked back to figure out which
> applications call the holding_cleanup* family of functions so I don't
> know if that will cause actual problems, but I was wondering if that
> function should instead just call call_take_holding() instead, so the
> logic is all found in one place?
>
>
> Nathan
>
>
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 15:09:25 -0500, Jean-Louis Martineau wrote:
> perl automatically close it when it come out of scope.
> The close should before the 'if (kill' line.

Okay, I moved the close() line up to that spot as I applied the patch.

I can confirm that with your patch applied and my followup patch
applied, if I kick off amflush while amdump is running in the
background, it (amflush) prints out
  Could not find any Amanda directories to flush.
and exits without attempting to flush anything but if I try again
after amdump finished, I'm able to complete the flush successfully.


A few followup points that came up during my testing:

* when amflush completes now, the pid file is left out there in the
  holding directory, thus preventing the directory from getting deleted:

=
# ls -lR /amanda/TestBackup-holding/*
/amanda/TestBackup-holding/20171120172656:
total 4
-rw--- 1 backup backup 5 Nov 21 18:50 pid

/amanda/TestBackup-holding/20171121145009:
total 4
-rw--- 1 backup backup 5 Nov 21 18:50 pid
=

  Off hand I don't see code in amflush or Amflush.pm trying to clean up
  the holding disk directories so I don't know if these "zombie" pid
  files are related to the third issue (i.e. holding_cleanup_dir()),
  but...

* ...as a more general question I was wondering if it made sense
  for programs that take a lock on a holding directory to have a way to
  explicitly release that lock when they are done?  (Probably this
  would be by deleting the pid file, thought it could instead be
  truncating it to zero length, or something.)

  The main reason I was wondering about this is that with the current
  behavior of leaving a particular pid in the pid file forever, it
  seems like there's a good chance of false positives when the lock is
  checked by later jobs (i.e. some other, unrelated process using the
  same pid when the lock-check happens).  If there were some way for a
  process to release the lock, false positives could only happen after a
  process crashed.

  (I don't know how often these false positives would happen in
  practice, but I can imagine it would cause a headache if the pid
  locking a directory containing dumps needing to be flushed to vault
  storage ended up getting reused by some long-runing process...)

* I noticed that  holding.c:holding_cleanup_dir() doesn't include parent
  pid in the pid-file check.  I haven't tracked back to figure out which 
  applications call the holding_cleanup* family of functions so I don't
  know if that will cause actual problems, but I was wondering if that 
  function should instead just call call_take_holding() instead, so the
  logic is all found in one place?


Nathan





Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Tue, Nov 21, 2017 at 12:57:00 -0500, Nathan Stratton Treadway wrote:
> > I do not see how Amanda::Util::setup_application call code in holding.c
> 
> To be precise what I actually saw was activity involving the "pid" files
> in the holding directory showing up in an strace log.  I am not sure how
> it got called, but assume it was coming from holding.c because it had
> the arguments for the kill syscall in the correct order (and thus I
> believe the activity couldn't have been from the then-in-place
> Holding.pm code).  I will investigate further and let you know what I
> find.  
> 

Okay, sorry for the confusion -- now that I have paid careful attention
to the part of amflush prior to the Holding::get_all_datestamps() call,
I see that it's not setup_application() that updates the pid files, but
rather Disklist:add_holding_to_disklist().

>From there the program flow appears to be:
  Amanda::Logfile::search_holding_disk -> find.c:search_holding_disk()
  -> holding.c:holding_get_files() -> holding.c:holding_walk() ->
  holding_dir_stop_if_pid_fn() -> take_holding_pid()  

(Obviously the above flow only applies after the holding-pid-3.5.diff
patch has been applied.)

In any case, the pid file is updated by this call in the early part of
the amflush run (see below), which explains why the patch for the
Holding.pm logic needed to include support for the pid file pointing to
current/parent pid


(In order to confirm the origin of the pid updates, I added 
  system("find /amanda/TestBackup-holding -name pid -ls");
and a print-message line immediately before and after the call to
add_holding_to_disklist(), and here is the resulting output:

=
# su backup -lc "amflush TestBackup"
  5636101  4 -rw---   1 backup   backup  5 Nov 21 18:29 
/amanda/TestBackup-holding/20171121145009/pid
  5636099  4 -rw---   1 backup   backup  5 Nov 21 18:29 
/amanda/TestBackup-holding/20171120172656/pid
 calling Disklist:add_holding_to_disklist 
 returned from Disklist:add_holding_to_disklist 
  5636101  4 -rw---   1 backup   backup  5 Nov 21 18:32 
/amanda/TestBackup-holding/20171121145009/pid
  5636099  4 -rw---   1 backup   backup  5 Nov 21 18:32 
/amanda/TestBackup-holding/20171120172656/pid
Flushing dumps from 20171121145009 using storage "TestBackup", tapechanger "".
To volume TESTBACKUP-06 or a new volume,  (The last dumps were to volume 
TESTBACKUP-03
Flushing dumps from 20171121145009 using storage "TestOffsite", tape changer 
"doxpop_offsite_vtapes".
To volume TESTBACKUP-109 or a new volume,  (The last dumps were to volume 
TESTBACKUP-03

Are you sure you want to do this [yN]? 
=

[I know from my earlier debugging that the value put into  the updated
pid file is actually the parent pid.])
Nathan



Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Nathan Stratton Treadway
On Tue, Nov 21, 2017 at 11:45:52 -0500, Jean-Louis Martineau wrote:
> There is no patch attached on your email.

Ah, sorry, attached here.

> 
> I do not see how Amanda::Util::setup_application call code in holding.c

To be precise what I actually saw was activity involving the "pid" files
in the holding directory showing up in an strace log.  I am not sure how
it got called, but assume it was coming from holding.c because it had
the arguments for the kill syscall in the correct order (and thus I
believe the activity couldn't have been from the then-in-place
Holding.pm code).  I will investigate further and let you know what I
find.  

(In any case, I am sure that the pid read from the holding-directory pid
file during the get_all_datestamps() call was the parent pid of the
current process [and of course the file had not contained that value
before amflush was invoked].)


Nathan


Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239
--- Holding.pm.orig	2017-11-20 16:51:32.0 -0500
+++ Holding.pm	2017-11-21 02:10:47.992749021 -0500
@@ -25,6 +25,7 @@
 use File::stat;
 use IO::Dir;
 use POSIX qw( :fcntl_h );
+use Errno;
 use Math::BigInt;
 use strict;
 use warnings;
@@ -198,6 +199,36 @@
 return 1;
 }
 
+# check to see if the given pid identifies a process that could
+# validily be holding the Amanda directory lock (thus preventing 
+# us from taking it).  (See also holding.c:can_take_holding() for
+# related logic..)
+sub _valid_locking_pid {
+my $pid_to_check = shift;
+
+if ( $pid_to_check == $$ || $pid_to_check == getppid() ) {
+# this process or the parent already hold the lock, so we
+# can go ahead and use the directory.
+return 0;
+}
+
+if (kill(0, $pid_to_check) == 0 ) {
+if ($! == Errno::EPERM) {
+	# the process exists  (but we don't have permission to kill it) 
+	return 1;
+}
+else {
+	# (if we reach here, presumably the errno was ESRCH)
+	# the process does not exist; we can use the directory
+	return 0;
+} 
+}
+else {
+	# kill succeeded, so process does exist
+	return 1;
+}
+}
+
 sub _walk {
 my ($file_fn, $verbose, $take_pid) = @_;
 
@@ -233,8 +264,10 @@
 	if (open(my $pidh,  $pidfn)) {
 		my $pid = <$pidh>;
 		close($pidh);
-		if (kill($pid, 0) == 0) {
-		# pid is alive, skip this directory
+
+		if ( _valid_locking_pid($pid) ) {
+		# $pid seems valid, so we skip over this directory
+		print $verbose "skipping directory '$dirfn' due to lock by pid $pid. \n" if $verbose;
 		next;
 		}
 	}


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-21 Thread Jean-Louis Martineau
Nathan,

There is no patch attached on your email.

I do not see how Amanda::Util::setup_application call code in holding.c

Jean-Louis

On 21/11/17 02:52 AM, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 19:03:52 -0500, Nathan Stratton Treadway wrote:
> > On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> > > + if (open(my $pidh, $pidfn)) {
> > > + my $pid = <$pidh>;
> > > + if (kill($pid, 0) == 0) {
> > > + # pid is alive, skip this directory
> > > + next;
> > > + }
> >
> > looks like the order of arguments for the kill() function is reversed in
> > Perl, so the above code fails. I will send a patch with a fix shortly.
>
> When I tried using amflush after applying this patch, I found that
> it was never finding any directories it was willing to flush.
>
> When I debugged a little, I discovered that the Perl kill function
> actually has arguments in reverse order:
> kill SIGNAL, LIST
>
> However, even after fixing that function call amflush still didn't find
> any directories it was willing to flush. Debugging further revealed
> that the "Amanda::Util::setup_application()" call at the top of the
> amflush script ends up (via holding.c) taking over the pid file -- but
> using the parent pid rather than the current process pid. In the case
> of amflush, the parent pid is the shell that invoked amflush, so it's
> still running when the program flow reaches the call to
> Amanda::Holding::get_all_datestamps() , and thus the _walk function
> always thought the directory-lock was still validly held.
>
> The attached patch attempts to fix those problems... but the logic to
> deal with all the above conditions was complex enough that I figured it
> would work better to move it into a separate function.
>
> I have not fully tested this patch yet, but figured I'd send it back so
> you can take a look at it and see if it seems liek the right approach.
>
> (For example, let me know if you can think of a better name than
> "_valid_locking_pid".)
>
>
> Nathan
>
> p.s. I based the kill() ==0/$! == Errno::EPERM logic off of the
> discussion found here:
> https://unix.stackexchange.com/a/294990 
> 
> (i.e. the answer by Stephen Harris found in:
> https://unix.stackexchange.com/questions/294984/how-should-i-check-whether-a-given-pid-is-running/
>  
> 
> )
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-20 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 19:03:52 -0500, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> > +   if (open(my $pidh,  $pidfn)) {
> > +   my $pid = <$pidh>;
> > +   if (kill($pid, 0) == 0) {
> > +   # pid is alive, skip this directory
> > +   next;
> > +   }
> 
> looks like the order of arguments for the kill() function is reversed in
> Perl, so the above code fails.  I will send a patch with a fix shortly.

When I tried using amflush after applying this patch, I found that
it was never finding any directories it was willing to flush.

When I debugged a little, I discovered that the Perl kill function
actually has arguments in reverse order:
  kill SIGNAL, LIST

However, even after fixing that function call amflush still didn't find
any directories it was willing to flush.  Debugging further revealed
that the "Amanda::Util::setup_application()" call at the top of the
amflush script ends up (via holding.c) taking over the pid file -- but
using the parent pid rather than the current process pid.  In the case
of amflush, the parent pid is the shell that invoked amflush, so it's
still running when the program flow reaches the call to
Amanda::Holding::get_all_datestamps() , and thus the _walk function
always thought the directory-lock was still validly held.

The attached patch attempts to fix those problems... but the logic to
deal with all the above conditions was complex enough that I figured it
would work better to move it into a separate function.

I have not fully tested this patch yet, but figured I'd send it back so
you can take a look at it and see if it seems liek the right approach.

(For example, let me know if you can think of a better name than
"_valid_locking_pid".)


Nathan

p.s. I based the kill() ==0/$! == Errno::EPERM logic off of the
discussion found here:
  https://unix.stackexchange.com/a/294990
(i.e. the answer by Stephen Harris found in:
  
https://unix.stackexchange.com/questions/294984/how-should-i-check-whether-a-given-pid-is-running/
)



Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-20 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> + if (open(my $pidh,  $pidfn)) {
> + my $pid = <$pidh>;
> + if (kill($pid, 0) == 0) {
> + # pid is alive, skip this directory
> + next;
> + }

looks like the order of arguments for the kill() function is reversed in
Perl, so the above code fails.  I will send a patch with a fix shortly.

Nathan



Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-20 Thread Jean-Louis Martineau
On 20/11/17 02:56 PM, Nathan Stratton Treadway wrote:
> On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> > Each process should take a lock on the holding directory (a pid file),
>
> Okay, makes sense.
>
>
> > + my $pidfn = File::Spec->catfile($dirfn, "pid");
> > + if (open(my $pidh, $pidfn)) {
> > + my $pid = <$pidh>;
> > + if (kill($pid, 0) == 0) {
> > + # pid is alive, skip this directory
> > + next;
> > + }
> > + close($pidh);
>
> Quick question: does the above leave $pidh open in the case that pid is
> alive? (That is, if the code hits the "next" does that miss a needed
> call to close()?)

perl automatically close it when it come out of scope.
The close should before the 'if (kill' line.

I appreciate your reviews

Jean-Louis
>
> Nathan
>
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-20 Thread Nathan Stratton Treadway
On Mon, Nov 20, 2017 at 14:42:01 -0500, Jean-Louis Martineau wrote:
> Each process should take a lock on the holding directory (a pid file), 

Okay, makes sense.


> + my $pidfn = File::Spec->catfile($dirfn, "pid");
> + if (open(my $pidh,  $pidfn)) {
> + my $pid = <$pidh>;
> + if (kill($pid, 0) == 0) {
> + # pid is alive, skip this directory
> + next;
> + }
> + close($pidh);

Quick question: does the above leave $pidh open in the case that pid is
alive?  (That is, if the code hits the "next" does that miss a needed
call to close()?)

Nathan



Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239


Re: amflush run while amdump underway tries to flush .tmp files

2017-11-20 Thread Jean-Louis Martineau
Each process should take a lock on the holding directory (a pid file), 
The attached path fix it.

Please try it.


On 17/11/17 12:38 AM, Nathan Stratton Treadway wrote:
> (Amanda v3.5)
>
> I noticed that Amanda 3.5 no longer aborts amflush if amdump is
> currently running (as older versions of Amanda do).
>
> So out of curiousity I kicked of "amflush TestBackup" while amdump was
> busy dumping to the holding disk... and I discovered that amflush
> actually tries to go ahead and flush the ".tmp" file files that it finds
> in the holding directory:
>
> = From /var/log/amanda/server/TestBackup/amflush.20171116200510.debug:
> Thu Nov 16 20:05:17.590062176 2017: pid 26860: thd-0x2f07e00: amflush: 
> flushing /amanda/TestBackup-holding/2017111622/client1._.0.tmp
> Thu Nov 16 20:05:17.590096226 2017: pid 26860: thd-0x2f07e00: amflush: 
> flushing /amanda/TestBackup-holding/2017111622/client2._.1.tmp
> =
>
> In this case the taper failed (and thus the amflush didn't actually do
> anything with the .tmp files):
>
> = From /var/log/amanda/TestBackup/amdump.1:
> driver: send-cmd time 14.132 to taper0: FILE-WRITE worker0-0 00-2 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
> 2017111622 "" "" "" "" "" "" "" "" 0
> writing taper command 'FILE-WRITE worker0-0 00-2 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
> 2017111622 "" "" "" "" "" "" "" "" 0
> ' failed: Broken pipe
> =
> (There is a line break in the log file just before "' failed".)
>
>
> = From /var/log/amanda/server/TestBackup/taper.20171116200517.debug:
> Thu Nov 16 20:05:18.502419601 2017: pid 26862: thd-0x4232000: taper: 
> Building type SPLIT_FILE header of 32768-32768 bytes with 
> name='client1' disk='/' dumplevel=0 and blocksize=32768
> Thu Nov 16 20:05:22.427709157 2017: pid 26862: thd-0x4232050: taper: 
> no next_filename
> Thu Nov 16 20:05:22.427743969 2017: pid 26862: thd-0x4232050: taper: 
> sending XMSG_CRC message
> Thu Nov 16 20:05:22.427748905 2017: pid 26862: thd-0x4232050: taper: 
> xfer-source-holding CRC: 2e4f7128 size: 249856000
> Thu Nov 16 20:05:22.427757739 2017: pid 26862: thd-0x4232050: taper: 
> xfer_queue_message: MSG:  elt= version=0>
> Thu Nov 16 20:05:22.427767783 2017: pid 26862: thd-0x4232050: taper: 
> xfer-source-holding sending XMSG_DONE message
> Thu Nov 16 20:05:22.427773216 2017: pid 26862: thd-0x4232050: taper: 
> xfer_queue_message: MSG:  elt= version=0>
> [ *** file ends abruptly here ***]
> =
>
>
>  but whether or not that indicates a bug in the taper, it seems like
> amflush should not ever try to flush .tmp files from the holding disk...
> (right?)
>
>
>
> Finally, after this testing I notice that the command_file still has
> FLUSH commands for those .tmp files (even though neither the files nor
> the containing holding directory now exist). I've run both "amdump" and
> "amflush" since then, and tried "amcleanup" as well. Is there any
> (good) way to clean up these orphan commands?
No, you must manually remove them. Do it when no other amanda processes 
are running.

Jean-Louis
>
> = From /etc/amanda/TestBackup/command_file:
> ID 1633
> 1603 FLUSH TestBackup 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 
> 2017111622 0 TestBackup WORKING:17072 TODO
> 1604 FLUSH TestBackup 
> /amanda/TestBackup-holding/2017111622/client2._.1.tmp client2 / 
> 2017111622 0 TestBackup WORKING:17072 TODO
> =
>
> =
> # ls -l /amanda/TestBackup-holding/
> total 0
> =
>
>
> Nathan
>
> 
> Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region
> Ray Ontko & Co. - Software consulting services - http://www.ontko.com/ 
> 
> GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt 
>  
> ID: 1023D/ECFB6239
> Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
This message is the property of CARBONITE, INC. and may contain confidential or 
privileged information.
If this message has been delivered to you by mistake, then do not copy or 
deliver this message to anyone.  Instead, destroy it and notify me by reply 
e-mail
diff --git a/perl/Amanda/Holding.pm b/perl/Amanda/Holding.pm
index 88f1fd8..5b550c9 100644
--- a/perl/Amanda/Holding.pm
+++ b/perl/Amanda/Holding.pm
@@ -199,7 +199,7 @@ sub _is_datestr {
 }
 
 sub _walk {
-my ($file_fn, $verbose) = @_;
+my ($file_fn, $verbose, $take_pid) = @_;
 
 # walk disks, directories, and files with nested loops
 for my $disk (disks()) {
@@ -229,6 +229,21 @@ sub _walk {
 		next;
 	}
 
+	my $pidfn = File::Spec->catfile($dirfn, "pid");
+	if (open(my $pidh,  $pidfn)) {
+		my $pid = <$pidh>;
+		if (kill($pid, 0) == 0) {
+		# 

Re: amflush run while amdump underway tries to flush .tmp files

2017-11-19 Thread Jose M Calhariz
Hi,

I have a user that have reported this problem and with a suspection that
amflush could flush the temporary files.

I leave here the link to the bugreport:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881754

Kind regards
Jose M Calhariz


On Fri, Nov 17, 2017 at 12:38:12AM -0500, Nathan Stratton Treadway wrote:
> (Amanda v3.5)
> 
> I noticed that Amanda 3.5 no longer aborts amflush if amdump is
> currently running (as older versions of Amanda do). 
> 
> So out of curiousity I kicked of "amflush TestBackup" while amdump was
> busy dumping to the holding disk... and I discovered that amflush
> actually tries to go ahead and flush the ".tmp" file files that it finds
> in the holding directory:
> 
> = From /var/log/amanda/server/TestBackup/amflush.20171116200510.debug:
> Thu Nov 16 20:05:17.590062176 2017: pid 26860: thd-0x2f07e00: amflush: 
> flushing /amanda/TestBackup-holding/2017111622/client1._.0.tmp
> Thu Nov 16 20:05:17.590096226 2017: pid 26860: thd-0x2f07e00: amflush: 
> flushing /amanda/TestBackup-holding/2017111622/client2._.1.tmp
> =
> 
> In this case the taper failed (and thus the amflush didn't actually do
> anything with the .tmp files):
> 
> = From /var/log/amanda/TestBackup/amdump.1:
> driver: send-cmd time 14.132 to taper0: FILE-WRITE worker0-0 00-2 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
> 2017111622 "" "" "" "" "" "" "" "" 0
> writing taper command 'FILE-WRITE worker0-0 00-2 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
> 2017111622 "" "" "" "" "" "" "" "" 0
> ' failed: Broken pipe
> =
> (There is a line break in the log file just before "' failed".)
> 
> 
> = From /var/log/amanda/server/TestBackup/taper.20171116200517.debug:
> Thu Nov 16 20:05:18.502419601 2017: pid 26862: thd-0x4232000: taper: Building 
> type SPLIT_FILE header of 32768-32768 bytes with name='client1' disk='/' 
> dumplevel=0 and blocksize=32768
> Thu Nov 16 20:05:22.427709157 2017: pid 26862: thd-0x4232050: taper: no 
> next_filename
> Thu Nov 16 20:05:22.427743969 2017: pid 26862: thd-0x4232050: taper: sending 
> XMSG_CRC message
> Thu Nov 16 20:05:22.427748905 2017: pid 26862: thd-0x4232050: taper: 
> xfer-source-holding CRC: 2e4f7128 size: 249856000
> Thu Nov 16 20:05:22.427757739 2017: pid 26862: thd-0x4232050: taper: 
> xfer_queue_message: MSG:  elt= version=0>
> Thu Nov 16 20:05:22.427767783 2017: pid 26862: thd-0x4232050: taper: 
> xfer-source-holding sending XMSG_DONE message
> Thu Nov 16 20:05:22.427773216 2017: pid 26862: thd-0x4232050: taper: 
> xfer_queue_message: MSG:  elt= version=0>
> [ *** file ends abruptly here ***]
> =
> 
> 
>  but whether or not that indicates a bug in the taper, it seems like
> amflush should not ever try to flush .tmp files from the holding disk...
> (right?)
> 
> 
> 
> Finally, after this testing I notice that the command_file still has
> FLUSH commands for those .tmp files (even though neither the files nor
> the containing holding directory now exist).  I've run both "amdump" and
> "amflush" since then, and tried "amcleanup" as well.  Is there any
> (good) way to clean up these orphan commands?
> 
> = From /etc/amanda/TestBackup/command_file:
> ID 1633
> 1603 FLUSH TestBackup 
> /amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 
> 2017111622 0 TestBackup WORKING:17072 TODO
> 1604 FLUSH TestBackup 
> /amanda/TestBackup-holding/2017111622/client2._.1.tmp client2 / 
> 2017111622 0 TestBackup WORKING:17072 TODO
> =
> 
> =
> # ls -l /amanda/TestBackup-holding/
> total 0
> =
> 
> 
>   Nathan
> 
> 
> Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
> Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
>  GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
>  Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239
> 

-- 
--
Todo homem luta com mais bravura por seus interesses do que
seus direitos.
-- Napoleao


signature.asc
Description: PGP signature


amflush run while amdump underway tries to flush .tmp files

2017-11-16 Thread Nathan Stratton Treadway
(Amanda v3.5)

I noticed that Amanda 3.5 no longer aborts amflush if amdump is
currently running (as older versions of Amanda do). 

So out of curiousity I kicked of "amflush TestBackup" while amdump was
busy dumping to the holding disk... and I discovered that amflush
actually tries to go ahead and flush the ".tmp" file files that it finds
in the holding directory:

= From /var/log/amanda/server/TestBackup/amflush.20171116200510.debug:
Thu Nov 16 20:05:17.590062176 2017: pid 26860: thd-0x2f07e00: amflush: flushing 
/amanda/TestBackup-holding/2017111622/client1._.0.tmp
Thu Nov 16 20:05:17.590096226 2017: pid 26860: thd-0x2f07e00: amflush: flushing 
/amanda/TestBackup-holding/2017111622/client2._.1.tmp
=

In this case the taper failed (and thus the amflush didn't actually do
anything with the .tmp files):

= From /var/log/amanda/TestBackup/amdump.1:
driver: send-cmd time 14.132 to taper0: FILE-WRITE worker0-0 00-2 
/amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
2017111622 "" "" "" "" "" "" "" "" 0
writing taper command 'FILE-WRITE worker0-0 00-2 
/amanda/TestBackup-holding/2017111622/client1._.0.tmp client1 / 0 
2017111622 "" "" "" "" "" "" "" "" 0
' failed: Broken pipe
=
(There is a line break in the log file just before "' failed".)


= From /var/log/amanda/server/TestBackup/taper.20171116200517.debug:
Thu Nov 16 20:05:18.502419601 2017: pid 26862: thd-0x4232000: taper: Building 
type SPLIT_FILE header of 32768-32768 bytes with name='client1' disk='/' 
dumplevel=0 and blocksize=32768
Thu Nov 16 20:05:22.427709157 2017: pid 26862: thd-0x4232050: taper: no 
next_filename
Thu Nov 16 20:05:22.427743969 2017: pid 26862: thd-0x4232050: taper: sending 
XMSG_CRC message
Thu Nov 16 20:05:22.427748905 2017: pid 26862: thd-0x4232050: taper: 
xfer-source-holding CRC: 2e4f7128 size: 249856000
Thu Nov 16 20:05:22.427757739 2017: pid 26862: thd-0x4232050: taper: 
xfer_queue_message: MSG:  version=0>
Thu Nov 16 20:05:22.427767783 2017: pid 26862: thd-0x4232050: taper: 
xfer-source-holding sending XMSG_DONE message
Thu Nov 16 20:05:22.427773216 2017: pid 26862: thd-0x4232050: taper: 
xfer_queue_message: MSG:  version=0>
[ *** file ends abruptly here ***]
=


 but whether or not that indicates a bug in the taper, it seems like
amflush should not ever try to flush .tmp files from the holding disk...
(right?)



Finally, after this testing I notice that the command_file still has
FLUSH commands for those .tmp files (even though neither the files nor
the containing holding directory now exist).  I've run both "amdump" and
"amflush" since then, and tried "amcleanup" as well.  Is there any
(good) way to clean up these orphan commands?

= From /etc/amanda/TestBackup/command_file:
ID 1633
1603 FLUSH TestBackup /amanda/TestBackup-holding/2017111622/client1._.0.tmp 
client1 / 2017111622 0 TestBackup WORKING:17072 TODO
1604 FLUSH TestBackup /amanda/TestBackup-holding/2017111622/client2._.1.tmp 
client2 / 2017111622 0 TestBackup WORKING:17072 TODO
=

=
# ls -l /amanda/TestBackup-holding/
total 0
=


Nathan


Nathan Stratton Treadway  -  natha...@ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239