Re: Backup issues with OpenBSD 4.5 machines
On Mon, Sep 28, 2009 at 9:42 PM, Nathan Stratton Treadway natha...@ontko.com wrote: As you say, having this extra call shouldn't really hurt anything, but I wondering if it would make sense to tweak the #ifdef so that is possible to compile Amanda on OpenBSD without having to include the work-around even in environments where it's not necessary (e.g. when using the new theading library you mention)? I wouldn't mind doing this, but it's only six fcntl() calls on startup, so the cost really is negligible. (For example, does Configure generate any particular #define related to the use of the pthread library?) I don't know a good way to do this -- the best I can think of is to check GLIB_LDFLAGS for /pthread/, but I don't know if OpenBSD compiles require that pthread be listed explicitly on the link line when compiling against libgthread. It would help if I had an OpenBSD system handy.. When a version of OpenBSD is released where this is no longer an issue (5.0?), we can conditionalize it on that version. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
I added the following lines to util.c as a temporary fix: #define DATA_FD_COUNT 3 /* number of general-use pipes */ #define DATA_FD_OFFSET 50 I ran three test backups, and each succeeded: - forced full, without compression, no index - forced full, with compression and index - unforced, with compression and index Thanks, Michael On Mon, Sep 28, 2009 at 2:46 PM, Michael Burk bur...@gmail.com wrote: Hi Dustin, Great analysis; thanks for sharing the details of the problem. On Mon, Sep 28, 2009 at 12:11 PM, Dustin J. Mitchell dus...@zmanda.com wrote: while putting together the patch (attached), I patched the 0928 snapshot, but it didn't compile: util.c: In function `openbsd_fd_inform': util.c:1290: error: `DATA_FD_OFFSET' undeclared (first use in this function) util.c:1290: error: (Each undeclared identifier is reported only once util.c:1290: error: for each function it appears in.) util.c:1290: error: `DATA_FD_COUNT' undeclared (first use in this function) I tried adding amandad.h to the file, but that didn't help. Did I patch the wrong version? Thanks Michael
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 29, 2009 at 12:55 PM, Michael Burk bur...@gmail.com wrote: I added the following lines to util.c as a temporary fix: #define DATA_FD_COUNT 3 /* number of general-use pipes */ #define DATA_FD_OFFSET 50 I ran three test backups, and each succeeded: - forced full, without compression, no index - forced full, with compression and index - unforced, with compression and index Thanks! I'll fix up the compile problem -- not sure how it worked for me :) Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 29, 2009 at 3:10 PM, Dustin J. Mitchell dus...@zmanda.com wrote: Thanks! I'll fix up the compile problem -- not sure how it worked for me :) Duh, it worked for me because I'm not running OpenBSD.. Anyway, fixed in r2149. Thanks to everyone on this 100+-message thread! Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Sep 28, 2009 at 1:48 PM, Dustin J. Mitchell dus...@zmanda.com wrote: Option 1 would be temporary -- eventually, I would like to be able to use threads on clients, to support compression and encryption, for example. Option 1 is also harder than it sounds -- futzing with the build process is like playing whack-a-mole, where any change causes problems on another platform. Apologies for replying to myself, but while putting together the patch (attached), I noticed that this will affect the server, too -- amandad is the gateway for amrecover to get access to data for recoveries. So option 1 is not really feasible. Dustin -- Open Source Storage Engineer http://www.zmanda.com help out OpenBSD's pthreads (f369fd11620b09b5efb71eb05ce088fa326b4082) client-src/noop.c|1 + client-src/selfcheck.c |1 + client-src/sendbackup.c |1 + client-src/sendsize.c|1 + common-src/util.c| 12 common-src/util.h| 10 ++ restore-src/amidxtaped.c |1 + server-src/amindexd.c|1 + 8 files changed, 28 insertions(+), 0 deletions(-) diff --git a/client-src/noop.c b/client-src/noop.c index 1edf46f..e9b6270 100644 --- a/client-src/noop.c +++ b/client-src/noop.c @@ -65,6 +65,7 @@ main( signal(SIGPIPE, SIG_IGN); safe_fd(-1, 0); +openbsd_fd_inform(); check_running_as(RUNNING_AS_CLIENT_LOGIN); diff --git a/client-src/selfcheck.c b/client-src/selfcheck.c index 1f22717..ba1ee50 100644 --- a/client-src/selfcheck.c +++ b/client-src/selfcheck.c @@ -108,6 +108,7 @@ main( textdomain(amanda); safe_fd(-1, 0); +openbsd_fd_inform(); safe_cd(); set_pname(selfcheck); diff --git a/client-src/sendbackup.c b/client-src/sendbackup.c index fca1f98..219ebf6 100644 --- a/client-src/sendbackup.c +++ b/client-src/sendbackup.c @@ -137,6 +137,7 @@ main( textdomain(amanda); safe_fd(DATA_FD_OFFSET, DATA_FD_COUNT*2); +openbsd_fd_inform(); safe_cd(); diff --git a/client-src/sendsize.c b/client-src/sendsize.c index e4d589d..165c363 100644 --- a/client-src/sendsize.c +++ b/client-src/sendsize.c @@ -151,6 +151,7 @@ main( textdomain(amanda); safe_fd(-1, 0); +openbsd_fd_inform(); safe_cd(); set_pname(sendsize); diff --git a/common-src/util.c b/common-src/util.c index 9ccd9fb..d173e34 100644 --- a/common-src/util.c +++ b/common-src/util.c @@ -1282,3 +1282,15 @@ get_pcontext(void) return pcontext; } +#ifdef __OpenBSD__ +void +openbsd_fd_inform(void) +{ +int i; +for (i = DATA_FD_OFFSET; i DATA_FD_OFFSET + DATA_FD_COUNT*2; i++) { + /* a simple fcntl() will cause the library to look at this file +* descriptor, which is good enough */ + (void)fcntl(i, F_GETFL); +} +} +#endif diff --git a/common-src/util.h b/common-src/util.h index 8604cf1..f74910a 100644 --- a/common-src/util.h +++ b/common-src/util.h @@ -348,4 +348,14 @@ void proplist_add_to_argv(gpointer key_p, gpointer user_data_p); +/* Inform the OpenBSD pthread library about the high-numbered file descriptors + * that an amandad service inherits. This won't be necessary once the new + * threading library is availble (OpenBSD 5.0?), but won't hurt anyway. See the + * thread Backup issues with OpenBSD 4.5 machines from September 2009. */ +#ifdef __OpenBSD__ +void openbsd_fd_inform(void); +#else +#define openbsd_fd_inform() +#endif + #endif /* UTIL_H */ diff --git a/restore-src/amidxtaped.c b/restore-src/amidxtaped.c index 3b5b904..42aa369 100644 --- a/restore-src/amidxtaped.c +++ b/restore-src/amidxtaped.c @@ -261,6 +261,7 @@ main( textdomain(amanda); safe_fd(DATA_FD_OFFSET, 4); +openbsd_fd_inform(); safe_cd(); /* Don't die when child closes pipe */ diff --git a/server-src/amindexd.c b/server-src/amindexd.c index e2d56d5..98e3c28 100644 --- a/server-src/amindexd.c +++ b/server-src/amindexd.c @@ -1251,6 +1251,7 @@ main( textdomain(amanda); safe_fd(DATA_FD_OFFSET, 2); +openbsd_fd_inform(); safe_cd(); /*
Re: Backup issues with OpenBSD 4.5 machines
OK, I have a more in-depth summary of exactly what's going on here, and why the fcntl() calls fix it. The good news: we've stumbled on a pretty stable fix for this problem. As background, the Amanda client operates something like this: amandad is invoked by (x)inetd or some other mechanism amandad uses the Amanda protocol to determine which service the server is requesting (sendbackup in this case) amandad forks and executes that service, after setting up a bunch of pipes for it. The unusual thing is that, aside from the usual stdin/stdout/stderr (fd's 0-2), amandad sets up six pipes at hard-wired file descriptors 50-55, and sendbackup uses those to send the data, index, and message streams back to the server. As background on POSIX: Multiple processes can hold the same file open at the same time. This is how, for example, a backgrounded process in the shell can share your terminal with the shell itself. Each of those processes would like to access the file either in blocking mode (waiting for data to be available) or nonblocking mode (immediately returning when no data is available, to allow the application to work on something else while waiting). Unfortunately, POSIX specifies that the file *itself* carries the O_NONBLOCK flag, so it is not specific to the application. In the case at hand, Amanda is accessing a particular pipe in nonblocking mode (for reasons explained below), while gzip expects it to be in blocking mode, and this leads to the EAGAIN that is killing gzip. As background on the OpenBSD pthreads (or, more accurately, uthreads -- lib/libpthread/uthread): This library shims its way between an application and the kernel, and implements blocking threaded operations on file descriptors using nonblocking kernel operations and a select() loop. In order to do so, it must set O_NONBLOCK on every file it accesses. This is easily accomplished by wrapping open(), pipe(), dup(), dup2(), socket(), and so on -- the syscalls which create new file descriptors. However, inherited file descriptors -- those opened by the parent before calling execve() -- are a little bit harder, because the library has no way to know about them. It hides this O_NONBLOCK flag from the application by masking it out of the fcntl() return value. The solution that uthreads uses is to start tracking a file the first time it is referenced in a syscall (in _thread_fd_lock, to be precise). It sets O_NONBLOCK when it starts tracking the file, and then removes the flag at the appropriate time (execve, in particular). In the failure mode, uthreads finds out about the index file *after* it has forked the gzip child, when sendbackup tries to close the file descriptor. Due to the design of the library, it carefully sets O_NONBLOCK before closing the file descriptor, leading gzip to get an EAGAIN error. The mysterious fcntl() calls, however, serve as a warning to uthreads that the index file exists. Uthreads sets the O_NONBLOCK flag when performing the fcntl(), but then clears it on execve(), so everything works as expected. So, there are really two fixes available, until OpenBSD's new threading library is available: 1. don't link Amanda client libraries with threading libraries 2. inform uthreads of the high-numbered FD's in all of the service binaries, using fcntl() Option 1 would be temporary -- eventually, I would like to be able to use threads on clients, to support compression and encryption, for example. Option 1 is also harder than it sounds -- futzing with the build process is like playing whack-a-mole, where any change causes problems on another platform. From my analysis above, option 2 is fairly robust (as robust as OpenBSD's pthreads, anyway), and won't cause any trouble on systems with non-buggy threading libraries. So I'm leaning that direction. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
Hi Dustin, Great analysis; thanks for sharing the details of the problem. On Mon, Sep 28, 2009 at 12:11 PM, Dustin J. Mitchell dus...@zmanda.com wrote: while putting together the patch (attached), I patched the 0928 snapshot, but it didn't compile: util.c: In function `openbsd_fd_inform': util.c:1290: error: `DATA_FD_OFFSET' undeclared (first use in this function) util.c:1290: error: (Each undeclared identifier is reported only once util.c:1290: error: for each function it appears in.) util.c:1290: error: `DATA_FD_COUNT' undeclared (first use in this function) I tried adding amandad.h to the file, but that didn't help. Did I patch the wrong version? Thanks Michael
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Sep 28, 2009 at 13:48:37 -0400, Dustin J. Mitchell wrote: The mysterious fcntl() calls, however, serve as a warning to uthreads that the index file exists. Uthreads sets the O_NONBLOCK flag when performing the fcntl(), but then clears it on execve(), so everything works as expected. Congrats on tracking down all that! :) One followup, question, though: Early on in this thread (Aug 30 and 31), Michael reported that one of the early patches that Jean-Louis sent to him (or to Stan?) did fix the original gzip: stdout: Resource temporarily unavailable message... but then there was instead a similar error from the sed process that's part of the index-generation pipeline. (Based on the followup discussion, it seems that early patch only touched the handle for the data file, though I am not sure exact where/when.) However, a bit later on (Sept 4), Michael reported that the one-line patch that touched just datafd (in sendbackup.c/main(), immediately after values are assigned to datafd/mesgfd/indexfd) was enough to allow his dumps to complete without error, no matter what combination of indexing and compression he had configured. Now that you have figured out that this is all, in fact, related to pthreads, do you understand why this later patch was able to eliminate the errors from the sed process as well? Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239 signature.asc Description: Digital signature
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Sep 28, 2009 at 14:11:51 -0400, Dustin J. Mitchell wrote: +++ b/common-src/util.c @@ -1282,3 +1282,15 @@ get_pcontext(void) return pcontext; } +#ifdef __OpenBSD__ [...] diff --git a/common-src/util.h b/common-src/util.h index 8604cf1..f74910a 100644 --- a/common-src/util.h +++ b/common-src/util.h @@ -348,4 +348,14 @@ void proplist_add_to_argv(gpointer key_p, gpointer user_data_p); +/* Inform the OpenBSD pthread library about the high-numbered file descriptors + * that an amandad service inherits. This won't be necessary once the new + * threading library is availble (OpenBSD 5.0?), but won't hurt anyway. See the + * thread Backup issues with OpenBSD 4.5 machines from September 2009. */ +#ifdef __OpenBSD__ +void openbsd_fd_inform(void); +#else [...] As you say, having this extra call shouldn't really hurt anything, but I wondering if it would make sense to tweak the #ifdef so that is possible to compile Amanda on OpenBSD without having to include the work-around even in environments where it's not necessary (e.g. when using the new theading library you mention)? (For example, does Configure generate any particular #define related to the use of the pthread library?) Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239 signature.asc Description: Digital signature
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 8, 2009 at 10:20 PM, Nathan Stratton Treadwaynatha...@ontko.com wrote: At that time the discussion was focused on common-src/stream.c, which hadn't changed significantly between those versions, but it would be interesting to know if there were any changes in the sendbackup code path after 2.5.0p1 that seem like they might be related to this bug Sigh, that reaches way back into the CVS days and Jean-Louis' memory -- he's not around right now. There were *bunches* of changes between 2.5.0 and 2.5.2, not least of which the introduction of the Backup API (which became the Application API). If we could reproduce the behavior in a test program, exploring the history might be a more fruitful avenue for pursuit, since we'd know what to look for. One thing we could do is remove all of the other patches and replace the fixing fcntl() calls with something similar -- maybe just sleep(1)? If this fixes the problem, then that's evidence this was a race condition. Alternately, remove all of the other patches and add: 1285 ptr = buffer; 1286 bytes_written = 0; g_debug(fcntl(%d) = %d, fileno(pipe_fp), fcntl(fileno(pipe_fp), F_GETFL, 0), O_NONBLOCK); 1287 just_written = full_write(fileno(pipe_fp), ptr, (size_t)bytes_read); 1288 if (just_written (size_t)bytes_read) { and try to trigger the same error. I suspect you'll find that the fcntl is returning something with bit 2 (O_NONBLOCK) set. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
Hi Dustin - one question before I post on the OpenBSD list. In reviewing a post Stan made a couple weeks ago to the OpenBSD list, someone asked if Amanda uses pthreads. I noticed that ldd reports that the binaries link to libpthread. Does Amanda use pthreads, either directly or through some other library? -- Michael On Wed, Sep 9, 2009 at 7:31 AM, Dustin J. Mitchell dus...@zmanda.comwrote: On Tue, Sep 8, 2009 at 10:20 PM, Nathan Stratton Treadwaynatha...@ontko.com wrote: At that time the discussion was focused on common-src/stream.c, which hadn't changed significantly between those versions, but it would be interesting to know if there were any changes in the sendbackup code path after 2.5.0p1 that seem like they might be related to this bug Sigh, that reaches way back into the CVS days and Jean-Louis' memory -- he's not around right now. There were *bunches* of changes between 2.5.0 and 2.5.2, not least of which the introduction of the Backup API (which became the Application API). If we could reproduce the behavior in a test program, exploring the history might be a more fruitful avenue for pursuit, since we'd know what to look for. One thing we could do is remove all of the other patches and replace the fixing fcntl() calls with something similar -- maybe just sleep(1)? If this fixes the problem, then that's evidence this was a race condition. Alternately, remove all of the other patches and add: 1285 ptr = buffer; 1286 bytes_written = 0; g_debug(fcntl(%d) = %d, fileno(pipe_fp), fcntl(fileno(pipe_fp), F_GETFL, 0), O_NONBLOCK); 1287 just_written = full_write(fileno(pipe_fp), ptr, (size_t)bytes_read); 1288 if (just_written (size_t)bytes_read) { and try to trigger the same error. I suspect you'll find that the fcntl is returning something with bit 2 (O_NONBLOCK) set. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Wed, Sep 9, 2009 at 11:57 AM, Michael Burkbur...@gmail.com wrote: Hi Dustin - one question before I post on the OpenBSD list. In reviewing a post Stan made a couple weeks ago to the OpenBSD list, someone asked if Amanda uses pthreads. I noticed that ldd reports that the binaries link to libpthread. Does Amanda use pthreads, either directly or through some other library? Yes, for uses = links against in the particular cases of amandad and sendbackup. No client-side processes ever run more than one thread, though. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 8, 2009 at 11:45 AM, Michael Burkbur...@gmail.com wrote: Thanks Dustin. Attached are the gory details. What about the amandad logfile? Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 8, 2009 at 12:42 PM, Michael Burkbur...@gmail.com wrote: Sorry - here they are. OK, I don't see anything funny there. I think it's finally time to take this to the OpenBSD list and see if they can find anything in this wilderness. It might help to provide a pointer to the archive of this (now *very* long) thread. I took a look at the kernel source.. EAGAIN only appears in a few places, and from what I can tell is adequately protected by O_NONBLOCK/FNONBLOCK in sys_pipe.c. Looking at sys_fcntl, the F_GETFL path is very simple and doesn't do any kind of caching that might change behavior. At this point, I'm guessing that this may be a race condition, and that the fcntl() calls are just enough to change the winner of the race. File flags are global to the open file -- they are not duplicated on a fork, and do not track with a particular process. So, hypothetically, if amandad opens a file descriptor, hands it off to sendbackup, and then sets it to nonblocking mode, then we would see this behavior. The only wrinkle in this theory is that amandad very quickly closes its copy of the file, and doesn't ever set it to nonblocking mode. At this point, I'm out of options. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
OK, sorry for the delay. Attached is a patch, also at http://github.com/djmitche/amanda/commit/435ed5a820819188578df4a8a730a6b084a9f29f.patch which adds a whole bunch of debugging. I'd like to hear how this works, and to see the sendbackup and amandad debug log files. This should elucidate any weird file-descriptor stuff going on. Dustin -- Open Source Storage Engineer http://www.zmanda.com lots_of_debugging.patch Description: Binary data
Re: Backup issues with OpenBSD 4.5 machines
On Wed, Sep 02, 2009 at 11:11:39 -0600, Michael Burk wrote: This was a good idea; I tried it with one modification: I determined earlier that the failure happens without indexing also, so I added just the line: fcntl(datafd, F_GETFL, 0); and that fixed the problem as well. So I guess this is truly the minimal patch! Are you referring back to the results you posted to here (to amanda-users) on Sept 1? (That is, this message: http://www.mail-archive.com/amanda-users@amanda.org/msg42749.html ) I take it you are concluding that the End of ape detected message in that log file is actually caused by the same underlying resource temporarily unavailable error on the pipe as the the index tee cannot write message you originally reported? (It does seem to show up in the same spot during the dump, right after Pass III starts.) Did you try running a dump with the above one-line patch applied and indexing turned off? Nathan Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
Attached is a test program I just put together which does about what Jean-Louis specified above (with the addition of some closed fd's). This works fine on my mac, which is the closest approximation to OpenBSD I have access to at the moment. How does it work on 4.5? The file-descriptor gymnastics that amandad performs is pretty crazy, so it's quite possible I've misread the code and it's doing something stupid, but please let me know what this test program does. Second, in Jean-Louis' minimal patch, can you change it to read g_debug(first, fcntl returns %d; O_NONBLOCK=%d, fcntl(datafd, F_GETFL, 0), O_NONBLOCK); g_debug(and then fcntl returns %d; O_NONBLOCK=%d, fcntl(datafd, F_GETFL, 0), O_NONBLOCK); and let me know what the corresponding lines in the debug file say? It will be interesting to know if the descriptor flags have O_NONBLOCK set either time. We'll get to the bottom of this :) Dustin -- Open Source Storage Engineer http://www.zmanda.com test.c Description: Binary data
Re: Backup issues with OpenBSD 4.5 machines
On Wed, Sep 02, 2009 at 11:11:39 -0600, Michael Burk wrote: This was a good idea; I tried it with one modification: I determined earlier that the failure happens without indexing also, so I added just the line: fcntl(datafd, F_GETFL, 0); and that fixed the problem as well. So I guess this is truly the minimal patch! I tried to understand the program flow for sendbackup.c and sendbackup_dump.c, and I agree with Dustin that it is pretty crazy, but as far as I can tell the only place that would generate the message index tee cannot write [ is from the following code found in the start_index() function in sendbackup.c: just_written = full_write(3, ptr, bytes_read); if (just_written (size_t)bytes_read) { error(_(index tee cannot write [%s]), strerror(errno)); /*NOTREACHED*/ } else { Tracing backwards from there, it appears that the file descriptor #3 in this chunck of code will point to the same file as the datafd referenced in the patch you have been trying only if client-side compression is not turned on. If it is turned on, it looks like fd #3 will instead point to the input side of the process that is doing the compression. So, if other avenues of research don't lead to a solution first, it might be interesting to try turning on client side compression (both with and without the one-line patch applied), to see if a similar error occurs in that case (and if so, whether the error is still in the index- tee-write operation or if the compression process complains instead). Also, when switched to your one-line patch (where you call fnctl() on datafd only), did you start getting the strange(?): sed: stdout: Resource temporarily unavailable messages in the log file again? Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 4, 2009 at 4:57 PM, Michael Burkbur...@gmail.com wrote: Thanks again for your help. Here's the output of the test prog: Looks just like it does locally. If the test had managed to reproduce this failure, then I would have expected to see write: Resource temporarily unavailable That means something deeper is going on here. And here's the output after the patch change (again with the 0831 snapshot): And this shows me that, as far as the kernel is concerned, O_NONBLOCK is not set on the file descriptor even on the first call. Since my test program hasn't replicated the error, I think we should explore a bit more before going to the openbsd lists. What I'd like to put together is a patch to amandad and sendbackup that posts a debug message about every dup(), dup2(), pipe(), fcntl(), and close() operation that they perform. I'll whip that up and send it along in a few minutes. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
Hi Dustin, Thanks again for your help. Here's the output of the test prog: bu...@selenium$ ./test parent pipe = r...@3 w...@4 parent closing parent sleeping child closing p[0] child exec'ing child duping child closing B child writing child write done parent reading parent got 4 bytes bu...@selenium$ And here's the output after the patch change (again with the 0831 snapshot): 1252076894.258118: sendbackup: pid 23986 ruid 150 euid 150 version 2.6.2alpha: start at Fri Sep 4 09:08:14 2009 1252076894.258292: sendbackup: Version 2.6.2alpha 1252076894.280268: sendbackup: pid 23986 ruid 150 euid 150 version 2.6.2alpha: rename at Fri Sep 4 09:08:14 2009 1252076894.280787: sendbackup: Parsed request as: program `DUMP' 1252076894.280799: sendbackup: disk `/' 1252076894.280808: sendbackup: device `/' 1252076894.280818: sendbackup: level 0 1252076894.280827: sendbackup: since NODATE 1252076894.280836: sendbackup: options `' 1252076894.280909: sendbackup: first, fcntl returns 2; O_NONBLOCK=4 1252076894.280933: sendbackup: and then fcntl returns 2; O_NONBLOCK=4 1252076894.281309: sendbackup: start: selenium.___.com:/ lev 0 1252076894.281838: sendbackup: dumping device '/dev/rsd0a' with 'ffs' 1252076894.281943: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0a in pipeline 1252076894.282765: sendbackup: Started backup 1252076894.286940: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Fri Sep 4 09:08:14 2009 1252076894.288165: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1252076894.350837: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0a (/) to standard output 1252076894.352274: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1252076895.520191: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1252076895.520782: sendbackup: 90: normal(|): DUMP: estimated 47910 tape blocks. 1252076895.522299: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Fri Sep 4 09:08:15 2009 1252076895.522802: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1252076895.669127: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1252076906.028912: sendbackup: 43:size(|): DUMP: 48385 tape blocks 1252076906.029606: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Fri Sep 4 09:08:14 2009 1252076906.030489: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Fri Sep 4 09:08:26 2009 1252076906.030896: sendbackup: 90: normal(|): DUMP: Volume 1 took 0:00:11 1252076906.031250: sendbackup: 90: normal(|): DUMP: Volume 1 transfer rate: 4398 KB/s 1252076906.031614: sendbackup: 90: normal(|): DUMP: Date this dump completed: Fri Sep 4 09:08:26 2009 1252076906.031962: sendbackup: 90: normal(|): DUMP: Average transfer rate: 4398 KB/s 1252076906.032322: sendbackup: 90: normal(|): DUMP: level 0 dump on Fri Sep 4 09:08:14 2009 1252076906.032667: sendbackup: 90: normal(|): DUMP: DUMP IS DONE 1252076906.032777: sendbackup: Parsed backup messages 1252076906.032848: sendbackup: pid 23986 finish time Fri Sep 4 09:08:26 2009 Thanks, Michael On Fri, Sep 4, 2009 at 11:11 AM, Dustin J. Mitchell dus...@zmanda.comwrote: Attached is a test program I just put together which does about what Jean-Louis specified above (with the addition of some closed fd's). This works fine on my mac, which is the closest approximation to OpenBSD I have access to at the moment. How does it work on 4.5? The file-descriptor gymnastics that amandad performs is pretty crazy, so it's quite possible I've misread the code and it's doing something stupid, but please let me know what this test program does. Second, in Jean-Louis' minimal patch, can you change it to read g_debug(first, fcntl returns %d; O_NONBLOCK=%d, fcntl(datafd, F_GETFL, 0), O_NONBLOCK); g_debug(and then fcntl returns %d; O_NONBLOCK=%d, fcntl(datafd, F_GETFL, 0), O_NONBLOCK); and let me know what the corresponding lines in the debug file say? It will be interesting to know if the descriptor flags have O_NONBLOCK set either time. We'll get to the bottom of this :) Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 04, 2009 at 17:13:19 -0400, Dustin J. Mitchell wrote: On Fri, Sep 4, 2009 at 4:57 PM, Michael Burkbur...@gmail.com wrote: Thanks again for your help. Here's the output of the test prog: Looks just like it does locally. If the test had managed to reproduce this failure, then I would have expected to see write: Resource temporarily unavailable That means something deeper is going on here. I wonder if more data needs to get sent down the pipe before the unexpected Resource temporarily unavailable result is returned? And here's the output after the patch change (again with the 0831 snapshot): And this shows me that, as far as the kernel is concerned, O_NONBLOCK is not set on the file descriptor even on the first call. Since my test program hasn't replicated the error, I think we should explore a bit more before going to the openbsd lists. What I'd like to put together is a patch to amandad and sendbackup that posts a debug message about every dup(), dup2(), pipe(), fcntl(), and close() operation that they perform. I'll whip that up and send it along in a few minutes. If I have followed this thread correctly, it seems like the tests show that the kernel treating the file descriptor as if O_NONBLOCK is set even when it's not... but seems to revert to the correct behavior if fcntl(XX, F_GETFL, 0) is called. If just running a F_GETFL operation changes the behavior of a file handle, it seems to point to a kernel bug of some kind But the tracing you are talking about, combined with moving the F_GETFL operation to different stages of the program flow may help narrow down which point in the file-descriptor gymnastics is triggering that bug. For example, if you found that adding fcntl(3, F_GETFL, 0); right after the dup2(input, 3); line in that in sendbackup.c:start_index() behaved differently than adding fcntl(input, F_GETFL, 0); in that same spot, that would tend to point to something going wrong in dup2 Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
Nathan, I think the following truth table will answer all your questions! (view in monospace font) Patch gzipIndex Result no no no dump failed (end of tape) no no yes index tee cannot write no yes no gzip strange, dump failed no yes yes gzip strange, index tee strange, dump failed yes * * success, nothing strange On Fri, Sep 4, 2009 at 3:01 PM, Nathan Stratton Treadway natha...@ontko.com wrote: On Wed, Sep 02, 2009 at 11:11:39 -0600, Michael Burk wrote: This was a good idea; I tried it with one modification: I determined earlier that the failure happens without indexing also, so I added just the line: fcntl(datafd, F_GETFL, 0); and that fixed the problem as well. So I guess this is truly the minimal patch! I tried to understand the program flow for sendbackup.c and sendbackup_dump.c, and I agree with Dustin that it is pretty crazy, but as far as I can tell the only place that would generate the message index tee cannot write [ is from the following code found in the start_index() function in sendbackup.c: just_written = full_write(3, ptr, bytes_read); if (just_written (size_t)bytes_read) { error(_(index tee cannot write [%s]), strerror(errno)); /*NOTREACHED*/ } else { Tracing backwards from there, it appears that the file descriptor #3 in this chunck of code will point to the same file as the datafd referenced in the patch you have been trying only if client-side compression is not turned on. If it is turned on, it looks like fd #3 will instead point to the input side of the process that is doing the compression. So, if other avenues of research don't lead to a solution first, it might be interesting to try turning on client side compression (both with and without the one-line patch applied), to see if a similar error occurs in that case (and if so, whether the error is still in the index- tee-write operation or if the compression process complains instead). Also, when switched to your one-line patch (where you call fnctl() on datafd only), did you start getting the strange(?): sed: stdout: Resource temporarily unavailable messages in the log file again? Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txthttp://www.ontko.com/%7Enathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 04, 2009 at 16:41:31 -0600, Michael Burk wrote: Nathan, I think the following truth table will answer all your questions! (view in monospace font) Patch gzipIndex Result no no no dump failed (end of tape) no no yes index tee cannot write no yes no gzip strange, dump failed no yes yes gzip strange, index tee strange, dump failed yes * * success, nothing strange Most of them :) Am I guessing correctly that the Patch here was one of the earlier patches that touched all three file descriptors? And that when you tried the one-line patch (that touched only datafd), your dump succeeded, but sed complained about stdout again? Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
Sorry I didn't make that clear. The patch was the one-liner (datafd only). I ran it again with the Dustin's 2-liner with the same results. BTW, I ran these latest tests (all 8 runs) on the 0904 snapshot. It builds cleanly on OpenBSD now. -- Michael On Fri, Sep 4, 2009 at 4:47 PM, Nathan Stratton Treadway natha...@ontko.com wrote: On Fri, Sep 04, 2009 at 16:41:31 -0600, Michael Burk wrote: Nathan, I think the following truth table will answer all your questions! (view in monospace font) Patch gzipIndex Result no no no dump failed (end of tape) no no yes index tee cannot write no yes no gzip strange, dump failed no yes yes gzip strange, index tee strange, dump failed yes * * success, nothing strange Most of them :) Am I guessing correctly that the Patch here was one of the earlier patches that touched all three file descriptors? And that when you tried the one-line patch (that touched only datafd), your dump succeeded, but sed complained about stdout again? Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txthttp://www.ontko.com/%7Enathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 04, 2009 at 16:49:24 -0600, Michael Burk wrote: Sorry I didn't make that clear. The patch was the one-liner (datafd only). I ran it again with the Dustin's 2-liner with the same results. Ah, interesting. I wonder if doing a one-line patch against (say) indexfd instead would also prevent the index tee cannot write [ error message. That would imply that just making _any_ fnctl(XX, F_GETFL,0) call is sufficient to get the correct O_NONBLOCK behavior on _all_ file descriptors Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 04, 2009 at 19:40:28 -0400, Nathan Stratton Treadway wrote: On Fri, Sep 04, 2009 at 16:49:24 -0600, Michael Burk wrote: Sorry I didn't make that clear. The patch was the one-liner (datafd only). I ran it again with the Dustin's 2-liner with the same results. Ah, interesting. I wonder if doing a one-line patch against (say) indexfd instead would also prevent the index tee cannot write [ error message. That would imply that just making _any_ fnctl(XX, F_GETFL,0) call is sufficient to get the correct O_NONBLOCK behavior on _all_ file descriptors But, on the other hand, I don't immediately understand why the one-line datafd patch would avoid the sed stdout error that you got when tried Jean-Louis's pipe-blocking.diff patch [from Aug 24], as you reported on Aug 30. That patch, I'm assuming, affected the file descriptor called datafd (though by a different name), but not indexfd, and it seemed to allow the index tee write to succeed, thus allowing the data to flow along far enough to reach the stdout error from sed (which, I belive, is tied to that indexfd file descriptor). The later three-line patch touched both datafd and indexfd, and it seemed to fix that second problem (and the first stayed fixed as well). But it sounds like you are saying that the one-line patch, which touched only datafd, actually fixed both problems as well Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Sep 4, 2009 at 6:21 PM, Nathan Stratton Treadway natha...@ontko.com wrote: snip But it sounds like you are saying that the one-line patch, which touched only datafd, actually fixed both problems as well Exactly.
Re: Backup issues with OpenBSD 4.5 machines
This was a good idea; I tried it with one modification: I determined earlier that the failure happens without indexing also, so I added just the line: fcntl(datafd, F_GETFL, 0); and that fixed the problem as well. So I guess this is truly the minimal patch! -- Michael On Tue, Sep 1, 2009 at 1:43 PM, Nathan Stratton Treadway natha...@ontko.com wrote: On Tue, Sep 01, 2009 at 11:31:26 -0600, Michael Burk wrote: I applied the 3-line patch to the 0831 snapshot and ran a full backup on both machines, with 4 file systems each. All 8 completed successfully with no strange messages. Next, I commented out the 3 new lines and tried the backup again on one of the machines. This time all 4 file systems failed; e.g.: [...] So it seems reliable that those 3 lines fix the problem somehow. Anything else you want to try before I ask for help on the OpenBSD list? I'm no expert on this topic, but if I were investigating something like this, I'd be curious to know if all three of the lines in the patch were necessary for the fix. Since, as you pointed out, the error seems to be tied to the indexing subprocess, I wonder what would happen if you included only the one fcntl(indexfd, F_GETFL, 0); line of the patch, but not the other two Nathan Nathan Stratton Treadway - natha...@ontko.com - Mid-Atlantic region Ray Ontko Co. - Software consulting services - http://www.ontko.com/ GPG Key: http://www.ontko.com/~nathanst/gpg_key.txthttp://www.ontko.com/%7Enathanst/gpg_key.txt ID: 1023D/ECFB6239 Key fingerprint = 6AD8 485E 20B9 5C71 231C 0C32 15F3 ADCD ECFB 6239
Re: Backup issues with OpenBSD 4.5 machines
I modified sendbackup-dump.c to run ktrace, e.g.: /usr/bin/ktrace -id -t censw -f /tmp/sendbackup.trc /sbin/dump 0usf 1048576 - /dev/rsd0d Unfortunately, I don't get a backup, even with the patch applied. The trace output shows write errors because of a broken pipe with or without the patches, like this: 21835 dump CALL write(0x2,0x53fbec,0x8) 21835 dump RET write -1 errno 32 Broken pipe So I suspect my approach is not correct. Any other ideas how I might get some useful trace output? Thanks, Michael On Tue, Sep 1, 2009 at 1:25 PM, Dustin J. Mitchell dus...@zmanda.comwrote: On Tue, Sep 1, 2009 at 3:18 PM, Jean-Louis Martineaumartin...@zmanda.com wrote: I have nothing else to try. The order of system call is a follow: If it's not too hard, it would be nice to have a ktrace or equivalent of this, first to look at here, and second to take to the OpenBSD list. I know that's tricky since this is a daemon process.. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Wed, Sep 2, 2009 at 1:07 PM, Michael Burkbur...@gmail.com wrote: So I suspect my approach is not correct. Any other ideas how I might get some useful trace output? Can you have amandad sleep for, say, 120 seconds just before it launches sendbackup, and somenow notify you of the pid to which you should attach ktrace? Maybe by writing it to /dev/console or to syslog? By the way, it's the launching of sendbackup that seems to be failing, not the launching of dump. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
We need to find a minimal patch that fix the problem. Cat you try the attached patch? Jean-Louis Michael Burk wrote: I applied the patches to the 0827 version again (had to do the sendbackup-dump.c patch by hand since the patch was for another version). I ran a full backup again, and all 3 file systems were successful and there were no strange messages. The debug logs show these lines: 1251756671.281800: sendbackup: fd 3 is set with O_NONBLOCK before dup2 1251756681.457166: sendbackup: Index created successfully This is looking like some bizarre instance of the observer effect! Index: client-src/sendbackup.c === --- client-src/sendbackup.c (revision 2126) +++ client-src/sendbackup.c (working copy) @@ -406,6 +406,11 @@ mesgfd = DATA_FD_OFFSET + 2; indexfd = DATA_FD_OFFSET + 4; } + +fcntl(datafd, F_GETFL, 0); +fcntl(mesgfd, F_GETFL, 0); +fcntl(indexfd, F_GETFL, 0); + if (!dle-create_index) indexfd = -1;
Re: Backup issues with OpenBSD 4.5 machines
I checked the errata for OpenBSD 4.5, but saw nothing that looked related. I applied the patch to the 0831 snapshot and am building it now. After we find the minimal patch, as Jean-Louis said, I'll post on the OpenBSD-misc list to see if anyone has an explanation. Thanks guys for working on this - I know it's a frustrating one. -- Michael On Tue, Sep 1, 2009 at 8:33 AM, Dustin J. Mitchell dus...@zmanda.comwrote: On Tue, Sep 1, 2009 at 7:45 AM, Jean-Louis Martineaumartin...@zmanda.com wrote: We need to find a minimal patch that fix the problem. Cat you try the attached patch? This is starting to look like a kernel bug -- is there an associated OpenBSD bug or something that we could reference in comments in the code to explain this strange formulation? Also, we'll probably need to do this everywhere we fork and plumb a new process -- in pipespawn.c at least. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 1, 2009 at 7:45 AM, Jean-Louis Martineaumartin...@zmanda.com wrote: We need to find a minimal patch that fix the problem. Cat you try the attached patch? This is starting to look like a kernel bug -- is there an associated OpenBSD bug or something that we could reference in comments in the code to explain this strange formulation? Also, we'll probably need to do this everywhere we fork and plumb a new process -- in pipespawn.c at least. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
Since the logs indicate an error with the index tee, I thought I'd try turning off the index generation. The backup still failed, but with a different error: 1251826429.341455: sendbackup: pid 26726 ruid 150 euid 150 version 2.6.2alpha: start at Tue Sep 1 11:33:49 2009 1251826429.342070: sendbackup: Version 2.6.2alpha 1251826429.364640: sendbackup: pid 26726 ruid 150 euid 150 version 2.6.2alpha: rename at Tue Sep 1 11:33:49 2009 1251826429.366628: sendbackup: Parsed request as: program `DUMP' 1251826429.35: sendbackup: disk `/' 1251826429.366690: sendbackup: device `/' 1251826429.366714: sendbackup: level 0 1251826429.366737: sendbackup: since NODATE 1251826429.366761: sendbackup: options `' 1251826429.367758: sendbackup: start: zirconium.example.com:/ lev 0 1251826429.369745: sendbackup: dumping device '/dev/rsd1a' with 'ffs' 1251826429.370181: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd1a in pipeline 1251826429.375384: sendbackup: Started backup 1251826429.395855: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Tue Sep 1 11:33:49 2009 1251826429.460067: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251826429.46: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd1a (/) to standard output 1251826429.538792: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251826430.811052: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251826430.812848: sendbackup: 90: normal(|): DUMP: estimated 47600 tape blocks. 1251826430.814960: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Tue Sep 1 11:33:50 2009 1251826430.820737: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251826430.829936: sendbackup: 90: normal(|): DUMP: End of tape detected 1251826430.833563: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Tue Sep 1 11:33:50 2009 1251826430.836947: sendbackup: 90: normal(|): DUMP: Change Volumes: Mount volume #2 1251826430.838051: sendbackup: 86: strange(?): DUMP: fopen on /dev/tty fails: Device not configured 1251826430.839081: sendbackup: 90: normal(|): DUMP: The ENTIRE dump is aborted. 1251826430.839642: sendbackup: critical (fatal): error [dump (1011) /sbin/dump returned 3] Hope that helps. -- Michael On Tue, Sep 1, 2009 at 11:31 AM, Michael Burk bur...@gmail.com wrote: I applied the 3-line patch to the 0831 snapshot and ran a full backup on both machines, with 4 file systems each. All 8 completed successfully with no strange messages. Next, I commented out the 3 new lines and tried the backup again on one of the machines. This time all 4 file systems failed; e.g.: 1251825776.552104: sendbackup: pid 24374 ruid 150 euid 150 version 2.6.2alpha: start at Tue Sep 1 11:22:56 2009 1251825776.552688: sendbackup: Version 2.6.2alpha 1251825776.575326: sendbackup: pid 24374 ruid 150 euid 150 version 2.6.2alpha: rename at Tue Sep 1 11:22:56 2009 1251825776.577315: sendbackup: Parsed request as: program `DUMP' 1251825776.577352: sendbackup: disk `/' 1251825776.577377: sendbackup: device `/' 1251825776.577400: sendbackup: level 0 1251825776.577424: sendbackup: since NODATE 1251825776.577446: sendbackup: options `' 1251825776.578210: sendbackup: start: zirconium.example.com:/ lev 0 1251825776.579426: sendbackup: dumping device '/dev/rsd1a' with 'ffs' 1251825776.582918: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd1a in pipeline 1251825776.587089: sendbackup: Started backup 1251825776.619734: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Tue Sep 1 11:22:56 2009 1251825776.632529: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251825776.721646: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251825776.722686: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd1a (/) to standard output 1251825776.779507: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251825778.010107: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251825778.012085: sendbackup: 90: normal(|): DUMP: estimated 47600 tape blocks. 1251825778.013367: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Tue Sep 1 11:22:58 2009 1251825778.020808: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251825778.020918: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251825778.022805: sendbackup: 115: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1251825778.042344: sendbackup: 90:
Re: Backup issues with OpenBSD 4.5 machines
I applied the 3-line patch to the 0831 snapshot and ran a full backup on both machines, with 4 file systems each. All 8 completed successfully with no strange messages. Next, I commented out the 3 new lines and tried the backup again on one of the machines. This time all 4 file systems failed; e.g.: 1251825776.552104: sendbackup: pid 24374 ruid 150 euid 150 version 2.6.2alpha: start at Tue Sep 1 11:22:56 2009 1251825776.552688: sendbackup: Version 2.6.2alpha 1251825776.575326: sendbackup: pid 24374 ruid 150 euid 150 version 2.6.2alpha: rename at Tue Sep 1 11:22:56 2009 1251825776.577315: sendbackup: Parsed request as: program `DUMP' 1251825776.577352: sendbackup: disk `/' 1251825776.577377: sendbackup: device `/' 1251825776.577400: sendbackup: level 0 1251825776.577424: sendbackup: since NODATE 1251825776.577446: sendbackup: options `' 1251825776.578210: sendbackup: start: zirconium.example.com:/ lev 0 1251825776.579426: sendbackup: dumping device '/dev/rsd1a' with 'ffs' 1251825776.582918: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd1a in pipeline 1251825776.587089: sendbackup: Started backup 1251825776.619734: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Tue Sep 1 11:22:56 2009 1251825776.632529: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251825776.721646: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251825776.722686: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd1a (/) to standard output 1251825776.779507: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251825778.010107: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251825778.012085: sendbackup: 90: normal(|): DUMP: estimated 47600 tape blocks. 1251825778.013367: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Tue Sep 1 11:22:58 2009 1251825778.020808: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251825778.020918: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251825778.022805: sendbackup: 115: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1251825778.042344: sendbackup: 90: normal(|): DUMP: Broken pipe 1251825778.047559: sendbackup: 90: normal(|): DUMP: The ENTIRE dump is aborted. 1251825778.048234: sendbackup: critical (fatal): error [dump (21317) /sbin/dump returned 3] So it seems reliable that those 3 lines fix the problem somehow. Anything else you want to try before I ask for help on the OpenBSD list? Thanks, Michael On Tue, Sep 1, 2009 at 9:32 AM, Michael Burk bur...@gmail.com wrote: I checked the errata for OpenBSD 4.5, but saw nothing that looked related. I applied the patch to the 0831 snapshot and am building it now. After we find the minimal patch, as Jean-Louis said, I'll post on the OpenBSD-misc list to see if anyone has an explanation. Thanks guys for working on this - I know it's a frustrating one. -- Michael On Tue, Sep 1, 2009 at 8:33 AM, Dustin J. Mitchell dus...@zmanda.comwrote: On Tue, Sep 1, 2009 at 7:45 AM, Jean-Louis Martineaumartin...@zmanda.com wrote: We need to find a minimal patch that fix the problem. Cat you try the attached patch? This is starting to look like a kernel bug -- is there an associated OpenBSD bug or something that we could reference in comments in the code to explain this strange formulation? Also, we'll probably need to do this everywhere we fork and plumb a new process -- in pipespawn.c at least. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
Michael Burk wrote: So it seems reliable that those 3 lines fix the problem somehow. Anything else you want to try before I ask for help on the OpenBSD list? I have nothing else to try. The order of system call is a follow: In amandad process: pipe(pipefd) dup2(pipefd[1],b) fork in the child: exec sendbackup process dup2(b,c) write c #fail with EAGAIN The fix is to do the fcntl call at the beginning of the sendbackup process, before the dup2. Jean-Louis
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Sep 1, 2009 at 3:18 PM, Jean-Louis Martineaumartin...@zmanda.com wrote: I have nothing else to try. The order of system call is a follow: If it's not too hard, it would be nice to have a ktrace or equivalent of this, first to look at here, and second to take to the OpenBSD list. I know that's tricky since this is a daemon process.. Dustin -- Open Source Storage Engineer http://www.zmanda.com
Re: Backup issues with OpenBSD 4.5 machines
You get the error in the index pipe instead of the data path. The backup is correct but your index is empty. That's why you get a STRANGE result instead of a failure. Can you post a sendbackup.*.debug for a dle that failed? Jean-Louis Michael Burk wrote: Hello, I applied the patches to the 20090827 snapshot. I tried it on two OpenBSD 4.5 sparc64 systems, forcing both to do full backups. One system seemed to work on all 3 file systems, the other failed on all 4 file systems with the same errors as before (exactly like what Stan reported). I'm using bsdtcp auth. The system that worked, however, did give a strange result in the report on one filesystem. Here's the corresponding sendbackup debug file: 1251692700.529842: sendbackup: pid 30459 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 2 2:25:00 2009 1251692700.530411: sendbackup: Version 2.6.2alpha 1251692700.553135: sendbackup: pid 30459 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:25:00 2009 1251692700.555394: sendbackup: Parsed request as: program `DUMP' 1251692700.555435: sendbackup: disk `/usr' 1251692700.555461: sendbackup: device `/usr' 1251692700.555484: sendbackup: level 0 1251692700.07: sendbackup: since NODATE 1251692700.30: sendbackup: options `' 1251692700.556750: sendbackup: start: zirconium.example.com:/usr lev 0 1251692700.559495: sendbackup: dumping device '/dev/rsd1d' with 'ffs' 1251692700.566094: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd1d in pipeline 1251692700.571185: sendbackup: Started backup 1251692700.572425: sendbackup: fd 3 is set with O_NONBLOCK before dup2 1251692700.597388: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251692700.611526: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:25: 00 2009 1251692700.669703: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251692700.670724: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd1d (/usr) to standard output 1251692700.763600: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251692705.618345: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251692705.620040: sendbackup: 90: normal(|): DUMP: estimated 1330085 tape blocks. 1251692705.623976: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:25:05 2009 1251692705.650687: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251692745.051547: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251692747.915767: sendbackup: 115: strange(?): sed: stdout: Resource temporarily unavailable 1251692954.633083: sendbackup: 43:size(|): DUMP: 1437829 tape blocks 1251692954.637414: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:25:00 2009 1251692954.641414: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Sun Aug 30 22:29:14 2009 1251692954.641858: sendbackup: Index pipe exited with status 1 1251692954.642719: sendbackup: 90: normal(|): DUMP: Volume 1 took 0:04:09 1251692954.643918: sendbackup: 90: normal(|): DUMP: Volume 1 transfer rate: 5774 KB/s 1251692954.645095: sendbackup: 90: normal(|): DUMP: Date this dump completed: Sun Aug 30 22:29:14 2009 1251692954.646239: sendbackup: 90: normal(|): DUMP: Average transfer rate: 5774 KB/s 1251692954.647430: sendbackup: 90: normal(|): DUMP: level 0 dump on Sun Aug 30 22:25:00 2009 1251692954.653480: sendbackup: 90: normal(|): DUMP: DUMP IS DONE 1251692954.653841: sendbackup: Parsed backup messages 1251692954.654032: sendbackup: pid 30459 finish time Sun Aug 30 22:29:14 2009 -- Michael On Fri, Aug 28, 2009 at 5:22 AM, stan st...@panix.com mailto:st...@panix.com wrote: On Thu, Aug 27, 2009 at 03:35:47PM -0600, Michael Burk wrote: Cool - can you send me the patch? Or is it already in the 0826 snapshot? FYI - I got 2.5.0p2 working, but could never get 2.5.1p3 working, even with the same config. No idea why. I started looking at the source code last night in the 0825 snapshot. I'll discontinue that search if there's a patch. This is interesting. I sent Jean-Louis Martineau an email, saying teaks for fixing things, and he replied that the patch he sent me just added debugging, it was not intended to fix anything. Bit, using this snapshot 20090813, and the patch, which I will send to you. I have it working on my 3 OpenBSD 4.5 machines that I have upgraded so far. I ran several test yesterday, and a full fledged backup run last night, and I have not seen any issues since installing this code. I am going to attach the patch to
Re: Backup issues with OpenBSD 4.5 machines
Here are the two sendbackup.*.debug files for the / fs. First: 1251693102.343436: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 22:31:42 2009 1251693102.345326: sendbackup: Version 2.6.2alpha 1251693102.379331: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:31:42 2009 1251693102.382860: sendbackup: Parsed request as: program `DUMP' 1251693102.382940: sendbackup: disk `/' 1251693102.382991: sendbackup: device `/' 1251693102.383039: sendbackup: level 0 1251693102.383086: sendbackup: since NODATE 1251693102.383134: sendbackup: options `' 1251693102.388833: sendbackup: start: bromine.example.com:/ lev 0 1251693102.391475: sendbackup: dumping device '/dev/rsd0a' with 'ffs' 1251693102.398325: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0a in pipeline 1251693102.403120: sendbackup: Started backup 1251693102.461663: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:31:42 2009 1251693102.468669: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251693102.487536: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251693102.519939: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0a (/) to standard output 1251693102.588086: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251693104.583392: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251693104.585824: sendbackup: 90: normal(|): DUMP: estimated 35134 tape blocks. 1251693104.595314: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:31:44 2009 1251693104.615444: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251693104.919691: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251693105.800795: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251693105.803486: sendbackup: 115: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1251693105.827339: sendbackup: 90: normal(|): DUMP: Broken pipe 1251693105.832068: sendbackup: 90: normal(|): DUMP: The ENTIRE dump is aborted. 1251693105.839675: sendbackup: critical (fatal): error [dump (7257) /sbin/dump returned 3] Second: 1251693123.840560: sendbackup: pid 32144 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 22:32:03 2009 1251693123.841613: sendbackup: Version 2.6.2alpha 1251693123.892773: sendbackup: pid 32144 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:32:03 2009 1251693123.896360: sendbackup: Parsed request as: program `DUMP' 1251693123.896437: sendbackup: disk `/' 1251693123.896487: sendbackup: device `/' 1251693123.896535: sendbackup: level 0 1251693123.896584: sendbackup: since NODATE 1251693123.896632: sendbackup: options `' 1251693123.902332: sendbackup: start: bromine.example.com:/ lev 0 1251693123.905012: sendbackup: dumping device '/dev/rsd0a' with 'ffs' 1251693123.911778: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0a in pipeline 1251693123.916860: sendbackup: Started backup 1251693123.992261: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251693123.995743: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:32:03 2009 1251693124.009013: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251693124.038977: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0a (/) to standard output 1251693124.143454: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251693126.173916: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251693126.176287: sendbackup: 90: normal(|): DUMP: estimated 35134 tape blocks. 1251693126.185677: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:32:06 2009 1251693126.205305: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251693126.505055: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251693126.797271: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251693126.799929: sendbackup: 115: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1251693126.821488: sendbackup: 90: normal(|): DUMP: Broken pipe 1251693126.825377: sendbackup: 90: normal(|): DUMP: The ENTIRE dump is aborted. 1251693126.833573: sendbackup: critical (fatal): error [dump (22791) /sbin/dump returned 3] Thanks, Michael On Mon, Aug 31, 2009 at
Re: Backup issues with OpenBSD 4.5 machines
Do you have the patch I sent to stan on this system? The patch check before and after the write if the pipe is in O_NONBLOCK or not and give an error if it is. I'm totally lost since it is in blocking mode and you get EAGAIN, which is impossible Jean-louis Michael Burk wrote: Here are the two sendbackup.*.debug files for the / fs. First: 1251693102.343436: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 22:31:42 2009 1251693102.345326: sendbackup: Version 2.6.2alpha 1251693102.379331: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:31:42 2009 1251693102.382860: sendbackup: Parsed request as: program `DUMP' 1251693102.382940: sendbackup: disk `/' 1251693102.382991: sendbackup: device `/' 1251693102.383039: sendbackup: level 0 1251693102.383086: sendbackup: since NODATE 1251693102.383134: sendbackup: options `' 1251693102.388833: sendbackup: start: bromine.example.com:/ lev 0 1251693102.391475: sendbackup: dumping device '/dev/rsd0a' with 'ffs' 1251693102.398325: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0a in pipeline 1251693102.403120: sendbackup: Started backup 1251693102.461663: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:31:42 2009 1251693102.468669: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251693102.487536: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251693102.519939: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0a (/) to standard output 1251693102.588086: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251693104.583392: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251693104.585824: sendbackup: 90: normal(|): DUMP: estimated 35134 tape blocks. 1251693104.595314: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:31:44 2009 1251693104.615444: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251693104.919691: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251693105.800795: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251693105.803486: sendbackup: 115: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1251693105.827339: sendbackup: 90: normal(|): DUMP: Broken pipe 1251693105.832068: sendbackup: 90: normal(|): DUMP: The ENTIRE dump is aborted. 1251693105.839675: sendbackup: critical (fatal): error [dump (7257) /sbin/dump returned 3] Second: 1251693123.840560: sendbackup: pid 32144 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 22:32:03 2009 1251693123.841613: sendbackup: Version 2.6.2alpha 1251693123.892773: sendbackup: pid 32144 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:32:03 2009 1251693123.896360: sendbackup: Parsed request as: program `DUMP' 1251693123.896437: sendbackup: disk `/' 1251693123.896487: sendbackup: device `/' 1251693123.896535: sendbackup: level 0 1251693123.896584: sendbackup: since NODATE 1251693123.896632: sendbackup: options `' 1251693123.902332: sendbackup: start: bromine.example.com:/ lev 0 1251693123.905012: sendbackup: dumping device '/dev/rsd0a' with 'ffs' 1251693123.911778: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0a in pipeline 1251693123.916860: sendbackup: Started backup 1251693123.992261: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251693123.995743: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:32:03 2009 1251693124.009013: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251693124.038977: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0a (/) to standard output 1251693124.143454: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251693126.173916: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251693126.176287: sendbackup: 90: normal(|): DUMP: estimated 35134 tape blocks. 1251693126.185677: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:32:06 2009 1251693126.205305: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251693126.505055: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251693126.797271: sendbackup: critical (fatal): index tee cannot write [Resource temporarily unavailable] 1251693126.799929: sendbackup: 115: strange(?): sendbackup: index tee
Re: Backup issues with OpenBSD 4.5 machines
Hi Jean-Louis, I thought I had applied the patches on this machine also, but it turns out I didn't (sorry about that). I applied the patches and ran a new dump. This time all 4 file systems succeeded, though /usr got the same strange message as on the other machine: 1251752282.483677: sendbackup: pid 15574 ruid 150 euid 150 version 2.6.2alpha: start at Mon Aug 31 14:58:02 2009 1251752282.484788: sendbackup: Version 2.6.2alpha 1251752282.523952: sendbackup: pid 15574 ruid 150 euid 150 version 2.6.2alpha: rename at Mon Aug 31 14:58:02 2009 1251752282.527795: sendbackup: Parsed request as: program `DUMP' 1251752282.527880: sendbackup: disk `/usr' 1251752282.527931: sendbackup: device `/usr' 1251752282.527978: sendbackup: level 0 1251752282.528025: sendbackup: since NODATE 1251752282.528072: sendbackup: options `' 1251752282.533704: sendbackup: start: bromine.example.com:/usr lev 0 1251752282.537190: sendbackup: dumping device '/dev/rsd0d' with 'ffs' 1251752282.543664: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0d in pipeline 1251752282.548532: sendbackup: Started backup 1251752282.561558: sendbackup: fd 3 is set with O_NONBLOCK before dup2 1251752282.625055: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251752282.627853: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Mon Aug 31 14:58:02 2009 1251752282.641040: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251752282.754862: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0d (/usr) to standard output 1251752282.792706: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251752286.777085: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251752286.779538: sendbackup: 90: normal(|): DUMP: estimated 653358 tape blocks. 1251752286.789190: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Mon Aug 31 14:58:06 2009 1251752286.827782: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251752290.646004: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251752290.822019: sendbackup: 115: strange(?): sed: stdout: Resource temporarily unavailable 1251752424.852658: sendbackup: 43:size(|): DUMP: 664333 tape blocks 1251752424.854957: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Mon Aug 31 14:58:02 2009 1251752424.857083: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Mon Aug 31 15:00:24 2009 1251752424.859064: sendbackup: 90: normal(|): DUMP: Volume 1 took 0:02:18 1251752424.860994: sendbackup: 90: normal(|): DUMP: Volume 1 transfer rate: 4814 KB/s 1251752424.863042: sendbackup: 90: normal(|): DUMP: Date this dump completed: Mon Aug 31 15:00:24 2009 1251752424.864993: sendbackup: 90: normal(|): DUMP: Average transfer rate: 4814 KB/s 1251752424.869138: sendbackup: 90: normal(|): DUMP: level 0 dump on Mon Aug 31 14:58:02 2009 1251752424.878026: sendbackup: 90: normal(|): DUMP: DUMP IS DONE 1251752424.888324: sendbackup: Index pipe exited with status 1 1251752424.906986: sendbackup: Parsed backup messages 1251752424.907361: sendbackup: pid 15574 finish time Mon Aug 31 15:00:24 2009 So now I've seen the same behavior that Stan noted: applying the debug patch seems to fix the dump. Also, in case you didn't see it before, here is a snippet from an email thread Stan started on an OpenBSD list: I am trying to get Amanda of a recent vintage working on 4.5. The developrs don't understand how writing to a blocking pipe can return EAGAIN. Should the write just block? Yes, it should, and a quick, about-to-go-to-bed readthrough of the pipe write path shows that it shouldn't; is Amanda doing fcntls to the fd? If so, check those to be absolutely sure that they're not doing something hinkey. Thanks, Michael On Mon, Aug 31, 2009 at 2:41 PM, Jean-Louis Martineau martin...@zmanda.comwrote: Do you have the patch I sent to stan on this system? The patch check before and after the write if the pipe is in O_NONBLOCK or not and give an error if it is. I'm totally lost since it is in blocking mode and you get EAGAIN, which is impossible Jean-louis Michael Burk wrote: Here are the two sendbackup.*.debug files for the / fs. First: 1251693102.343436: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 22:31:42 2009 1251693102.345326: sendbackup: Version 2.6.2alpha 1251693102.379331: sendbackup: pid 29087 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:31:42 2009 1251693102.382860: sendbackup: Parsed request as: program `DUMP' 1251693102.382940: sendbackup: disk `/' 1251693102.382991: sendbackup: device `/'
Re: Backup issues with OpenBSD 4.5 machines
amanda doesn't do fcntl on these file descriptor. But my patch do: fcntl(fd, F_GETFL, 0) to check if it have O_NONBLOCK set, the patch doesn't change it. Can you try the attached patch, it do the same trick for the index file descriptor (remove previous patch before applying). Jean-Louis Michael Burk wrote: Hi Jean-Louis, I thought I had applied the patches on this machine also, but it turns out I didn't (sorry about that). I applied the patches and ran a new dump. This time all 4 file systems succeeded, though /usr got the same strange message as on the other machine: 1251752282.483677: sendbackup: pid 15574 ruid 150 euid 150 version 2.6.2alpha: start at Mon Aug 31 14:58:02 2009 1251752282.484788: sendbackup: Version 2.6.2alpha 1251752282.523952: sendbackup: pid 15574 ruid 150 euid 150 version 2.6.2alpha: rename at Mon Aug 31 14:58:02 2009 1251752282.527795: sendbackup: Parsed request as: program `DUMP' 1251752282.527880: sendbackup: disk `/usr' 1251752282.527931: sendbackup: device `/usr' 1251752282.527978: sendbackup: level 0 1251752282.528025: sendbackup: since NODATE 1251752282.528072: sendbackup: options `' 1251752282.533704: sendbackup: start: bromine.example.com:/usr lev 0 1251752282.537190: sendbackup: dumping device '/dev/rsd0d' with 'ffs' 1251752282.543664: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd0d in pipeline 1251752282.548532: sendbackup: Started backup 1251752282.561558: sendbackup: fd 3 is set with O_NONBLOCK before dup2 1251752282.625055: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251752282.627853: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Mon Aug 31 14:58:02 2009 1251752282.641040: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251752282.754862: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd0d (/usr) to standard output 1251752282.792706: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251752286.777085: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251752286.779538: sendbackup: 90: normal(|): DUMP: estimated 653358 tape blocks. 1251752286.789190: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Mon Aug 31 14:58:06 2009 1251752286.827782: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251752290.646004: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251752290.822019: sendbackup: 115: strange(?): sed: stdout: Resource temporarily unavailable 1251752424.852658: sendbackup: 43:size(|): DUMP: 664333 tape blocks 1251752424.854957: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Mon Aug 31 14:58:02 2009 1251752424.857083: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Mon Aug 31 15:00:24 2009 1251752424.859064: sendbackup: 90: normal(|): DUMP: Volume 1 took 0:02:18 1251752424.860994: sendbackup: 90: normal(|): DUMP: Volume 1 transfer rate: 4814 KB/s 1251752424.863042: sendbackup: 90: normal(|): DUMP: Date this dump completed: Mon Aug 31 15:00:24 2009 1251752424.864993: sendbackup: 90: normal(|): DUMP: Average transfer rate: 4814 KB/s 1251752424.869138: sendbackup: 90: normal(|): DUMP: level 0 dump on Mon Aug 31 14:58:02 2009 1251752424.878026: sendbackup: 90: normal(|): DUMP: DUMP IS DONE 1251752424.888324: sendbackup: Index pipe exited with status 1 1251752424.906986: sendbackup: Parsed backup messages 1251752424.907361: sendbackup: pid 15574 finish time Mon Aug 31 15:00:24 2009 So now I've seen the same behavior that Stan noted: applying the debug patch seems to fix the dump. Also, in case you didn't see it before, here is a snippet from an email thread Stan started on an OpenBSD list: I am trying to get Amanda of a recent vintage working on 4.5. The developrs don't understand how writing to a blocking pipe can return EAGAIN. Should the write just block? Yes, it should, and a quick, about-to-go-to-bed readthrough of the pipe write path shows that it shouldn't; is Amanda doing fcntls to the fd? If so, check those to be absolutely sure that they're not doing something hinkey. Thanks, Michael On Mon, Aug 31, 2009 at 2:41 PM, Jean-Louis Martineau martin...@zmanda.com mailto:martin...@zmanda.com wrote: Do you have the patch I sent to stan on this system? The patch check before and after the write if the pipe is in O_NONBLOCK or not and give an error if it is. I'm totally lost since it is in blocking mode and you get EAGAIN, which is impossible Jean-louis Michael Burk wrote: Here are the two sendbackup.*.debug files for the / fs. First: 1251693102.343436: sendbackup: pid 29087 ruid 150 euid
Re: Backup issues with OpenBSD 4.5 machines
Hello, I applied the patches to the 20090827 snapshot. I tried it on two OpenBSD 4.5 sparc64 systems, forcing both to do full backups. One system seemed to work on all 3 file systems, the other failed on all 4 file systems with the same errors as before (exactly like what Stan reported). I'm using bsdtcp auth. The system that worked, however, did give a strange result in the report on one filesystem. Here's the corresponding sendbackup debug file: 1251692700.529842: sendbackup: pid 30459 ruid 150 euid 150 version 2.6.2alpha: start at Sun Aug 30 2 2:25:00 2009 1251692700.530411: sendbackup: Version 2.6.2alpha 1251692700.553135: sendbackup: pid 30459 ruid 150 euid 150 version 2.6.2alpha: rename at Sun Aug 30 22:25:00 2009 1251692700.555394: sendbackup: Parsed request as: program `DUMP' 1251692700.555435: sendbackup: disk `/usr' 1251692700.555461: sendbackup: device `/usr' 1251692700.555484: sendbackup: level 0 1251692700.07: sendbackup: since NODATE 1251692700.30: sendbackup: options `' 1251692700.556750: sendbackup: start: zirconium.example.com:/usr lev 0 1251692700.559495: sendbackup: dumping device '/dev/rsd1d' with 'ffs' 1251692700.566094: sendbackup: Spawning /sbin/dump dump 0usf 1048576 - /dev/rsd1d in pipeline 1251692700.571185: sendbackup: Started backup 1251692700.572425: sendbackup: fd 3 is set with O_NONBLOCK before dup2 1251692700.597388: sendbackup: Started index creator: /sbin/restore -tvf - 21 | sed -e ' s/^leaf[]*[0-9]*[ ]*\.// t /^dir[ ]/ { s/^dir[ ]*[0-9]*[ ]*\.// s%$%/% t } d ' 1251692700.611526: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:25: 00 2009 1251692700.669703: sendbackup: 90: normal(|): DUMP: Date of last level 0 dump: the epoch 1251692700.670724: sendbackup: 90: normal(|): DUMP: Dumping /dev/rsd1d (/usr) to standard output 1251692700.763600: sendbackup: 90: normal(|): DUMP: mapping (Pass I) [regular files] 1251692705.618345: sendbackup: 90: normal(|): DUMP: mapping (Pass II) [directories] 1251692705.620040: sendbackup: 90: normal(|): DUMP: estimated 1330085 tape blocks. 1251692705.623976: sendbackup: 90: normal(|): DUMP: Volume 1 started at: Sun Aug 30 22:25:05 2009 1251692705.650687: sendbackup: 90: normal(|): DUMP: dumping (Pass III) [directories] 1251692745.051547: sendbackup: 90: normal(|): DUMP: dumping (Pass IV) [regular files] 1251692747.915767: sendbackup: 115: strange(?): sed: stdout: Resource temporarily unavailable 1251692954.633083: sendbackup: 43:size(|): DUMP: 1437829 tape blocks 1251692954.637414: sendbackup: 90: normal(|): DUMP: Date of this level 0 dump: Sun Aug 30 22:25:00 2009 1251692954.641414: sendbackup: 90: normal(|): DUMP: Volume 1 completed at: Sun Aug 30 22:29:14 2009 1251692954.641858: sendbackup: Index pipe exited with status 1 1251692954.642719: sendbackup: 90: normal(|): DUMP: Volume 1 took 0:04:09 1251692954.643918: sendbackup: 90: normal(|): DUMP: Volume 1 transfer rate: 5774 KB/s 1251692954.645095: sendbackup: 90: normal(|): DUMP: Date this dump completed: Sun Aug 30 22:29:14 2009 1251692954.646239: sendbackup: 90: normal(|): DUMP: Average transfer rate: 5774 KB/s 1251692954.647430: sendbackup: 90: normal(|): DUMP: level 0 dump on Sun Aug 30 22:25:00 2009 1251692954.653480: sendbackup: 90: normal(|): DUMP: DUMP IS DONE 1251692954.653841: sendbackup: Parsed backup messages 1251692954.654032: sendbackup: pid 30459 finish time Sun Aug 30 22:29:14 2009 -- Michael On Fri, Aug 28, 2009 at 5:22 AM, stan st...@panix.com wrote: On Thu, Aug 27, 2009 at 03:35:47PM -0600, Michael Burk wrote: Cool - can you send me the patch? Or is it already in the 0826 snapshot? FYI - I got 2.5.0p2 working, but could never get 2.5.1p3 working, even with the same config. No idea why. I started looking at the source code last night in the 0825 snapshot. I'll discontinue that search if there's a patch. This is interesting. I sent Jean-Louis Martineau an email, saying teaks for fixing things, and he replied that the patch he sent me just added debugging, it was not intended to fix anything. Bit, using this snapshot 20090813, and the patch, which I will send to you. I have it working on my 3 OpenBSD 4.5 machines that I have upgraded so far. I ran several test yesterday, and a full fledged backup run last night, and I have not seen any issues since installing this code. I am going to attach the patch to this email. I would greatly appreciate it, if you would apply this patch to the above snapshot, test, and tell me what your results are. I am very puzzled at the moment. BTW, is your failure the same as mine? That is amcheck passes, and most times you get a PARTIAL backup? BTW, I am using bsdtcp auth and client compression now, but I have made this work with this code, with bsd auth, and no
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 21, 2009 at 03:53:13PM -0600, Michael Burk wrote: Stan's not alone on this one. I have two OpenBSD 4.5 machines also on Sun SPARC hardware. I had this same trouble a couple months ago with 2.6.1, but didn't have time to look deeper. After seeing this discussion, I built amanda-2.6.2alpha-20090812 (can't get 0820 to compile). I'm getting exactly the same errors. I tried dump with and without compression. My Amanda server is on Solaris 10, and I have a mix of Solaris, Linux, and OpenBSD systems. Except for the above-mentioned build, everything is at 2.6.1p1. There are no firewalls on the Amanda server or the OpenBSD clients. I know me too responses aren't usually helpful, but maybe this will rule out the firewall question, at least. Thanks for confirming the problem report. Any cn hance of you trying 2.500, and 2.5.2 to see if the first works, and the 2nd doesn't? Are you using bsd auth? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
I'm using bsdtcp auth. I'll try 2.5.1 today or tomorrow, to see if I can narrow down the range of releases in which OpenBSD support broke. -- Michael On Mon, Aug 24, 2009 at 5:17 AM, stan st...@panix.com wrote: On Fri, Aug 21, 2009 at 03:53:13PM -0600, Michael Burk wrote: Stan's not alone on this one. I have two OpenBSD 4.5 machines also on Sun SPARC hardware. I had this same trouble a couple months ago with 2.6.1, but didn't have time to look deeper. After seeing this discussion, I built amanda-2.6.2alpha-20090812 (can't get 0820 to compile). I'm getting exactly the same errors. I tried dump with and without compression. My Amanda server is on Solaris 10, and I have a mix of Solaris, Linux, and OpenBSD systems. Except for the above-mentioned build, everything is at 2.6.1p1. There are no firewalls on the Amanda server or the OpenBSD clients. I know me too responses aren't usually helpful, but maybe this will rule out the firewall question, at least. Thanks for confirming the problem report. Any cn hance of you trying 2.500, and 2.5.2 to see if the first works, and the 2nd doesn't? Are you using bsd auth? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
This bug can't be fixed until we understand it. Ask on a OpenBSD list how a write to a blocking pipe can return EAGAIN. Or the pipe semantics changed and they don't default to blocking. You can also try the attached, it check the pipe are opened in blocking mode. Jean-Louis stan wrote: On Fri, Aug 21, 2009 at 01:23:29PM -0600, John Hein wrote: stan wrote at 13:56 -0400 on Aug 21, 2009: OK, I reproduced the failure with only a crossover cable between the test client and the Amanda Master: Just because you're using a crossover cable doesn't rule out firewall or other such socket level interference. I'm not saying that's your problem, but using a crossover cable doesn't rule it out. Fair enough, but I think we can rule that out in this case. First, I control the 2 computersin question. The Master backus up 55 machines a day taht consist of a diverse nix of OS'es, and Amanda versions. It's a Linux machine, and I am not running a firwall on it. The client is an OpenBSD machine. I use these for my firewalls, so I am familiar with configuring firewalls on them, Doing so requires (as a minimum) running pf. I am not running pf on the test machine at all. I am not certain whether the sendbackup executable, or amandad opens the port on the client side. My guess is that it is not suceding in doing this. Thus when the Master tries to connect to it, there is no port for it to connect to. Index: amandad-src/amandad.c === --- amandad-src/amandad.c (revision 2110) +++ amandad-src/amandad.c (working copy) @@ -1517,6 +1517,7 @@ struct active_service *as; pid_t pid; int newfd; +int r; assert(security_handle != NULL); assert(cmd != NULL); @@ -1535,15 +1536,40 @@ error(_(pipe: %s\n), strerror(errno)); /*NOTREACHED*/ } + r = fcntl(data_read[i][0], F_GETFL, 0); + if (r O_NONBLOCK) { + error(_(data_read[%d][0] is non blocking), i); + } + r = fcntl(data_read[i][1], F_GETFL, 0); + if (r O_NONBLOCK) { + error(_(data_read[%d][1] is non blocking), i); + } + if (pipe(data_write[i]) 0) { error(_(pipe: %s\n), strerror(errno)); /*NOTREACHED*/ } + r = fcntl(data_write[i][0], F_GETFL, 0); + if (r O_NONBLOCK) { + error(_(data_write[%d][0] is non blocking), i); + } + r = fcntl(data_write[i][1], F_GETFL, 0); + if (r O_NONBLOCK) { + error(_(data_write[%d][1] is non blocking), i); + } } if (pipe(data_write[STDERR_PIPE]) 0) { error(_(pipe: %s\n), strerror(errno)); /*NOTREACHED*/ } +r = fcntl(data_write[STDERR_PIPE][0], F_GETFL, 0); +if (r O_NONBLOCK) { + error(_(data_write[%d][0] is non blocking), STDERR_PIPE); +} +r = fcntl(data_write[STDERR_PIPE][1], F_GETFL, 0); +if (r O_NONBLOCK) { + error(_(data_write[%d][1] is non blocking), STDERR_PIPE); +} switch(pid = fork()) { case -1:
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 24, 2009 at 02:01:09PM -0400, Jean-Louis Martineau wrote: This bug can't be fixed until we understand it. Agreed. Ask on a OpenBSD list how a write to a blocking pipe can return EAGAIN. Or the pipe semantics changed and they don't default to blocking. I already posted the code snipet to the OpenBSD list, but no one has commented on it yet. I have found, in the past, that having a simple reproducable free standing code snippet that demonstrates the problem, can often result in a better result on mailing lists. I have not had time to try to understnad the code well enough to produce such an example, yet. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 24, 2009 at 02:01:09PM -0400, Jean-Louis Martineau wrote: You can also try the attached, it check the pipe are opened in blocking mode. I will try to test that today. Thanks. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 24, 2009 at 02:01:09PM -0400, Jean-Louis Martineau wrote: You can also try the attached, it check the pipe are opened in blocking mode. OK, I applied that to 2.61 (after figuriing out it would not apply to 2.5.2 :-)). Of 4 DLE's on the test OpenBSD machine 3failed with PARTIAL, one tiny one syas complete, and the only signifcant;y sized on seems to be stuck forever at 0.04%, at least that's what amstatus has been reporting for 15 minutes. I grep'ed the amanda debug files for pipe (that's what the new error messages start with right?) and did not find any instances of this. I can send this debug files, if it will help. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 21, 2009 at 09:57:36AM -0600, John Hein wrote: stan wrote at 10:56 -0400 on Aug 21, 2009: OK here is the latest on this saga :-) On one of the OpenBSD 4.5 machines I have built 2.5.0p1, and was able to back this machine up successfully (using classic UDP based authentication) On another of them, I built 2.5.2p1. The first attempt to back this machine up failed. I checked the log files, and found they were having issues because /etc/amdates was missing. I corrected that, and started a 2nd backup run. (Remember amcheck reports all is well with this machine). I got the following from amstatus when I attempted to back up this machine. Also remember, one of the test I ran with a 2.6.1 client was to connect a test machine directly to the client, using a crossover cable to eliminate any firewall, or router type issues. I am attaching, what I think is, the amadnad debug file associated with this failure. Can anyone suggest what I can do to further troubleshoot this? pb48:wd0f 1 dumper: [could not connect DATA stream: can't connect stream to pb48.meadwestvaco.com port 11996: Connection refused] (10:37:27) . . . amandad: time 30.019: stream_accept: timeout after 30 seconds amandad: time 30.019: security_stream_seterr(0x86b67000, can't accept new stream connection: No such file or directory) amandad: time 30.019: stream 0 accept failed: unknown protocol error amandad: time 30.019: security_stream_close(0x86b67000) amandad: time 60.027: stream_accept: timeout after 30 seconds amandad: time 60.027: security_stream_seterr(0x81212000, can't accept new stream connection: No such file or directory) amandad: time 60.027: stream 1 accept failed: unknown protocol error amandad: time 60.027: security_stream_close(0x81212000) amandad: time 90.035: stream_accept: timeout after 30 seconds amandad: time 90.036: security_stream_seterr(0x84877000, can't accept new stream connection: No such file or directory) amandad: time 90.036: stream 2 accept failed: unknown protocol error amandad: time 90.036: security_stream_close(0x84877000) amandad: time 90.036: security_close(handle=0x81bbf800, driver=0x298a9240 (BSD)) amandad: time 120.044: pid 17702 finish time Fri Aug 21 10:39:27 2009 For some reason the socket is not getting marked ready for read. select(2) is timing out waiting. Firewall setup perhaps? This bit of code in 2.5.2p1's common-src/stream.c is where the failure is happening for you... int stream_accept( int server_socket, int timeout, size_t sendsize, size_t recvsize) { SELECT_ARG_TYPE readset; struct timeval tv; int nfound, connected_socket; int save_errno; int ntries = 0; in_port_t port; assert(server_socket = 0); do { ntries++; memset(tv, 0, SIZEOF(tv)); tv.tv_sec = timeout; memset(readset, 0, SIZEOF(readset)); FD_ZERO(readset); FD_SET(server_socket, readset); nfound = select(server_socket+1, readset, NULL, NULL, tv); if(nfound = 0 || !FD_ISSET(server_socket, readset)) { save_errno = errno; if(nfound 0) { dbprintf((%s: stream_accept: select() failed: %s\n, debug_prefix_time(NULL), strerror(save_errno))); } else if(nfound == 0) { dbprintf((%s: stream_accept: timeout after %d second%s\n, debug_prefix_time(NULL), timeout, (timeout == 1) ? : s)); errno = ENOENT; /* ??? */ return -1; The firts thing I notice when comparing this function in 2.5.0 vs 2.5.2 is that 2.5.0 does: tv.tv_usec = 0; and 2.5.2 does not. Could thim make a difference? Both do tv.tv_sec = timeout; -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
stan wrote at 16:59 -0400 on Aug 24, 2009: The firts thing I notice when comparing this function in 2.5.0 vs 2.5.2 is that 2.5.0 does: tv.tv_usec = 0; and 2.5.2 does not. Could thim make a difference? Both do tv.tv_sec = timeout; In 2.5.2, the memset sets the entire struct to 0. 2.5.0 is slightly more efficient, but otherwise the results wind up being the same. Nothing to see there.
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 08:06:46AM -0400, Jean-Louis Martineau wrote: stan wrote: Any thoughts as to why I have twice as many debug files as I expect? When a dump fail, amanda try it a second time. OK here is the latest on this saga :-) On one of the OpenBSD 4.5 machines I have built 2.5.0p1, and was able to back this machine up successfully (using classic UDP based authentication) On another of them, I built 2.5.2p1. The first attempt to back this machine up failed. I checked the log files, and found they were having issues because /etc/amdates was missing. I corrected that, and started a 2nd backup run. (Remember amcheck reports all is well with this machine). I got the following from amstatus when I attempted to back up this machine. Also remember, one of the test I ran with a 2.6.1 client was to connect a test machine directly to the client, using a crossover cable to eliminate any firewall, or router type issues. I am attaching, what I think is, the amadnad debug file associated with this failure. Can anyone suggest what I can do to further troubleshoot this? pb48:wd0f 1 dumper: [could not connect DATA stream: can't connect stream to pb48.meadwestvaco.com port 11996: Connection refused] (10:37:27) -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs. amandad: debug 1 pid 17702 ruid 1002 euid 1002: start at Fri Aug 21 10:37:27 2009 Reading conf file /opt/amanda/etc/amanda-client.conf. amandad: time 0.000: security_getdriver(name=BSD) returns 0x298a9240 amandad: version 2.5.2p1 amandad: time 0.001: build: VERSION=Amanda-2.5.2p1 amandad: time 0.001:BUILT_DATE=Fri Aug 21 07:26:22 EDT 2009 amandad: time 0.001:BUILT_MACH=OpenBSD pb48 4.5 GENERIC#1749 i386 CC=gcc amandad: time 0.001:CONFIGURE_COMMAND='./configure' '--with-config=DailyDump' '--prefix=/opt/amanda' '--with-index-server=amanda.meadwestvaco.com' '--with-tape-server=amanda.meadwestvaco.com' '--with-tape-device=chg-multi' '--with-changer-device=chg-multi' '--with-user=amanda' '--with-group=operator' '--with-debugging' '--sbindir=/opt/amanda/sbin' '--libexecdir=/opt/amanda/libexec' '--libdir=/opt/amanda/lib' '--with-configdir=/opt/amanda/etc' '--with-gnutar-listdir=/opt/amanda/var/gnutar-lists' '--mandir=/opt/amanda/man' '--with-buffered-dump' '--without-rundump' amandad: time 0.001: paths: bindir=/opt/amanda/bin sbindir=/opt/amanda/sbin amandad: time 0.001:libexecdir=/opt/amanda/libexec mandir=/opt/amanda/man amandad: time 0.001:AMANDA_TMPDIR=/tmp/amanda AMANDA_DBGDIR=/tmp/amanda amandad: time 0.001:CONFIG_DIR=/opt/amanda/etc DEV_PREFIX=/dev/ amandad: time 0.001:RDEV_PREFIX=/dev/r DUMP=/sbin/dump amandad: time 0.001:RESTORE=/sbin/restore VDUMP=UNDEF VRESTORE=UNDEF amandad: time 0.001:XFSDUMP=UNDEF XFSRESTORE=UNDEF VXDUMP=UNDEF VXRESTORE=UNDEF amandad: time 0.001:SAMBA_CLIENT=/usr/local/bin/smbclient amandad: time 0.001:GNUTAR=/usr/local/bin/gtar COMPRESS_PATH=/usr/bin/gzip amandad: time 0.001:UNCOMPRESS_PATH=/usr/bin/gzip LPRCMD=/usr/bin/lpr amandad: time 0.001:MAILER=/usr/bin/Mail amandad: time 0.001:listed_incr_dir=/opt/amanda/var/gnutar-lists amandad: time 0.001: defs: DEFAULT_SERVER=amanda.meadwestvaco.com amandad: time 0.001:DEFAULT_CONFIG=DailyDump amandad: time 0.001:DEFAULT_TAPE_SERVER=amanda.meadwestvaco.com amandad: time 0.001:DEFAULT_TAPE_DEVICE=chg-multi HAVE_MMAP NEED_STRSTR amandad: time 0.001:HAVE_SYSVSHM LOCKING=**NONE** DEBUG_CODE AMANDA_DEBUG_DAYS=4 amandad: time 0.001:BSD_SECURITY RSH_SECURITY USE_AMANDAHOSTS amandad: time 0.001:CLIENT_LOGIN=amanda FORCE_USERID HAVE_GZIP amandad: time 0.001:COMPRESS_SUFFIX=.gz COMPRESS_FAST_OPT=--fast amandad: time 0.001:COMPRESS_BEST_OPT=--best UNCOMPRESS_OPT=-dc amandad: time 0.001: dgram_recv(dgram=0x298ad244, timeout=0, fromaddr=0x298bd230) amandad: time 0.001: (sockaddr_in *)0x298bd230 = { 2, 927, 10.209.129.22 } amandad: time 0.001: security_handleinit(handle=0x81bbf800, driver=0x298a9240 (BSD)) amandad: time 0.006: accept recv REQ pkt: SERVICE sendbackup OPTIONS features=9ffe7f;hostname=pb48;config=DailyDump; DUMP wd0f 1 2009:8:20:2:48:55 OPTIONS |;auth=BSD;index; amandad: time 0.006: creating new service: sendbackup OPTIONS features=9ffe7f;hostname=pb48;config=DailyDump; DUMP wd0f 1 2009:8:20:2:48:55 OPTIONS |;auth=BSD;index; amandad: time 0.007: sending ACK pkt: amandad: time 0.007: dgram_send_addr(addr=0x81bbf820, dgram=0x298ad244) amandad: time 0.007: (sockaddr_in *)0x81bbf820 = { 2, 927, 10.209.129.22 } amandad: time 0.007: dgram_send_addr: 0x298ad244-socket = 0 amandad: time 0.013: security_streaminit(stream=0x86b67000, driver=0x298a9240 (BSD)) amandad: time 0.013: try_socksize: send buffer size is 65536
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 08:06:46AM -0400, Jean-Louis Martineau wrote: stan wrote: Any thoughts as to why I have twice as many debug files as I expect? One thing I forgot to put in the previous message is that 10.209.129.22 is the Amanda Master machine's address. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
stan wrote at 10:56 -0400 on Aug 21, 2009: OK here is the latest on this saga :-) On one of the OpenBSD 4.5 machines I have built 2.5.0p1, and was able to back this machine up successfully (using classic UDP based authentication) On another of them, I built 2.5.2p1. The first attempt to back this machine up failed. I checked the log files, and found they were having issues because /etc/amdates was missing. I corrected that, and started a 2nd backup run. (Remember amcheck reports all is well with this machine). I got the following from amstatus when I attempted to back up this machine. Also remember, one of the test I ran with a 2.6.1 client was to connect a test machine directly to the client, using a crossover cable to eliminate any firewall, or router type issues. I am attaching, what I think is, the amadnad debug file associated with this failure. Can anyone suggest what I can do to further troubleshoot this? pb48:wd0f 1 dumper: [could not connect DATA stream: can't connect stream to pb48.meadwestvaco.com port 11996: Connection refused] (10:37:27) . . . amandad: time 30.019: stream_accept: timeout after 30 seconds amandad: time 30.019: security_stream_seterr(0x86b67000, can't accept new stream connection: No such file or directory) amandad: time 30.019: stream 0 accept failed: unknown protocol error amandad: time 30.019: security_stream_close(0x86b67000) amandad: time 60.027: stream_accept: timeout after 30 seconds amandad: time 60.027: security_stream_seterr(0x81212000, can't accept new stream connection: No such file or directory) amandad: time 60.027: stream 1 accept failed: unknown protocol error amandad: time 60.027: security_stream_close(0x81212000) amandad: time 90.035: stream_accept: timeout after 30 seconds amandad: time 90.036: security_stream_seterr(0x84877000, can't accept new stream connection: No such file or directory) amandad: time 90.036: stream 2 accept failed: unknown protocol error amandad: time 90.036: security_stream_close(0x84877000) amandad: time 90.036: security_close(handle=0x81bbf800, driver=0x298a9240 (BSD)) amandad: time 120.044: pid 17702 finish time Fri Aug 21 10:39:27 2009 For some reason the socket is not getting marked ready for read. select(2) is timing out waiting. Firewall setup perhaps? This bit of code in 2.5.2p1's common-src/stream.c is where the failure is happening for you... int stream_accept( int server_socket, int timeout, size_t sendsize, size_t recvsize) { SELECT_ARG_TYPE readset; struct timeval tv; int nfound, connected_socket; int save_errno; int ntries = 0; in_port_t port; assert(server_socket = 0); do { ntries++; memset(tv, 0, SIZEOF(tv)); tv.tv_sec = timeout; memset(readset, 0, SIZEOF(readset)); FD_ZERO(readset); FD_SET(server_socket, readset); nfound = select(server_socket+1, readset, NULL, NULL, tv); if(nfound = 0 || !FD_ISSET(server_socket, readset)) { save_errno = errno; if(nfound 0) { dbprintf((%s: stream_accept: select() failed: %s\n, debug_prefix_time(NULL), strerror(save_errno))); } else if(nfound == 0) { dbprintf((%s: stream_accept: timeout after %d second%s\n, debug_prefix_time(NULL), timeout, (timeout == 1) ? : s)); errno = ENOENT; /* ??? */ return -1;
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 21, 2009 at 09:57:36AM -0600, John Hein wrote: stan wrote at 10:56 -0400 on Aug 21, 2009: OK here is the latest on this saga :-) On one of the OpenBSD 4.5 machines I have built 2.5.0p1, and was able to back this machine up successfully (using classic UDP based authentication) On another of them, I built 2.5.2p1. The first attempt to back this machine up failed. I checked the log files, and found they were having issues because /etc/amdates was missing. I corrected that, and started a 2nd backup run. (Remember amcheck reports all is well with this machine). I got the following from amstatus when I attempted to back up this machine. Also remember, one of the test I ran with a 2.6.1 client was to connect a test machine directly to the client, using a crossover cable to eliminate any firewall, or router type issues. I am attaching, what I think is, the amadnad debug file associated with this failure. Can anyone suggest what I can do to further troubleshoot this? pb48:wd0f 1 dumper: [could not connect DATA stream: can't connect stream to pb48.meadwestvaco.com port 11996: Connection refused] (10:37:27) . . . amandad: time 30.019: stream_accept: timeout after 30 seconds amandad: time 30.019: security_stream_seterr(0x86b67000, can't accept new stream connection: No such file or directory) amandad: time 30.019: stream 0 accept failed: unknown protocol error amandad: time 30.019: security_stream_close(0x86b67000) amandad: time 60.027: stream_accept: timeout after 30 seconds amandad: time 60.027: security_stream_seterr(0x81212000, can't accept new stream connection: No such file or directory) amandad: time 60.027: stream 1 accept failed: unknown protocol error amandad: time 60.027: security_stream_close(0x81212000) amandad: time 90.035: stream_accept: timeout after 30 seconds amandad: time 90.036: security_stream_seterr(0x84877000, can't accept new stream connection: No such file or directory) amandad: time 90.036: stream 2 accept failed: unknown protocol error amandad: time 90.036: security_stream_close(0x84877000) amandad: time 90.036: security_close(handle=0x81bbf800, driver=0x298a9240 (BSD)) amandad: time 120.044: pid 17702 finish time Fri Aug 21 10:39:27 2009 For some reason the socket is not getting marked ready for read. select(2) is timing out waiting. Firewall setup perhaps? This bit of code in 2.5.2p1's common-src/stream.c is where the failure is happening for you... Interesting. Is this different in 2.5.0, than in 2.5.2 ? RE firewall: Always my first suspicion around here :-( But, I tride this with a 2.6.1 on OpenBSD 4.5 yesterday, which I had connected to the Amanda master machine via a crossover cable to eliminate thta, and the failure there _appeared_ to be identical to the original failure. I have noot tried this with a 2.5.2 client. I supose I should do that test.. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 21, 2009 at 09:57:36AM -0600, John Hein wrote: stan wrote at 10:56 -0400 on Aug 21, 2009: OK here is the latest on this saga :-) On one of the OpenBSD 4.5 machines I have built 2.5.0p1, and was able to back this machine up successfully (using classic UDP based authentication) On another of them, I built 2.5.2p1. The first attempt to back this machine up failed. I checked the log files, and found they were having issues because /etc/amdates was missing. I corrected that, and started a 2nd backup run. (Remember amcheck reports all is well with this machine). I got the following from amstatus when I attempted to back up this machine. Also remember, one of the test I ran with a 2.6.1 client was to connect a test machine directly to the client, using a crossover cable to eliminate any firewall, or router type issues. I am attaching, what I think is, the amadnad debug file associated with this failure. Can anyone suggest what I can do to further troubleshoot this? pb48:wd0f 1 dumper: [could not connect DATA stream: can't connect stream to pb48.meadwestvaco.com port 11996: Connection refused] (10:37:27) . . . amandad: time 30.019: stream_accept: timeout after 30 seconds amandad: time 30.019: security_stream_seterr(0x86b67000, can't accept new stream connection: No such file or directory) amandad: time 30.019: stream 0 accept failed: unknown protocol error amandad: time 30.019: security_stream_close(0x86b67000) amandad: time 60.027: stream_accept: timeout after 30 seconds amandad: time 60.027: security_stream_seterr(0x81212000, can't accept new stream connection: No such file or directory) amandad: time 60.027: stream 1 accept failed: unknown protocol error amandad: time 60.027: security_stream_close(0x81212000) amandad: time 90.035: stream_accept: timeout after 30 seconds amandad: time 90.036: security_stream_seterr(0x84877000, can't accept new stream connection: No such file or directory) amandad: time 90.036: stream 2 accept failed: unknown protocol error amandad: time 90.036: security_stream_close(0x84877000) amandad: time 90.036: security_close(handle=0x81bbf800, driver=0x298a9240 (BSD)) amandad: time 120.044: pid 17702 finish time Fri Aug 21 10:39:27 2009 For some reason the socket is not getting marked ready for read. select(2) is timing out waiting. Firewall setup perhaps? This bit of code in 2.5.2p1's common-src/stream.c is where the failure is happening for you... OK, I reproduced the failure with only a crossover cable between the test client and the Amanda Master: 192.168.1.2:wd0f 0 dumper: [could not connect DATA stream: can't connect stream to 192.168.1.2 port 24376: Connection refused] (13:48:23) Note the 192.168.1.2 address :-) This is with a 2.5.2p1 clinet on OpenBSD 4.5 2.5.0p1 works on this same machine/OS/netwrok configuration. So, it appears to me that this must be because of something that changed between 2.5.0p1 and 2.5.2p1. And we have a pretty good idea where in the code this is failing. So can anyone enlighten me as to what chaged in this area between those 2 versions? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
stan wrote at 13:56 -0400 on Aug 21, 2009: OK, I reproduced the failure with only a crossover cable between the test client and the Amanda Master: Just because you're using a crossover cable doesn't rule out firewall or other such socket level interference. I'm not saying that's your problem, but using a crossover cable doesn't rule it out. 192.168.1.2:wd0f 0 dumper: [could not connect DATA stream: can't connect stream to 192.168.1.2 port 24376: Connection refused] (13:48:23) Note the 192.168.1.2 address :-) This is with a 2.5.2p1 clinet on OpenBSD 4.5 2.5.0p1 works on this same machine/OS/netwrok configuration. So, it appears to me that this must be because of something that changed between 2.5.0p1 and 2.5.2p1. And we have a pretty good idea where in the code this is failing. So can anyone enlighten me as to what chaged in this area between those 2 versions? I haven't looked to see what changed between 2.5.0 and 2.5.2. It's all pretty basic socket stuff. I wouldn't be surprised if that is when the additional auth mechanisms (bsdudp, bsdtcp) were added. However, if no one chimes in, it's not that hard to look yourself. If you can narrow it down a bit to where there seems to be a problem in the code, the amanda-hackers@ list might be able to help more.
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 21, 2009 at 01:23:29PM -0600, John Hein wrote: stan wrote at 13:56 -0400 on Aug 21, 2009: OK, I reproduced the failure with only a crossover cable between the test client and the Amanda Master: Just because you're using a crossover cable doesn't rule out firewall or other such socket level interference. I'm not saying that's your problem, but using a crossover cable doesn't rule it out. Fair enough, but I think we can rule that out in this case. First, I control the 2 computersin question. The Master backus up 55 machines a day taht consist of a diverse nix of OS'es, and Amanda versions. It's a Linux machine, and I am not running a firwall on it. The client is an OpenBSD machine. I use these for my firewalls, so I am familiar with configuring firewalls on them, Doing so requires (as a minimum) running pf. I am not running pf on the test machine at all. I am not certain whether the sendbackup executable, or amandad opens the port on the client side. My guess is that it is not suceding in doing this. Thus when the Master tries to connect to it, there is no port for it to connect to. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
Stan's not alone on this one. I have two OpenBSD 4.5 machines also on Sun SPARC hardware. I had this same trouble a couple months ago with 2.6.1, but didn't have time to look deeper. After seeing this discussion, I built amanda-2.6.2alpha-20090812 (can't get 0820 to compile). I'm getting exactly the same errors. I tried dump with and without compression. My Amanda server is on Solaris 10, and I have a mix of Solaris, Linux, and OpenBSD systems. Except for the above-mentioned build, everything is at 2.6.1p1. There are no firewalls on the Amanda server or the OpenBSD clients. I know me too responses aren't usually helpful, but maybe this will rule out the firewall question, at least. On Fri, Aug 21, 2009 at 1:37 PM, stan st...@panix.com wrote: On Fri, Aug 21, 2009 at 01:23:29PM -0600, John Hein wrote: stan wrote at 13:56 -0400 on Aug 21, 2009: OK, I reproduced the failure with only a crossover cable between the test client and the Amanda Master: Just because you're using a crossover cable doesn't rule out firewall or other such socket level interference. I'm not saying that's your problem, but using a crossover cable doesn't rule it out. Fair enough, but I think we can rule that out in this case. First, I control the 2 computersin question. The Master backus up 55 machines a day taht consist of a diverse nix of OS'es, and Amanda versions. It's a Linux machine, and I am not running a firwall on it. The client is an OpenBSD machine. I use these for my firewalls, so I am familiar with configuring firewalls on them, Doing so requires (as a minimum) running pf. I am not running pf on the test machine at all. I am not certain whether the sendbackup executable, or amandad opens the port on the client side. My guess is that it is not suceding in doing this. Thus when the Master tries to connect to it, there is no port for it to connect to. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 10:39:45AM -0400, Jean-Louis Martineau wrote: stan wrote: I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? No, the same port are used if configured with the same port range. Do yousee something in the firewall log? Sorry to take so long to get back to everyone on this. I had one of my contractors decide he did not want to work, so I have been trying to do his job, and mine :-( In any case, today I took a crossover cable, and set up a direct 192.169 connection between my Amanda master machine and one of the OpenBSD machines in the lab. h behavior is exactly the same. That is, check passes, estimate gets done, and the dump fails in mid dump. It is interesting that the files on the dumpdisk are of varying sizes, so it's not like the dumps are not getting started. Now that we have eliminated the firewall issues. what can I do next to further diagnose this failure? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 10:39:45AM -0400, Jean-Louis Martineau wrote: stan wrote: I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? No, the same port are used if configured with the same port range. Do yousee something in the firewall log? Well, this is getting interesting. I have a working client. It's 2.5.0p1. And it's using BSD auth, rather than bsdtcp, and at the moment I am not using compression. But it works on OpenBSD 4.5. This was the version that was working on one of the OpenBSD 3,9 machines. I recompiled it for I386, and copied over the inetd.conf line, and it worked first time. I made an attempt to get 2.5.2p1 working, but things were not going well with that, and I decided to drop all the way back to a version that i had working on OpenBSD. I am not certain if 2.5.2p1 will work on OpenBSD 4,5 yet or not, as I was having problems getting it set up configuration wise, so I cannot assume that it has the same issue as 2.6.1 yet. So, this raises the question, what changed in sendbackup between 2.5.0p1, and 2.6.1? And given that this narrows the issue down to Amanda, and not the OS or firewalls etc, how can i further troubleshoot the prolem? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
man amanda.conf compress [client|server] string Default: client fast. Set it to none if you don't want compression. ig. compress none Jean-Louis stan wrote: On Fri, Aug 14, 2009 at 03:26:10PM -0400, Jean-Louis Martineau wrote: Check system log Post complete amandad.*.debug and sendbackup.*.debug. You can try to disable client compression. OK, I am very puzzled this morning. Some of the dum,ps failled on the OpenBSD machines last night. I had thought that I had set up dumptypes for them without compression. Here is what I did: define dumptype stan-test { global comment Non-root partitions on reasonably fast machines maxdumps 4 priority medium } And hee is the global def: define dumptype global { comment Global definitions # This is quite useful for setting global parameters, so you don't have # to type them everywhere. All dumptype definitions in this sample file # do include these definitions, either directly or indirectly. # There's nothing special about the name `global'; if you create any # dumptype that does not contain the word `global' or the name of any # other dumptype that contains it, these definitions won't apply. # Note that these definitions may be overridden in other # dumptypes, if the redefinitions appear *after* the `global' # dumptype name. # You may want to use this for globally enabling or disabling # indexing, recording, etc. Some examples: index yes record yes # split_diskbuffer /raid/amanda # fallback_splitsize 64m } And here are the DLE's for the machine in question: pblab wd0a stan-test 1 eth0 pblab wd0f stan-test 1 eth0 pblab wd0d stan-test 1 eth0 pblab wd0e stan-test 1 eth0 Should be no compression, right? But look at what the sendnackup debug files have to say: -rw-r- 1 amanda wheel 2865 Aug 17 22:52 sendbackup.20090817224841.debug -rw-r- 1 amanda wheel 2959 Aug 17 22:55 sendbackup.20090817225226.debug -rw-r- 1 amanda wheel 2851 Aug 17 22:56 sendbackup.20090817225558.debug -rw-r- 1 amanda wheel 3298 Aug 17 22:56 sendbackup.20090817225623.debug -rw-r- 1 amanda wheel 3195 Aug 17 22:56 sendbackup.20090817225638.debug -rw-r- 1 amanda wheel 3192 Aug 17 22:56 sendbackup.20090817225655.debug -rw-r- 1 amanda wheel 3190 Aug 17 22:57 sendbackup.20090817225728.debug -rw-r- 1 amanda wheel 3362 Aug 17 22:58 sendbackup.20090817225803.debug Let's start off with why are there twice as many as I expect, based on the DLE's? The look at this: $ grep -l gzip *backup* sendbackup.20090817224841.debug sendbackup.20090817225226.debug sendbackup.20090817225558.debug sendbackup.20090817225623.debug sendbackup.20090817225638.debug sendbackup.20090817225655.debug sendbackup.20090817225728.debug sendbackup.20090817225803.debug I will atach a tarball with these files I am very puzzled, at the moment. Thanks for any clarification. $
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 07:34:11AM -0400, Jean-Louis Martineau wrote: man amanda.conf compress [client|server] string Default: client fast. Set it to none if you don't want compression. ig. compress none Thanks, I was just going to write an email noting that I had found that, and done as you sugestd. I am starting atest abckup, as we speak. I should have results in an hour or 2. Any thoughts as to why I have twice as many debug files as I expect? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
stan wrote: Any thoughts as to why I have twice as many debug files as I expect? When a dump fail, amanda try it a second time.
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 08:06:46AM -0400, Jean-Louis Martineau wrote: stan wrote: Any thoughts as to why I have twice as many debug files as I expect? When a dump fail, amanda try it a second time. Got it. Now, hee are what I think are the pertinant lines from a dump this morning W/O compression: 250596583.437206: sendbackup: 91: normal(|): DUMP: Volume 1 started at: Tue Aug 18 07:56:23 2009 1250596583.481419: sendbackup: critical (fatal): index tee cannot write [Resourc e temporarily unavailable] 1250596583.484261: sendbackup: 116: strange(?): sendbackup: index tee cannot wri te [Resource temporarily unavailable] 1250596583.485035: sendbackup: 91: normal(|): DUMP: Broken pipe 1250596583.485826: sendbackup: 91: normal(|): DUMP: The ENTIRE dump is aborted. 1250596583.500258: sendbackup: critical (fatal): error [dump (10626) /sbin/dump returned 3] That's from a sendbackup debug file. Do you need me to send the entire debug files? What can I try to help further narow down the problem? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 17, 2009 at 11:30:51AM -0400, Jean-Louis Martineau wrote: And what is the error? It's probably not a gzip: ... One more data point that I need to clarify. Thes 2 OpenBSD machines I am fighting with did have older versions of Amanda on them, before we upgradedd. They had 2.5.0p1, and 2.5.2p1 on them respectivly. I live in a world where people with little understanding of our needs control the netwrk, and there are (IMHO) way to many firwall rules. Is it possible that the port utilizaton has chnaged between these versions and 2.6.1? I'd look at this with wiresharkm but at the moment, compiling that on the newer version of OpenBSD is itself an issue :-( Thanks for the ehlp on this. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
From the write man page: [EAGAIN] The file was marked for non-blocking I/O, and no data could be written immediately. But sendbackup or gzip write the index to a blocking pipe. Maybe it's the firewall that returns that error. You can try to switch to the 'bsdtcp' auth, it is a lot more firewall friendly. Jean-Louis stan wrote: On Tue, Aug 18, 2009 at 08:06:46AM -0400, Jean-Louis Martineau wrote: stan wrote: Any thoughts as to why I have twice as many debug files as I expect? When a dump fail, amanda try it a second time. Got it. Now, hee are what I think are the pertinant lines from a dump this morning W/O compression: 250596583.437206: sendbackup: 91: normal(|): DUMP: Volume 1 started at: Tue Aug 18 07:56:23 2009 1250596583.481419: sendbackup: critical (fatal): index tee cannot write [Resourc e temporarily unavailable] 1250596583.484261: sendbackup: 116: strange(?): sendbackup: index tee cannot wri te [Resource temporarily unavailable] 1250596583.485035: sendbackup: 91: normal(|): DUMP: Broken pipe 1250596583.485826: sendbackup: 91: normal(|): DUMP: The ENTIRE dump is aborted. 1250596583.500258: sendbackup: critical (fatal): error [dump (10626) /sbin/dump returned 3] That's from a sendbackup debug file. Do you need me to send the entire debug files? What can I try to help further narow down the problem?
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 09:32:21AM -0400, Jean-Louis Martineau wrote: From the write man page: [EAGAIN] The file was marked for non-blocking I/O, and no data could be written immediately. But sendbackup or gzip write the index to a blocking pipe. Maybe it's the firewall that returns that error. You can try to switch to the 'bsdtcp' auth, it is a lot more firewall friendly. Same result: 1250605398.423331: sendbackup: 91: normal(|): DUMP: Volume 1 started at: Tue Aug 18 10:23:18 2009 1250605398.435367: sendbackup: critical (fatal): index tee cannot write [Resourc e temporarily unavailable] 1250605398.437948: sendbackup: 116: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1250605398.439291: sendbackup: 91: normal(|): DUMP: Broken pipe 1250605398.440143: sendbackup: 91: normal(|): DUMP: The ENTIRE dump is aborted. 1250605398.440752: sendbackup: critical (fatal): error [dump (7497) /sbin/dump r eturned 3] I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 10:39:45AM -0400, Jean-Louis Martineau wrote: stan wrote: I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? No, the same port are used if configured with the same port range. Do yousee something in the firewall log? Geting access to that is um, chalenging. I may be better off setting up a network i have control of in the lab, and see if I can reproduce the problem there. After all, we are not _certain- it's a firewall issue ata all. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
stan wrote: I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? No, the same port are used if configured with the same port range. Do yousee something in the firewall log? Jean-Louis
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 17, 2009 at 4:32 PM, stanst...@panix.com wrote: On Tue, Aug 18, 2009 at 09:32:21AM -0400, Jean-Louis Martineau wrote: From the write man page: [EAGAIN] The file was marked for non-blocking I/O, and no data could be written immediately. But sendbackup or gzip write the index to a blocking pipe. Maybe it's the firewall that returns that error. You can try to switch to the 'bsdtcp' auth, it is a lot more firewall friendly. Same result: 1250605398.423331: sendbackup: 91: normal(|): DUMP: Volume 1 started at: Tue Aug 18 10:23:18 2009 1250605398.435367: sendbackup: critical (fatal): index tee cannot write [Resourc e temporarily unavailable] 1250605398.437948: sendbackup: 116: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1250605398.439291: sendbackup: 91: normal(|): DUMP: Broken pipe 1250605398.440143: sendbackup: 91: normal(|): DUMP: The ENTIRE dump is aborted. 1250605398.440752: sendbackup: critical (fatal): error [dump (7497) /sbin/dump r eturned 3] I know I have asked this before, but I can't recall getting a definative answer. Is port usage different between 2.5.x and 2.6 clients, when talkimng to a 2.6 server? i have no idea what else your system is doing. is it possible you're simply running out of file descriptors? pstat -T | grep open
Re: Backup issues with OpenBSD 4.5 machines
On Tue, Aug 18, 2009 at 01:52:56PM -0400, Dewey Hylton wrote: On Mon, Aug 17, 2009 at 4:32 PM, stanst...@panix.com wrote: On Tue, Aug 18, 2009 at 09:32:21AM -0400, Jean-Louis Martineau wrote: From the write man page: ? ? [EAGAIN] ? ? ? The file was marked for non-blocking I/O, and no data ? ? could ? ? ? ? ? ? ? ? ? ?be written immediately. But sendbackup or gzip write the index to a blocking pipe. Maybe it's the firewall that returns that error. You can try to switch to the 'bsdtcp' auth, it is a lot more firewall friendly. Same result: 1250605398.423331: sendbackup: ?91: ?normal(|): ? DUMP: Volume 1 started at: Tue ?Aug 18 10:23:18 2009 1250605398.435367: sendbackup: critical (fatal): index tee cannot write [Resourc e temporarily unavailable] 1250605398.437948: sendbackup: 116: strange(?): sendbackup: index tee cannot write [Resource temporarily unavailable] 1250605398.439291: sendbackup: ?91: ?normal(|): ? DUMP: Broken pipe 1250605398.440143: sendbackup: ?91: ?normal(|): ? DUMP: The ENTIRE dump is aborted. 1250605398.440752: sendbackup: critical (fatal): error [dump (7497) /sbin/dump r ?eturned 3] ?I know I have asked this before, but I can't recall getting a definative ?answer. Is port usage different between 2.5.x and 2.6 clients, when ?talkimng to a 2.6 server? i have no idea what else your system is doing. is it possible you're simply running out of file descriptors? pstat -T | grep open A useful thought, but it does not appear this is the problem $ pstat -T | grep open 169/7030 open files $ -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 14, 2009 at 03:26:10PM -0400, Jean-Louis Martineau wrote: Check system log Post complete amandad.*.debug and sendbackup.*.debug. You can try to disable client compression. It still fails with client side compression truned off. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
And what is the error? It's probably not a gzip: ... Jean-Louis stan wrote: On Fri, Aug 14, 2009 at 03:26:10PM -0400, Jean-Louis Martineau wrote: Check system log Post complete amandad.*.debug and sendbackup.*.debug. You can try to disable client compression. It still fails with client side compression truned off.
Re: Backup issues with OpenBSD 4.5 machines
On Mon, Aug 17, 2009 at 11:30:51AM -0400, Jean-Louis Martineau wrote: And what is the error? It's probably not a gzip: ... Interesting. I set up a test doing just one of these machine (4 DLE's), and it worked in non-compressed. I will leave a couple of the OpenBSD machines set fro non-compressed in tonights ru, and see what ahppens. Thanks. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
gzip: stdout: Resource temporarily unavailable I have no idea what's that error mean. Can you try the latest 2.6.1p1 snapshot from http://www.zmanda.com/community-builds.php Jean-Louis stan wrote: WE deployed 3 OpenBSD machines yesterday to replace older OpenBSD machines that had been backing up happily. I honestly cannot remember which version of Amanda was on these machines. The new ones have 2.6.1. The Amanda master machine is 2.6.1, and has been happily backing up 55 machines every night prior to this. At least one filessytem failed on each of the upgraded machines on last nights run. I manually started a run again this morning, and was able to reproduce the failure. I am attaching 2 sendbackup debug files from one of the machines. I am not quite certain why there are 2 files. I deleted all the Amanda debug files, prior to doing the manual run, so I so I expected just to have one debug file for a given filessytem, but found 2. What can I do to further debug this?
Re: Backup issues with OpenBSD 4.5 machines
On Fri, Aug 14, 2009 at 09:50:22AM -0400, Jean-Louis Martineau wrote: gzip: stdout: Resource temporarily unavailable I have no idea what's that error mean. Can you try the latest 2.6.1p1 snapshot from http://www.zmanda.com/community-builds.php Jean-Louis OK, I compiled and installed that version. Re-ran a backup, and the results were similar. A total of 2 of the 12 filesystems on 3 machines suceded, the rest were PARTIAL. Debuging sugestions welcome. -- One of the main causes of the fall of the roman empire was that, lacking zero, they had no way to indicate successful termination of their C programs.
Re: Backup issues with OpenBSD 4.5 machines
Check system log Post complete amandad.*.debug and sendbackup.*.debug. You can try to disable client compression. Jean-Louis stan wrote: On Fri, Aug 14, 2009 at 09:50:22AM -0400, Jean-Louis Martineau wrote: gzip: stdout: Resource temporarily unavailable I have no idea what's that error mean. Can you try the latest 2.6.1p1 snapshot from http://www.zmanda.com/community-builds.php Jean-Louis OK, I compiled and installed that version. Re-ran a backup, and the results were similar. A total of 2 of the 12 filesystems on 3 machines suceded, the rest were PARTIAL. Debuging sugestions welcome.