Error redirecting stderr to fd 52
On a 2.6.1b1 client ... 1251090802.506210: sendbackup: pid 61161 ruid 5001 euid 5001 version 2.6.1b1: start at Sun Aug 23 23:13:22 2009 1251090802.506278: sendbackup: Version 2.6.1b1 1251090802.511032: sendbackup: pid 61161 ruid 5001 euid 5001 version 2.6.1b1: rename at Sun Aug 23 23:13:22 2009 1251090802.511055: sendbackup: sendbackup req: GNUTAR /data 1 2009:8:21:6:55:50 OPTIONS |;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar; 1251090802.511102: sendbackup: Parsed request as: program `GNUTAR' 1251090802.511109: sendbackup: disk `/data' 1251090802.53: sendbackup: device `/data' 1251090802.58: sendbackup: level 1 1251090802.511123: sendbackup: since 2009:8:21:6:55:50 1251090802.511128: sendbackup: options `|;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar;' 1251090802.511203: sendbackup: Error redirecting stderr to fd 52: Bad file descriptor 1251090802.511215: sendbackup: pid 61161 finish time Sun Aug 23 23:13:22 2009 Has anyone ever seen that? This is on a client with about a dozen DLEs with possibly 3 dumping in parallel at a time. The DLE in question is not small - certainly not so small to complete in 5 ms. This DLE sometimes works. Sometimes a different one fails the same way. It looks like the mesgfd in client-src/sendbackup.c is getting closed before dup2(2) runs. Perhaps a race. amandad log (below) shows no obvious trouble other than it is continuing to do work after the child sendbackup has failed (as seen in the log output above). The two security_stream_close messages seem to be different than the log messages associated with DLEs that worked. The working ones have three security_stream_close messages. But, oddly, the DLE that worked (and was small) right _before_ the failed DLE did _not_ have any security_stream_close messages. Possibly a clue. In fact, it seems all the failures are happening right after a small ( 10 MB), and thus quick, dump. Could just be a coincidence. 1251090802.228997: amandad: dgram_recv(dgram=0x280c2a04, timeout=0, fromaddr=0x280d29f0) 1251090802.229037: amandad: (sockaddr_in *)0x280d29f0 = { 2, 703, 206.168.13.161 } 1251090802.229055: amandad: security_handleinit(handle=0x8052600, driver=0x280bc520 (BSD)) 1251090802.235787: amandad: accept recv REQ pkt: SERVICE sendbackup OPTIONS features=9ffe00;hostname=bunny;config=test; GNUTAR /data 1 2009:8:21:6:55:50 OPTIONS |;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar; 1251090802.237043: amandad: creating new service: sendbackup OPTIONS features=9ffe00;hostname=bunny;config=test; GNUTAR /data 1 2009:8:21:6:55:50 OPTIONS |;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar; 1251090802.237710: amandad: sending ACK pkt: 1251090802.237764: amandad: dgram_send_addr(addr=0x8052620, dgram=0x280c2a04) 1251090802.237772: amandad: (sockaddr_in *)0x8052620 = { 2, 703, 206.168.13.161 } 1251090802.237779: amandad: dgram_send_addr: 0x280c2a04-socket = 0 1251090802.511364: amandad: security_streaminit(stream=0x81dd000, driver=0x280bc520 (BSD)) 1251090802.511719: amandad: stream_server opening socket with family 2 (requested family was 2) 1251090802.511736: amandad: try_socksize: send buffer size is 65536 1251090802.511743: amandad: try_socksize: receive buffer size is 65536 1251090802.512604: amandad: bind_portrange2: Try port 6108: Available - Success 1251090802.512617: amandad: stream_server: waiting for connection: 0.0.0.0.6108 1251090802.512643: amandad: security_streaminit(stream=0x81e6000, driver=0x280bc520 (BSD)) 1251090802.512658: amandad: stream_server opening socket with family 2 (requested family was 2) 1251090802.512669: amandad: try_socksize: send buffer size is 65536 1251090802.512677: amandad: try_socksize: receive buffer size is 65536 1251090802.513496: amandad: bind_portrange2: Try port 6108: Available - Address already in use 1251090802.514300: amandad: bind_portrange2: Try port 6109: Available - Success 1251090802.514311: amandad: stream_server: waiting for connection: 0.0.0.0.6109 1251090802.514319: amandad: security_streaminit(stream=0x81ef000, driver=0x280bc520 (BSD)) 1251090802.514333: amandad: stream_server opening socket with family 2 (requested family was 2) 1251090802.514344: amandad: try_socksize: send buffer size is 65536 1251090802.514351: amandad: try_socksize: receive buffer size is 65536 1251090802.515189: amandad: bind_portrange2: Try port 6108: Available - Address already in use 1251090802.515991: amandad: bind_portrange2: Try port 6109: Available - Address already in use 1251090802.516772: amandad: bind_portrange2: Skip port 6110: Owned by softcm. 1251090802.517540: amandad: bind_portrange2: Skip port 6111: Owned by spc. 1251090802.518336: amandad: bind_portrange2: Try port 6112: Available - Success
Re: Error redirecting stderr to fd 52
John Hein wrote: On a 2.6.1b1 client ... Hmm, beta software ... It's not fixed in 2.6.1 neither in 2.6.1p1. You must use the latest 2.6.1p1 snapshot from http://www.zmanda.com/community-builds.php Jean-Louis
Re: Error redirecting stderr to fd 52
Jean-Louis Martineau wrote at 11:48 -0400 on Aug 24, 2009: John Hein wrote: On a 2.6.1b1 client ... Hmm, beta software ... It's not fixed in 2.6.1 neither in 2.6.1p1. You must use the latest 2.6.1p1 snapshot from http://www.zmanda.com/community-builds.php Building a new version now. Another interesting note. On the client in question, amandad is still running, but it shouldn't be. It's got a couple unreaped zombies and is waiting in select. ps awwx -o user,pid,ppid,start,stat,wchan,command | grep backup backup 37628 60010 4:43AM Z- defunct backup 37629 60010 4:43AM Z- defunct backup 60010 23993 7:01PM Ss select amandad lsof shows quite a few file descriptors still open. COMMAND PID USER FD TYPE DEVICE SIZE/OFFNODE NAME amandad 60010 backup0u IPv4 0xc87f5438 0t0 UDP *:amanda amandad 60010 backup1u IPv4 0xc87f5438 0t0 UDP *:amanda amandad 60010 backup2u IPv4 0xc87f5438 0t0 UDP *:amanda amandad 60010 backup4u IPv4 0xcdd09bf4 0t0 UDP *:58068 amandad 60010 backup 10w VREG 0,88 107863 331113 / -- amandad/amandad.20090823190102.debug amandad 60010 backup 21u PIPE 0xcb0064c816384 -0xcb006580 amandad 60010 backup 22u PIPE 0xc916a99016384 -0xc916aa48 amandad 60010 backup 23u PIPE 0xcd62999016384 -0xcd629a48 amandad 60010 backup 24u PIPE 0xce8a899016384 -0xce8a8a48 amandad 60010 backup 28u PIPE 0xc86b47f816384 -0xc86b48b0 amandad 60010 backup 29u IPv4 0xce6f63a0 0t0 TCP *:6108 (LISTEN) amandad 60010 backup 30u PIPE 0xc8d2f33016384 -0xc8d2f3e8 amandad 60010 backup 31u PIPE 0xc946619816384 -0xc9466250 amandad 60010 backup 33u PIPE 0xca15c33016384 -0xca15c3e8 amandad 60010 backup 37u IPv4 0t0 TCP no PCB, CANTSENDMORE, CANTRCVMORE amandad 60010 backup 39u PIPE 0xcdeb4cc016384 -0xcdeb4d78 amandad 60010 backup 40u PIPE 0xc9c307f816384 -0xc9c308b0 amandad 60010 backup 42u PIPE 0xc96317f816384 -0xc96318b0 amandad 60010 backup 44u PIPE 0xc9c3066016384 -0xc9c30718 amandad 60010 backup 46u PIPE 0xc9c1d33016384 -0xc9c1d3e8 amandad 60010 backup 47u PIPE 0xc914e7180 -0xc914e660 amandad 60010 backup 48u PIPE 0xcc4dd4c816384 -0xcc4dd580 amandad 60010 backup 49u PIPE 0xc96be7f816384 -0xc96be8b0 amandad 60010 backup 50u PIPE 0xce8a833016384 -0xce8a83e8 amandad 60010 backup 51u PIPE 0xc914e5800 -0xc914e4c8 amandad 60010 backup 52u PIPE 0xc914ecc016384 -0xc914ed78 amandad 60010 backup 60u PIPE 0xc8e1033016384 -0xc8e103e8 amandad 60010 backup 61u PIPE 0xcc37fb2816384 -0xcc37fbe0 amandad 60010 backup 62u PIPE 0xce64966016384 -0xce649718 amandad 60010 backup 63u PIPE 0xcaa0099016384 -0xcaa00a48 amandad 60010 backup 64u PIPE 0xc912f66016384 -0xc912f718 amandad 60010 backup 65u PIPE 0xc90f1cc016384 -0xc90f1d78 A few seconds of tracing the process shows.. 60010 amandad RET poll 0 60010 amandad CALL gettimeofday(0xbfbfea88,0) 60010 amandad RET gettimeofday 0 60010 amandad CALL gettimeofday(0xbfbfea88,0) 60010 amandad RET gettimeofday 0 60010 amandad CALL poll(0x8052600,0x1,0x7530)