Error redirecting stderr to fd 52

2009-08-24 Thread John Hein
On a 2.6.1b1 client ...

1251090802.506210: sendbackup: pid 61161 ruid 5001 euid 5001 version 2.6.1b1: 
start at Sun Aug 23 23:13:22 2009
1251090802.506278: sendbackup: Version 2.6.1b1
1251090802.511032: sendbackup: pid 61161 ruid 5001 euid 5001 version 2.6.1b1: 
rename at Sun Aug 23 23:13:22 2009
1251090802.511055: sendbackup:   sendbackup req: GNUTAR /data  1 
2009:8:21:6:55:50 OPTIONS 
|;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar;
1251090802.511102: sendbackup:   Parsed request as: program `GNUTAR'
1251090802.511109: sendbackup:  disk `/data'
1251090802.53: sendbackup:  device `/data'
1251090802.58: sendbackup:  level 1
1251090802.511123: sendbackup:  since 2009:8:21:6:55:50
1251090802.511128: sendbackup:  options 
`|;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar;'
1251090802.511203: sendbackup: Error redirecting stderr to fd 52: Bad file 
descriptor
1251090802.511215: sendbackup: pid 61161 finish time Sun Aug 23 23:13:22 2009


Has anyone ever seen that?

This is on a client with about a dozen DLEs with possibly 3
dumping in parallel at a time.  The DLE in question is
not small - certainly not so small to complete in 5 ms.

This DLE sometimes works.  Sometimes a different one fails the same
way.

It looks like the mesgfd in client-src/sendbackup.c is getting
closed before dup2(2) runs.  Perhaps a race.

amandad log (below) shows no obvious trouble other than it is continuing to do
work after the child sendbackup has failed (as seen in the log output above).

The two security_stream_close messages seem to be different than the
log messages associated with DLEs that worked.  The working ones have
three security_stream_close messages.  But, oddly, the DLE that worked
(and was small) right _before_ the failed DLE did _not_ have any
security_stream_close messages.  Possibly a clue.  In fact, it
seems all the failures are happening right after a small ( 10 MB),
and thus quick, dump.  Could just be a coincidence.

1251090802.228997: amandad: dgram_recv(dgram=0x280c2a04, timeout=0, 
fromaddr=0x280d29f0)
1251090802.229037: amandad: (sockaddr_in *)0x280d29f0 = { 2, 703, 
206.168.13.161 }
1251090802.229055: amandad: security_handleinit(handle=0x8052600, 
driver=0x280bc520 (BSD))
1251090802.235787: amandad: accept recv REQ pkt:

SERVICE sendbackup
OPTIONS features=9ffe00;hostname=bunny;config=test;
GNUTAR /data  1 2009:8:21:6:55:50 OPTIONS 
|;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar;

1251090802.237043: amandad: creating new service: sendbackup
OPTIONS features=9ffe00;hostname=bunny;config=test;
GNUTAR /data  1 2009:8:21:6:55:50 OPTIONS 
|;auth=bsd;compress-fast;index;exclude-list=/site/etc/amanda/exclude-gtar;

1251090802.237710: amandad: sending ACK pkt:


1251090802.237764: amandad: dgram_send_addr(addr=0x8052620, dgram=0x280c2a04)
1251090802.237772: amandad: (sockaddr_in *)0x8052620 = { 2, 703, 206.168.13.161 
}
1251090802.237779: amandad: dgram_send_addr: 0x280c2a04-socket = 0
1251090802.511364: amandad: security_streaminit(stream=0x81dd000, 
driver=0x280bc520 (BSD))
1251090802.511719: amandad: stream_server opening socket with family 2 
(requested family was 2)
1251090802.511736: amandad: try_socksize: send buffer size is 65536
1251090802.511743: amandad: try_socksize: receive buffer size is 65536
1251090802.512604: amandad: bind_portrange2: Try  port 6108: Available - Success
1251090802.512617: amandad: stream_server: waiting for connection: 0.0.0.0.6108
1251090802.512643: amandad: security_streaminit(stream=0x81e6000, 
driver=0x280bc520 (BSD))
1251090802.512658: amandad: stream_server opening socket with family 2 
(requested family was 2)
1251090802.512669: amandad: try_socksize: send buffer size is 65536
1251090802.512677: amandad: try_socksize: receive buffer size is 65536
1251090802.513496: amandad: bind_portrange2: Try  port 6108: Available - 
Address already in use
1251090802.514300: amandad: bind_portrange2: Try  port 6109: Available - Success
1251090802.514311: amandad: stream_server: waiting for connection: 0.0.0.0.6109
1251090802.514319: amandad: security_streaminit(stream=0x81ef000, 
driver=0x280bc520 (BSD))
1251090802.514333: amandad: stream_server opening socket with family 2 
(requested family was 2)
1251090802.514344: amandad: try_socksize: send buffer size is 65536
1251090802.514351: amandad: try_socksize: receive buffer size is 65536
1251090802.515189: amandad: bind_portrange2: Try  port 6108: Available - 
Address already in use
1251090802.515991: amandad: bind_portrange2: Try  port 6109: Available - 
Address already in use
1251090802.516772: amandad: bind_portrange2: Skip port 6110: Owned by softcm.
1251090802.517540: amandad: bind_portrange2: Skip port 6111: Owned by spc.
1251090802.518336: amandad: bind_portrange2: Try  port 6112: Available - Success

Re: Error redirecting stderr to fd 52

2009-08-24 Thread Jean-Louis Martineau

John Hein wrote:

On a 2.6.1b1 client ...
  

Hmm, beta software ...

It's not fixed in 2.6.1 neither in 2.6.1p1.

You must use the latest 2.6.1p1 snapshot from 
http://www.zmanda.com/community-builds.php


Jean-Louis


Re: Error redirecting stderr to fd 52

2009-08-24 Thread John Hein
Jean-Louis Martineau wrote at 11:48 -0400 on Aug 24, 2009:
  John Hein wrote:
   On a 2.6.1b1 client ...
 
  Hmm, beta software ...
  
  It's not fixed in 2.6.1 neither in 2.6.1p1.
  
  You must use the latest 2.6.1p1 snapshot from 
  http://www.zmanda.com/community-builds.php

Building a new version now.

Another interesting note.  On the client in question, amandad is still
running, but it shouldn't be.  It's got a couple unreaped zombies and
is waiting in select.

ps awwx -o user,pid,ppid,start,stat,wchan,command | grep backup
backup   37628 60010  4:43AM Z-  defunct
backup   37629 60010  4:43AM Z-  defunct
backup   60010 23993  7:01PM Ss   select amandad


lsof shows quite a few file descriptors still open.

COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFFNODE NAME
amandad 60010 backup0u  IPv4   0xc87f5438  0t0 UDP *:amanda
amandad 60010 backup1u  IPv4   0xc87f5438  0t0 UDP *:amanda
amandad 60010 backup2u  IPv4   0xc87f5438  0t0 UDP *:amanda
amandad 60010 backup4u  IPv4   0xcdd09bf4  0t0 UDP *:58068
amandad 60010 backup   10w  VREG 0,88   107863  331113 / -- 
amandad/amandad.20090823190102.debug
amandad 60010 backup   21u  PIPE   0xcb0064c816384 -0xcb006580
amandad 60010 backup   22u  PIPE   0xc916a99016384 -0xc916aa48
amandad 60010 backup   23u  PIPE   0xcd62999016384 -0xcd629a48
amandad 60010 backup   24u  PIPE   0xce8a899016384 -0xce8a8a48
amandad 60010 backup   28u  PIPE   0xc86b47f816384 -0xc86b48b0
amandad 60010 backup   29u  IPv4   0xce6f63a0  0t0 TCP *:6108 (LISTEN)
amandad 60010 backup   30u  PIPE   0xc8d2f33016384 -0xc8d2f3e8
amandad 60010 backup   31u  PIPE   0xc946619816384 -0xc9466250
amandad 60010 backup   33u  PIPE   0xca15c33016384 -0xca15c3e8
amandad 60010 backup   37u  IPv4   0t0 TCP no PCB, 
CANTSENDMORE, CANTRCVMORE
amandad 60010 backup   39u  PIPE   0xcdeb4cc016384 -0xcdeb4d78
amandad 60010 backup   40u  PIPE   0xc9c307f816384 -0xc9c308b0
amandad 60010 backup   42u  PIPE   0xc96317f816384 -0xc96318b0
amandad 60010 backup   44u  PIPE   0xc9c3066016384 -0xc9c30718
amandad 60010 backup   46u  PIPE   0xc9c1d33016384 -0xc9c1d3e8
amandad 60010 backup   47u  PIPE   0xc914e7180 -0xc914e660
amandad 60010 backup   48u  PIPE   0xcc4dd4c816384 -0xcc4dd580
amandad 60010 backup   49u  PIPE   0xc96be7f816384 -0xc96be8b0
amandad 60010 backup   50u  PIPE   0xce8a833016384 -0xce8a83e8
amandad 60010 backup   51u  PIPE   0xc914e5800 -0xc914e4c8
amandad 60010 backup   52u  PIPE   0xc914ecc016384 -0xc914ed78
amandad 60010 backup   60u  PIPE   0xc8e1033016384 -0xc8e103e8
amandad 60010 backup   61u  PIPE   0xcc37fb2816384 -0xcc37fbe0
amandad 60010 backup   62u  PIPE   0xce64966016384 -0xce649718
amandad 60010 backup   63u  PIPE   0xcaa0099016384 -0xcaa00a48
amandad 60010 backup   64u  PIPE   0xc912f66016384 -0xc912f718
amandad 60010 backup   65u  PIPE   0xc90f1cc016384 -0xc90f1d78

A few seconds of tracing the process shows..

 60010 amandad  RET   poll 0
 60010 amandad  CALL  gettimeofday(0xbfbfea88,0)
 60010 amandad  RET   gettimeofday 0
 60010 amandad  CALL  gettimeofday(0xbfbfea88,0)
 60010 amandad  RET   gettimeofday 0
 60010 amandad  CALL  poll(0x8052600,0x1,0x7530)