Jason Dixon wrote:
> a "Packet size too big" error. The Director resides on a global zone in
> Solaris x86. I've managed to capture a truss during one of the
> failures:
> http://mirrors.omniti.com/bacula/bacula.truss

Very strange. Everything seems to be going normally:

14106/1: pollsys(0x08046EF0, 1, 0x00000000, 0x00000000) = 1
14106/1: fd=4 ev=POLLRDNORM rev=POLLRDNORM
14106/1: accept(4, 0x08047D90, 0x08047DA0, SOV_DEFAULT) = 5
14106/1: AF_INET name = 10.80.117.97 port = 40563
[...]
14106/67: read(5, "\0\0\0 ", 4) = 4
14106/67: read(5, " H e l l o D i r e c t".., 32) = 32

[Incoming connection from the director]

[...]

[The director tells the FD to back up /data/bacu<something>]

14106/67: so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 6
14106/67: setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, 0xFE65ECBC, 4, SOV_DEFAULT) = 0
14106/67: connect(6, 0x080FBEBC, 16, SOV_DEFAULT) = 0
14106/67: AF_INET name = 10.80.117.97 port = 9103

[FD opens connection to the SD]

14106/67: open64("/data/bacula/work/bacula.sql", O_RDONLY) = 7
14106/67: write(6, "\0\0\005 1 2 0", 9) = 9
14106/67: read(7, " - -\n - - P o s t g r".., 65536) = 65536
14106/67: write(6, "\001\0\0 - -\n - - P o".., 65540) = 65540
14106/67: read(7, " B J J F E L B J J F".., 65536) = 65536
14106/67: write(6, "\001\0\0 B J J F E L ".., 65540) = 65540

[The FD opens /data/bacula/work/bacula.sql and passes the contents to the SD]

[...]
14106/67: read(7, " 8 1 6 6 7\t 4 4 3\t 5 3".., 65536) = 65536
14106/67: write(6, "\001\0\0 8 1 6 6 7\t 4 4".., 65540) = 65540
14106/67: read(7, " i e A A C\t 2 Z q".., 65536) = 65536
14106/67: write(6, "\001\0\0 i e A A C".., 65540) = 65540
14106/67: read(7, " B J J 1 / q B G Q P L".., 65536) = 65536
14106/2: lwp_park(0xFE76EF2C, 0) (sleeping...)
14106/2: timeout: 29.999928355 sec
14106/68: pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000) = 0
14106/68: fd=6 ev=POLLRDNORM rev=0
14106/68: timeout: 5.000000000 sec
14106/67: write(6, 0x08114854, 65540) (sleeping...)
14106/68: pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000) = 1
14106/68: fd=6 ev=POLLRDNORM rev=POLLRDNORM
14106/68: timeout: 5.000000000 sec
14106/67: write(6, "\001\0\0 B J J 1 / q B".., 65540) = 65540
14106/67: read(7, " 6 w P q 8 2 V q 2 3 X n".., 65536) = 65536
14106/68: read(6, 0xFE55FF80, 4) Err#131 ECONNRESET
14106/68: lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
14106/68: lwp_exit()
14106/67: write(6, "\001\0\0 6 w P q 8 2 V q".., 65540) Err#32 EPIPE
14106/67: Received signal #13, SIGPIPE [ignored]

[This is where it goes wrong]

Just after halfway through the above, this happens:

14106/68: pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000) = 1
14106/68: fd=6 ev=POLLRDNORM rev=POLLRDNORM
14106/68: timeout: 5.000000000 sec

which indicates that a "normal" incoming event has occurred on file
descriptor 6, which is the connection to the SD. Three lines later:

14106/68: read(6, 0xFE55FF80, 4) Err#131 ECONNRESET

The FD attempts to read from the SD and gets "Connection reset by peer".
The FD's next write to the SD then fails with EPIPE.

From the job report you posted, it doesn't look like the SD is
crashing or restarting, nor is the machine rebooting. Something,
somewhere, is interfering with the connection between the FD and
the SD, though.

Sorry to say this, but you may have to truss the SD!

Allan

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
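
A note for anyone reading the trace: the 4-byte read that precedes the 32-byte "Hello" read, and the 65540-byte writes that follow each 65536-byte file read, are consistent with each message being sent as a 4-byte length header followed by the payload. The sketch below only illustrates that kind of framing. The big-endian signed header layout and the helper names frame and parse_header are assumptions made for illustration, not taken from the Bacula source.

```python
import struct

# Assumption: a 4-byte signed length in network (big-endian) byte order.
HEADER = struct.Struct("!i")


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length (hypothetical helper)."""
    return HEADER.pack(len(payload)) + payload


def parse_header(header: bytes) -> int:
    """Decode the 4-byte length header (hypothetical helper)."""
    (length,) = HEADER.unpack(header)
    return length


if __name__ == "__main__":
    # The header truss shows before the "Hello" message: three NULs and a space.
    print(parse_header(b"\x00\x00\x00 "))
    # A 65536-byte block of file data plus its header.
    print(len(frame(b"x" * 65536)))
```

Under that assumption, the header before the "Hello" message decodes to the 32 bytes read next, and a 65536-byte block plus its header gives the 65540 bytes seen in each write to the SD.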