On Fri, Sep 17, 2010 at 02:02:10PM -0500, Les Mikesell wrote:
> On 9/17/2010 1:10 PM, John Rouillard wrote:
> > I mention this since there seem to have been a few other mentions of
> > hangs over the years and this may get somebody past the problem.
> >
> > Also I am hoping that somebody can figure out what is happening
> > here. It seems that some state in the prior (reference) backup is
> > causing the rsync protocol to stall.
> >
> > So anybody with a bright idea of what I can try looking at?
>
> I think most of the stalls were either buggy cygwin/windows versions or
> some stateful firewall/nat network device in the path that time out and
> break the connection between devices in the long idle times you might
> have in a backup run with mostly-identical files. If neither of these
> are possible, maybe you have filesystem corruption of some kind.

All good ideas. They also remind me that I forgot to supply some info
this time around.

This is CentOS 5.5 to CentOS 5.5, with kernel 2.6.18-194.3.1.el5 on the
server and 2.6.16-xenU on the client. However, this has happened in the
past with the same 2.6.18 kernel on both (real) boxes. This also
eliminates the whole Windows morass.

The backup runs over a VPN without any firewalls/NAT. We also have
ServerAlive messages enabled every 30 seconds for the ssh session
(because the routes to some of the hosts we back up do have stateful
firewalls in place). I can see ssh traffic using tcpdump while the
rsync is stalled, which tells me that the network/ssh layer is fine and
the rsync protocol is wacky.

Disk corruption isn't impossible. However, the filesystem is a 4.5TB
ext3 on top of 2 software (md) RAID 6 arrays with 7 disks that are
striped (RAID 0) together. Forcing an fsck in the past hasn't turned up
any issues (but does take backups offline for a long bit 8-(). The
arrays are scrubbed weekly and disk self-tests (using smartctl) are done
monthly.

When I have increased the logging level in the past to try to diagnose
this, no obvious errors popped up. It proceeded normally until it just
kind of stopped. Then there was the SIGALRM notice.

--
				-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-244-9084 (cell)
603-643-9300 x 111

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/