On Fri, Sep 17, 2010 at 02:02:10PM -0500, Les Mikesell wrote:
> On 9/17/2010 1:10 PM, John Rouillard wrote:
> > I mention this since there seem to have been a few other mentions of
> > hangs over the years and this may get somebody past the problem.
> >
> > Also I am hoping that somebody can figure out what is happening
> > here. It seems that some state in the prior (reference) backup is
> > causing the rsync protocol to stall.
> >
> > So anybody with a bright idea of what I can try looking at?
> 
> I think most of the stalls were either buggy cygwin/windows versions or 
> some stateful firewall/nat network device in the path that time out and 
> break the connection between devices in the long idle times you might 
> have in a backup run with mostly-identical files.   If neither of these 
> are possible, maybe you have filesystem corruption of some kind.

All good ideas. They also remind me that I forgot to supply some info
this time around.

This is CentOS 5.5 to CentOS 5.5 with kernel 2.6.18-194.3.1.el5 on the
server and 2.6.16-xenU on the client. However this has happened in the
past with the same 2.6.18 kernel on both (real) boxes. Also this
eliminates the whole Windows morass.

The backup runs over a VPN w/o any firewalls/nat. Also we have
ServerAlive messages enabled every 30 seconds for the ssh session
(because the routes to some of the hosts we back up do have stateful
firewalls in place). I can see ssh traffic using tcpdump when the
rsync is stalled, which tells me that the network/ssh layer is fine and
the rsync protocol is wacky.
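
For reference, the keepalive setup amounts to something like the sketch
below. This is only an illustration, not our actual config: the helper
name build_ssh_argv, the host name and the ServerAliveCountMax value of
3 are assumptions; the 30 second interval is the only value taken from
our setup.

```python
def build_ssh_argv(host, interval_s=30, count_max=3):
    """Return an ssh argument list with client-side keepalives enabled.

    ServerAliveInterval makes the ssh client send a keepalive through
    the encrypted channel after interval_s seconds of silence, which
    keeps stateful firewalls from expiring an otherwise idle session.
    count_max (an assumed value) is how many unanswered keepalives ssh
    tolerates before it drops the connection.
    """
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval_s}",
        "-o", f"ServerAliveCountMax={count_max}",
        host,
    ]


if __name__ == "__main__":
    # Hypothetical host name, for illustration only.
    print(" ".join(build_ssh_argv("client.example.com")))
```

These keepalives travel inside the ssh session itself, so seeing them
with tcpdump during a stall only shows that the transport is alive. It
says nothing about whether rsync on either end is still making
progress.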

Disk corruption isn't impossible. However the filesystem is a 4.5TB
ext3 on top of 2 software (md) raid 6 arrays with 7 disks that are
striped (raid 0) together. Forcing an fsck in the past hasn't turned
up any issues (but does take backups offline for a long bit 8-(). The
arrays are scrubbed weekly and disk selftests (using smartctl) are
done monthly.

When I have increased the logging level in the past to try to diagnose
this, no obvious errors popped up. It proceeded normally until it just
kind of stopped. Then there was the SIGALRM notice.

-- 
                                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

