Re: strange problem with FreeBSD 7.3 64bit

Kostik Belousov Fri, 10 Sep 2010 02:23:09 -0700

On Fri, Sep 10, 2010 at 10:45:08AM +0200, freebsd wrote:
> hi list,
> 
> we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64) 
> and now are experiencing some weird behaviour on 6 of them with rsnapshot:
> 
> after a few days to several weeks (seemingly at random),
> rsnapshot reports that it can't start because its lockfile and process
> are still present. on such boxes either a zombie rm or find process
> (presumably launched by rsnapshot) can be found.
> if the backup was done to a separate partition (physical disks or RAIDs) 
> any access (ls, stat, fsck, etc) to the partition would kill the current 
> SSH session, creating a new zombie of the process one just started. 
> unmounting the affected partition would render the server completely 
> unresponsive and require a hardware reset.
> 
> when trying to restart, the machines wouldn't even shut down completely
> but hung somewhere after syncing buffers; only a hardware reset
> worked. after the reboot, those partitions were unmounted and fscked,
> after which the backups would work again until the next error occurred.
> 
> the hardware of both affected and unaffected systems is:
> 
> HP ProLiant DL380 G4
> HP ProLiant DL380 G5
> HP ProLiant DL360 G5
> 
> there is no visible pattern between affected and unaffected boxes. also 
> those machines were upgraded the exact same way, running identical 
> kernels (more or less GENERIC, with QUOTA activated).
> 
> we upgraded the most critical boxes, which showed that behaviour
> daily, to 8.0-RELEASE, and the behaviour has not reappeared in
> nearly 3 months.
> 
> we installed a debug kernel on an affected box, but the machine wouldn't
> panic when the error occurred. when trying to unmount the affected
> partition it just went completely unresponsive, as mentioned above.
> 
> before trying to unmount procstat -ak showed some processes with 
> VOP_LOCK1_APV:
> 
> 55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire 
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup 
> vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
> 70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire 
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf 
> ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
> 
> since this hardware worked before 7.3 and -- as we assume --
> would work again with 8.*, we would be grateful for any hints as to
> what could be the cause of all this.
It sounds like a deadlock, but the cause cannot be identified without
further diagnostics. It might be the driver (ciss, I assume), but it may
be the quota code, or even something else.


Please follow the instructions at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
to obtain the required information.
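
While collecting that information, it can help to pick out which threads
are sleeping on a vnode lock in saved procstat -ak output. The following
is only a rough sketch and not part of the handbook procedure. It assumes
the layout seen in the traces quoted above (PID, TID, COMM, TDNAME, then
the stack frames), that each thread sits on a single line (mail-wrapped
lines must be rejoined first), and that command names contain no spaces.
The function name blocked_threads and the SAMPLE text are made up for
illustration.

```python
# Sketch: find threads in `procstat -ak` style output whose kernel stack
# contains a given frame (VOP_LOCK1_APV by default).
# Assumed line layout: PID TID COMM TDNAME FRAME FRAME ...

# Sample taken from the traces quoted above, rejoined onto single lines.
SAMPLE = """\
  PID    TID COMM             TDNAME           KSTACK
55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
"""


def blocked_threads(text, frame="VOP_LOCK1_APV"):
    """Return (pid, tid, comm, callers) for each thread whose stack has `frame`.

    `callers` lists the frames printed after `frame` on the line, i.e. the
    code path that requested the lock.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line not starting with numeric PID and TID.
        if len(fields) < 5 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        pid, tid, comm = int(fields[0]), int(fields[1]), fields[2]
        stack = fields[4:]  # fields[3] is the thread name column
        if frame in stack:
            callers = stack[stack.index(frame) + 1:]
            hits.append((pid, tid, comm, callers))
    return hits


if __name__ == "__main__":
    for pid, tid, comm, callers in blocked_threads(SAMPLE):
        print(pid, tid, comm, "<-", " ".join(callers[:3]))
```

This only shows who is waiting, not who holds the lock; the lock owner
still has to come from the debugger output described at the URL above.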
