On Fri, Sep 10, 2010 at 10:45:08AM +0200, freebsd wrote:
> hi list,
>
> we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64)
> and now are experiencing some weird behaviour on 6 of them with rsnapshot:
>
> after a few days/several weeks (seems to be completely random),
> rsnapshot reports that it can't start due to its lockfile and process
> still being present. on such boxes either a zombie rm or find process
> (which presumably were launched by rsnapshot) can be found.
> if the backup was done to a separate partition (physical disks or RAIDs)
> any access (ls, stat, fsck, etc) to the partition would kill the current
> SSH session, creating a new zombie of the process one just started.
> unmounting the affected partition would render the server completely
> unresponsive and required a hardware reset.
>
> when trying to restart, the machines wouldn't even shut down completely
> but hung somewhere after syncing buffers, only a hardware reset
> worked. after the reboot, those partitions were unmounted and fscked.
> after which the backups would work again until the next error happened
> again.
>
> the hardware of affected and unaffected systems is:
>
> HP ProLiant DL380 G4
> HP ProLiant DL380 G5
> HP ProLiant DL360 G5
>
> there is no visible pattern between affected and unaffected boxes. also
> those machines were upgraded the exact same way, running identical
> kernels (more or less GENERIC, with QUOTA activated).
>
> we upgraded the most critical boxes which showed that behaviour on a
> daily interval to 8.0-RELEASE and ever since this behavior has
> disappeared for nearly 3 months now.
>
> we installed a debug-kernel on an affected box, but the machine wouldn't
> panic when the error occurred. when trying to unmount the affected
> partition it just went completely unresponsive, as mentioned above.
>
> before trying to unmount procstat -ak showed some processes with
> VOP_LOCK1_APV:
>
> 55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup
> vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
> 70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf
> ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
>
> since this hardware has been working before 7.3 and -- as we assume --
> would work again with 8.*, we would be grateful for any hints what could
> be the cause of all this.

It sounds like a deadlock, but the cause cannot be identified without
further diagnostics. It might be the driver (ciss, I assume), the quota
code, or something else entirely.
Please follow the instructions at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
to obtain the required information.
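
While collecting that information, it may help to pick out the threads
that are already sleeping on a vnode lock. The following is only a rough
sketch, not part of the handbook procedure: it assumes the
"PID TID COMM TDNAME frames..." layout of the procstat -ak lines quoted
above, and the function name blocked_threads and the LOCK_FRAMES set are
made up for illustration.

```python
# Frames taken from the quoted stack traces; treating them as a sign of a
# thread waiting on a vnode lock is an assumption of this sketch.
LOCK_FRAMES = {"_lockmgr", "VOP_LOCK1_APV", "_vn_lock"}


def blocked_threads(lines):
    """Yield (pid, tid, command, frames) for threads whose kernel stack
    contains one of LOCK_FRAMES."""
    for line in lines:
        fields = line.split()
        # Expected layout: PID TID COMM TDNAME frame...
        # The header line and short or malformed lines are skipped.
        if len(fields) < 5:
            continue
        if not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        frames = fields[4:]
        if LOCK_FRAMES.intersection(frames):
            yield int(fields[0]), int(fields[1]), fields[2], frames


# One of the traces from the report, joined onto a single line.
sample = [
    "55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep "
    "acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup",
]
for pid, tid, comm, frames in blocked_threads(sample):
    print(pid, tid, comm)
```

To use it on a live system, the output of procstat -ak could be saved to
a file and its lines passed to blocked_threads() in place of the sample.
This only narrows down which threads to look at; it does not show which
thread holds the lock, which is what the DDB output described in the
handbook is for.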
