Hello,

I have observed a deadlock condition when using ZFS. We make heavy use of zfs send/zfs receive to keep a replica of a dataset on a remote machine, and it can be done at one-minute intervals. Maybe this is a somewhat atypical use of ZFS, but it looks like a great way to keep filesystem replicas once this is sorted out.


How to reproduce:

Set up two systems. A dataset with heavy I/O activity is replicated from the first to the second one. I used a dataset containing /usr/obj while running make buildworld.

Replicate the dataset from the first machine to the second one using an incremental send:

zfs send -i pool/data...@nminus1 pool/data...@n | ssh destination zfs receive -d pool
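
As a rough sketch, one replication cycle could be scripted along these lines. The function names, the argument order and the DRY_RUN variable are made up for illustration and are not part of our actual setup; only zfs snapshot, zfs send -i and zfs receive -d are the real commands. By default it only prints what it would run.

```shell
#!/bin/sh
# Hypothetical helper: print the incremental send/receive pipeline.
# Arguments: dataset, previous snapshot, new snapshot, destination host,
# destination pool.
replication_pipeline() {
    dataset=$1 prev=$2 cur=$3 host=$4 pool=$5
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive -d %s\n' \
        "$dataset" "$prev" "$dataset" "$cur" "$host" "$pool"
}

# One replication cycle: take the new snapshot, then send the increment.
# DRY_RUN defaults to 1 (an assumption of this sketch), so the commands
# are only printed unless DRY_RUN=0 is set explicitly.
replicate_once() {
    dataset=$1 prev=$2 cur=$3 host=$4 pool=$5
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "zfs snapshot $dataset@$cur"
        replication_pipeline "$dataset" "$prev" "$cur" "$host" "$pool"
    else
        zfs snapshot "$dataset@$cur" &&
        zfs send -i "$dataset@$prev" "$dataset@$cur" |
            ssh "$host" zfs receive -d "$pool"
    fi
}
```

Calling replicate_once from cron or from a sleep loop once a minute, with the snapshot names advanced on each run, would give the one-minute replication interval mentioned above.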

A deadlock can occur when there is read activity on the replicated dataset on the second system while zfs receive is updating it. We discovered this while testing a server with 8 GB of RAM that we hope to put into production soon: a Bacula backup agent was running and ZFS deadlocked.
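
For anyone trying to reproduce this without Bacula, the read side can be approximated with a simple loop like the one below. This is only a sketch: the function names are invented, and find/cat is a stand-in for the backup agent (and for the bsdtar processes I used in the virtual machines), not what was actually running on the server. The directory argument would be wherever the replicated dataset is mounted on the second system.

```shell
#!/bin/sh
# Read every regular file under a directory once and print the number
# of bytes read. find/cat is used here as a generic read-load generator.
read_pass() {
    find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Repeat read passes a given number of times to keep read activity going
# while zfs receive is updating the dataset.
# Arguments: directory, number of passes.
read_load() {
    dir=$1 passes=$2 i=0
    while [ "$i" -lt "$passes" ]; do
        read_pass "$dir" >/dev/null
        i=$((i + 1))
    done
}
```

Running read_load against the replica's mount point in one terminal while the incremental receive runs in another should produce the overlapping read and receive activity described above.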

I have set up a couple of VMware Fusion virtual machines in order to test this, and it has deadlocked there as well. The virtual machines have little memory, 512 MB, but I don't believe this is the actual problem: there is no complaint about lack of memory.

A running top shows processes stuck on "zfsvfs":

last pid: 2051; load averages: 0.00, 0.07, 0.55 up 0+01:18:25 12:05:48
37 processes:  1 running, 36 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 18M Active, 20M Inact, 114M Wired, 40K Cache, 59M Buf, 327M Free
Swap: 1024M Total, 1024M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 1914 root        1  62    0 11932K  2564K zfsvfs  0   0:51  0.00% bsdtar
 1093 borjam      1  44    0  8304K  2464K CPU1    1   0:32  0.00% top
 1913 root        1  54    0 11932K  2600K rrl->r  0   0:19  0.00% bsdtar
 1019 root        1  44    0 25108K  4812K select  0   0:05  0.00% sshd
 2008 root        1  76    0 13600K  1904K tx->tx  0   0:04  0.00% zfs
 1089 borjam      1  44    0 37040K  5216K select  1   0:04  0.00% sshd
  995 root        1  76    0  8252K  2652K pause   0   0:02  0.00% csh
  840 root        1  44    0 11044K  3828K select  1   0:02  0.00% sendmail
 1086 root        1  76    0 37040K  5156K sbwait  1   0:01  0.00% sshd
  850 root        1  44    0  6920K  1612K nanslp  0   0:01  0.00% cron
  607 root        1  44    0  5992K  1540K select  1   0:01  0.00% syslogd
 1090 borjam      1  76    0  8252K  2636K pause   1   0:01  0.00% csh
  990 borjam      1  44    0 37040K  5220K select  0   0:00  0.00% sshd
  985 root        1  48    0 37040K  5160K sbwait  1   0:00  0.00% sshd
  911 root        1  44    0  8252K  2608K ttyin   0   0:00  0.00% csh
  991 borjam      1  56    0  8252K  2636K pause   0   0:00  0.00% csh
  844 smmsp       1  46    0 11044K  3852K pause   0   0:00  0.00% sendmail
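
The processes of interest in this listing are the ones sleeping on zfsvfs, rrl->r and tx->tx. Provided a shell is still responsive, they could be picked out non-interactively with something like the sketch below. The filter_wchan function is invented for this example; it assumes input lines of the form "pid wchan command", such as ps can produce with -o pid,wchan,comm.

```shell
#!/bin/sh
# Hypothetical filter: keep only the lines whose second field (the wait
# channel) is in a comma-separated list given as the first argument.
# Input is expected on stdin as "pid wchan command".
filter_wchan() {
    awk -v chans="$1" '
        BEGIN {
            n = split(chans, c, ",")
            for (i = 1; i <= n; i++) want[c[i]] = 1
        }
        ($2 in want) { print }
    '
}

# Example use (not run here):
#   ps -axo pid,wchan,comm | filter_wchan 'zfsvfs,rrl->r,tx->tx'
```

The resulting PIDs are the ones whose kernel state would presumably be worth inspecting first.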

Interestingly, this has blocked access to all the filesystems. I cannot, for instance, ssh into the machine anymore, even though all the system-critical filesystems are on UFS; I was only using ZFS for a test.

Any ideas on what information might be useful to collect? I have the VMware machine available right now. I've taken two VMware snapshots of it: the first just after the deadlock started, before breaking into DDB, and the second while in DDB (I broke into DDB with sysctl).

Also, a copy of the VMware virtual machine with snapshots is available on request. Your choice ;)

Borja.


_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
