Hi Everyone,

I have a couple of systems running opensolaris b118, one of which sends hourly snapshots to the other. This has been working well, however as of today, the receiving zfs process has started running extremely slowly, and is running at 100% CPU on one core, completely in kernel mode. A little bit of exploration with lockstat and dtrace seems to imply that the issue is around the "dbuf_free_range" function - or at least, that's what it looks like to my inexperienced eye!

The system is very unresponsive while this problem is occurring, with frequent multi-second delays between my typing into an ssh session and getting a response.

Has anyone seen this before, or know if there's anything I can do about it?

Lockstat says:

Adaptive mutex hold: 52299 events in 1.694 seconds (30873 events/sec)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller
   50  87%  87% 0.00 23410134 0xffffff0273d6dcb8     dbuf_free_range+0x268

      nsec ------ Time Distribution ------ count
  33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 50
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller
   53   0%  87% 0.00   117124 0xffffff025f404140     txg_rele_to_quiesce+0x18

      nsec ------ Time Distribution ------ count
    131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 53
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller
 2536   0%  88% 0.00     2194 0xffffff024f09bd40     kmem_cache_alloc+0x84

      nsec ------ Time Distribution ------ count
      2048 |@@@@@@@@@@@@@                  1121
      4096 |@@@@@@@@@@@@@@@@               1383
      8192 |                               0
     16384 |                               32
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller
  415   0%  88% 0.00     7968 0xffffff025120ce00     anon_resvmem+0xb4

[snip]

I modified the "procsystime" script from the dtrace toolkit to trace fbt:zfs::entry/return rather than syscalls, and I get an output like this (approximately a 5sec sample):
Function                        nsecs
l2arc_write_eligible           46155456
spa_last_synced_txg           49211711
zio_inherit_child_errors           50039132
      zil_commit           50137305
        buf_hash           50839858
       dbuf_rele           56388970
zio_wait_for_children           67533094
dbuf_update_data           70302317
dsl_dir_space_towrite           72179496
    parent_delta           77685327
       dbuf_hash           79075532
free_range_compar           82479196
 dbuf_free_range         2715449416
          TOTAL:         4156078240

Thanks,
    Tristan


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Reply via email to