Prakash, thanks for sharing your code.  I am glad I am not the only one who has 
hit the issue!
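To chase down Matt's question below about whether each pass is CPU- or i/o-bound, I plan to wrap the runs in time(1) and compare user+sys against wall-clock time. Here is a minimal, portable stand-in on a tiny throwaway tree (the real runs would of course be against /zdata_1/sr_2gig):

```shell
# Sketch: time two back-to-back 'find' passes.  If user+sys accounts
# for most of the real time, the pass is CPU-bound; a large gap means
# it is waiting on i/o.  A toy tree stands in for the 32M-file one.
d=$(mktemp -d)
mkdir -p "$d/a/b"
touch "$d/a/f1" "$d/a/b/f2"
time find "$d" | wc -l   # cold pass (5 entries in this toy tree)
time find "$d" | wc -l   # warm pass
rm -rf "$d"
```

On the real tree the interesting comparison is between the 17-minute and the 1-minute passes.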
I will look at the ZFS on Linux source tree for other changes as well.
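To start on Matt's arc-stats suggestion, here is a rough helper for diffing two snapshots of the counters around a run. The sample values below are invented just to show the diff; on the real box the two files would come from `sysctl kstat.zfs.misc.arcstats > before.txt` (and again into after.txt) around each 'find':

```shell
# Sketch: diff two arcstats snapshots to see which counters moved
# across a 'find' run.  The values here are made up for illustration;
# real snapshots would be captured with:
#   sysctl kstat.zfs.misc.arcstats > before.txt   # before the run
#   sysctl kstat.zfs.misc.arcstats > after.txt    # after the run
cat > before.txt <<'EOF'
kstat.zfs.misc.arcstats.hits: 1000
kstat.zfs.misc.arcstats.misses: 500
EOF
cat > after.txt <<'EOF'
kstat.zfs.misc.arcstats.hits: 4000
kstat.zfs.misc.arcstats.misses: 2500
EOF
# join on the counter name (sysctl output is already sorted),
# then print each counter's delta across the run
join before.txt after.txt | awk '{print $1, $3 - $2}'
# -> kstat.zfs.misc.arcstats.hits: 3000
#    kstat.zfs.misc.arcstats.misses: 2000
```

A low hits delta with a large misses delta on the second pass would point at the metadata being evicted between runs.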
=Lakhinder

> Date: Mon, 23 Jun 2014 13:46:32 -0700
> From: [email protected]
> To: [email protected]
> CC: [email protected]; [email protected]
> Subject: Re: [OpenZFS Developer] zfs metadata caching and back-to-back 'find' 
> commands
> 
> On Sun, Jun 22, 2014 at 08:14:01PM -0700, Matthew Ahrens wrote:
> > On Mon, Jun 16, 2014 at 12:42 PM, Lakhinder Walia <[email protected]> 
> > wrote:
> >      Hi.
> > 
> >      I have a FreeBSD 10.0-STABLE #0 r264414 based system with 32 GB RAM
> >      running ZFS/Geli.
> > 
> >      I would like to understand if the response pattern to these steps
> >      show a bug.
> > 
> >      Steps (details below):
> > 
> >      A. Boot the system. Create a large file, 5x the RAM, using dd.
> > 
> >      B. Run a 'find' on a zfs directory (tree) with approx 32 Million
> >      files.
> > 
> >      C. Run 'find' command _back-to-back_ for a total of 4 times.
> > 
> >      'find' response time goes down from:
> >       32 min -> 17 min -> 1+ min -> 19 seconds.
> >      (No other significant zfs i/o is going on at this time).
> > 
> >      Question:
> > 
> >      Q1. Why are the second and third back-to-back 'find' steps taking
> >      so long? The MFU should (AFAIK) have held on to the metadata after
> >      the first 'find' operation was over. Shouldn't the subsequent
> >      'find' runs have executed faster!?
> >      (There is no other i/o-intensive job running. No NFS clients are
> >      present in the tests, even though NFS sharing is configured.)
> > 
> > Have you looked at how much I/O and CPU are used in each of the 4 runs?
> > This would help identify if the problem is that we are not caching the
> > data in the ARC, vs. if some other data structure is being torn down and
> > re-created (e.g. vnode_t, znode_t, dnode_t).  I would guess that the
> > 17-minute run is I/O bound, but the 1-minute run might be CPU-bound.
> > 
> > If it is i/o bound, the next step would be to determine why the data is
> > not cached after the first run.  You might start by looking at the arc
> > stats before and after each run.
> > 
> > If it is CPU bound, a flame graph might quickly show the problem area:
> > http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
> > 
> > It's possible that after the first run we are not using enough of the ARC to
> > cache metadata (despite the fact that you've raised the metadata limit large
> > enough).  I recently did some work to address this issue, which is not yet
> > integrated to illumos or FreeBSD.  You can see it here:
> > 
> > http://reviews.csiden.org/r/50
> 
> > We've done work in ZoL to address this as well:
> 
>     https://github.com/zfsonlinux/zfs/commit/da8ccd0e
> 
> -- 
> Cheers, Prakash
> 
> >  
> >       
> >      Q2. I have another issue, though not detailed here. A ruby script
> >      doing a traversal of all the files via its 'Find' api gets to be
> >      very, very slow, particularly after a new 'find' job kicks in in
> >      parallel due to a default system security policy:
> >      /etc/periodic/security/110.neggrpperm. When I kill the second
> >      'find', the ruby script's 'Find'-based call still continues at its
> >      very slow pace, for hours. A script that is just listing files,
> >      which could have finished in two hours, instead takes approx. 10
> >      hours even if that is the only file i/o going on.
> > 
> > Are both 'find' commands looking at the same directory?  If not, the
> > new 'find' command could be displacing some of the cache entries needed
> > by the first command.
> > 
> > Steps to analyze this are similar to what I mentioned above -- determine
> > if it is CPU or i/o bound, then what is causing that.
> > 
> > 
> >      Q3. Could the two issues above possibly be related?
> > 
> > I'm not sure.
> > 
> > --matt
> >  
> > 
> >      Thanks,
> > 
> >      =Lakhinder
> >       
> > 
> > 
> >      Steps: ==============================================>
> > 
> >      #  Step A.  Freshly booted system.             
> >      root@fbsd_10:/zdata_1/xx #
> >      $ dd if=/dev/zero of=file_2 bs=16M count=10240; date  
> >      10240+0 records in
> >      10240+0 records out
> >      171798691840 bytes transferred in 192.188421 secs (893907609 bytes/sec)
> > 
> >      # Step B
> >      # cd over to another directory with approx 32 Million files
> >      # in its various subdirectories.
> >      root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
> >      Sun Jun 15 15:52:32 PDT 2014
> >       32673014 32673014 2378434006
> >      Sun Jun 15 16:24:22 PDT 2014
> >      # time taken is approx. 32 minutes.
> > 
> >      # Step C-1.
> >      root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
> >      Sun Jun 15 16:24:22 PDT 2014
> >       32673014 32673014 2378434006
> >      Sun Jun 15 16:41:56 PDT 2014
> >      # time taken is approx. 17 minutes.
> > 
> >      # Step C-2.
> >      root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
> >      Sun Jun 15 16:41:56 PDT 2014
> >       32673014 32673014 2378434006
> >      Sun Jun 15 16:43:07 PDT 2014
> >      # time taken is approx 1+ minute.
> > 
> >      # Step C-3.
> >      root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
> >      Sun Jun 15 16:43:07 PDT 2014
> >       32673014 32673014 2378434006
> >      Sun Jun 15 16:43:26 PDT 2014
> >      # time taken is 19 seconds.
> > 
> >      --------------- system details --------------------------------------
> > 
> >      OS version:
> >      FreeBSD fbsd_10 10.0-STABLE FreeBSD 10.0-STABLE #0 r264414:
> >      Sun Apr 13 19:33:56 UTC 2014    
> >      [email protected]:/usr/obj/usr/src/sys/GENERIC  amd64
> > 
> >      cat /boot/loader.conf:
> >      aesni_load="YES"
> >      geli_load="YES"
> >      zfs_load="YES"
> >      vfs.zfs.arc_meta_limit=17179869184
> >      # 16 GB limit for ARC Meta.
> >      kern.maxvnodes=50000000
> > 
> > 
> >      Here is 'top' after running a _recent_ 'find'.
> > 
> >      last pid: 75231;  load averages:  0.48,  0.28,  0.33   up 0+20:20:26  11:58:50
> >      37 processes:  1 running, 36 sleeping
> >      CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
> >      Mem: 28M Active, 796K Inact, 30G Wired, 2620K Cache, 97M Buf, 911M Free
> >      ARC: 24G Total, 4756M MFU, 18G MRU, 16K Anon, 385M Header, 1308M Other
> >      Swap: 1638M Total, 41M Used, 1597M Free, 2% Inuse
> > 
> >        PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> >       2486 root          1  21    0 34980K  2236K nanslp  6   1:46   0.00% perl5.16.3
> >       2478 root          1  20    0 56300K 15720K select  3   1:14   0.00% tmux
> >       2483 root          1  20    0 19824K  1612K CPU5    5   0:47   0.00% top
> >       ****
> > 
> >      # after the _recent_ refresh with 'find':
> >      root@fbsd_10:/ # sysctl vfs.zfs
> > 
> >      vfs.zfs.arc_max: 32373334016
> >      vfs.zfs.arc_min: 8589934592
> >      vfs.zfs.arc_meta_used: 17180938800
> >      vfs.zfs.arc_meta_limit: 17179869184
> >      vfs.zfs.l2arc_write_max: 8388608
> >      vfs.zfs.l2arc_write_boost: 8388608
> >      vfs.zfs.l2arc_headroom: 2
> >      vfs.zfs.l2arc_feed_secs: 1
> >      vfs.zfs.l2arc_feed_min_ms: 200
> >      vfs.zfs.l2arc_noprefetch: 1
> >      vfs.zfs.l2arc_feed_again: 1
> >      vfs.zfs.l2arc_norw: 1
> >      vfs.zfs.anon_size: 16384
> >      vfs.zfs.anon_metadata_lsize: 0
> >      vfs.zfs.anon_data_lsize: 0
> >      vfs.zfs.mru_size: 18898130944
> >      vfs.zfs.mru_metadata_lsize: 10115874816
> >      vfs.zfs.mru_data_lsize: 8479965184
> >      vfs.zfs.mru_ghost_size: 6764165120
> >      vfs.zfs.mru_ghost_metadata_lsize: 6764165120
> >      vfs.zfs.mru_ghost_data_lsize: 0
> >      vfs.zfs.mfu_size: 4986851840
> >      vfs.zfs.mfu_metadata_lsize: 4631966208
> >      vfs.zfs.mfu_data_lsize: 0
> >      vfs.zfs.mfu_ghost_size: 2492798976
> >      vfs.zfs.mfu_ghost_metadata_lsize: 2492798976
> >      vfs.zfs.mfu_ghost_data_lsize: 0
> >      vfs.zfs.l2c_only_size: 0
> >      vfs.zfs.dedup.prefetch: 1
> >      vfs.zfs.nopwrite_enabled: 1
> >      vfs.zfs.mdcomp_disable: 0
> >      vfs.zfs.prefetch_disable: 0
> >      vfs.zfs.zfetch.max_streams: 8
> >      vfs.zfs.zfetch.min_sec_reap: 2
> >      vfs.zfs.zfetch.block_cap: 256
> >      vfs.zfs.zfetch.array_rd_sz: 1048576
> >      vfs.zfs.top_maxinflight: 32
> >      vfs.zfs.resilver_delay: 2
> >      vfs.zfs.scrub_delay: 4
> >      vfs.zfs.scan_idle: 50
> >      vfs.zfs.scan_min_time_ms: 1000
> >      vfs.zfs.free_min_time_ms: 1000
> >      vfs.zfs.resilver_min_time_ms: 3000
> >      vfs.zfs.no_scrub_io: 0
> >      vfs.zfs.no_scrub_prefetch: 0
> >      vfs.zfs.metaslab.gang_bang: 131073
> >      vfs.zfs.metaslab.debug_load: 0
> >      vfs.zfs.metaslab.debug_unload: 0
> >      vfs.zfs.metaslab.df_alloc_threshold: 131072
> >      vfs.zfs.metaslab.df_free_pct: 4
> >      vfs.zfs.metaslab.min_alloc_size: 10485760
> >      vfs.zfs.metaslab.load_pct: 50
> >      vfs.zfs.metaslab.unload_delay: 8
> >      vfs.zfs.metaslab.preload_limit: 3
> >      vfs.zfs.metaslab.preload_enabled: 1
> >      vfs.zfs.metaslab.weight_factor_enable: 0
> >      vfs.zfs.condense_pct: 200
> >      vfs.zfs.mg_alloc_failures: 12
> >      vfs.zfs.mg_noalloc_threshold: 0
> >      vfs.zfs.write_to_degraded: 0
> >      vfs.zfs.check_hostid: 1
> >      vfs.zfs.recover: 0
> >      vfs.zfs.deadman_synctime_ms: 1000000
> >      vfs.zfs.deadman_checktime_ms: 5000
> >      vfs.zfs.deadman_enabled: 0
> >      vfs.zfs.spa_asize_inflation: 24
> >      vfs.zfs.txg.timeout: 5
> >      vfs.zfs.vdev.cache.max: 16384
> >      vfs.zfs.vdev.cache.size: 0
> >      vfs.zfs.vdev.cache.bshift: 16
> >      vfs.zfs.vdev.trim_on_init: 1
> >      vfs.zfs.vdev.max_active: 1000
> >      vfs.zfs.vdev.sync_read_min_active: 10
> >      vfs.zfs.vdev.sync_read_max_active: 10
> >      vfs.zfs.vdev.sync_write_min_active: 10
> >      vfs.zfs.vdev.sync_write_max_active: 10
> >      vfs.zfs.vdev.async_read_min_active: 1
> >      vfs.zfs.vdev.async_read_max_active: 3
> >      vfs.zfs.vdev.async_write_min_active: 1
> >      vfs.zfs.vdev.async_write_max_active: 10
> >      vfs.zfs.vdev.scrub_min_active: 1
> >      vfs.zfs.vdev.scrub_max_active: 2
> >      vfs.zfs.vdev.aggregation_limit: 131072
> >      vfs.zfs.vdev.read_gap_limit: 32768
> >      vfs.zfs.vdev.write_gap_limit: 4096
> >      vfs.zfs.vdev.bio_flush_disable: 0
> >      vfs.zfs.vdev.bio_delete_disable: 0
> >      vfs.zfs.vdev.trim_max_bytes: 2147483648
> >      vfs.zfs.vdev.trim_max_pending: 64
> >      vfs.zfs.max_auto_ashift: 13
> >      vfs.zfs.zil_replay_disable: 0
> >      vfs.zfs.cache_flush_disable: 0
> >      vfs.zfs.zio.use_uma: 1
> >      vfs.zfs.zio.exclude_metadata: 0
> >      vfs.zfs.sync_pass_deferred_free: 2
> >      vfs.zfs.sync_pass_dont_compress: 5
> >      vfs.zfs.sync_pass_rewrite: 2
> >      vfs.zfs.snapshot_list_prefetch: 0
> >      vfs.zfs.super_owner: 0
> >      vfs.zfs.debug: 0
> >      vfs.zfs.version.ioctl: 3
> >      vfs.zfs.version.acl: 1
> >      vfs.zfs.version.spa: 5000
> >      vfs.zfs.version.zpl: 5
> >      vfs.zfs.trim.enabled: 1
> >      vfs.zfs.trim.txg_delay: 32
> >      vfs.zfs.trim.timeout: 30
> >      vfs.zfs.trim.max_interval: 1
> > 
> >      _______________________________________________
> >      developer mailing list
> >      [email protected]
> >      http://lists.open-zfs.org/mailman/listinfo/developer
> > 
> > 
> 
> 
                                          