Matt,

Thanks for your help. There is not much CPU utilized in any of these runs; the CPUs are essentially 95% idle!
The I/O is not busy either, nowhere near the sustained write throughput of 900 MB/sec seen with 'dd'. You are right: the subsequent 1-minute runs are basically a 100% ARC hit, and the arcstats.pl 'read' column hits 50K+. Compared to this, the initial 30-minute 'find' run was going at a sub-1K reads/sec rate.

I can throw a larger metadata cache allocation at this problem, but that won't solve the issue. For example, I upped the metadata cache allocation from 16 GB to 24 GB, with the total 32 GB of RAM unchanged. If I now run a _parallel_ 'dd' write, the (fully metadata-cached) 'find' degrades within seconds, from its 1-minute cycle to a very slow one. That tells me most of the metadata has been evicted, and the fully cached 'find' is back to its very slow performance.

Therefore, I think your code patch should help. I will try it soon. Thanks for sharing your code.

=Lakhinder

Date: Sun, 22 Jun 2014 20:14:01 -0700
Subject: Re: [OpenZFS Developer] zfs metadata caching and back-to-back 'find' commands
From: [email protected]
To: [email protected]
CC: [email protected]

On Mon, Jun 16, 2014 at 12:42 PM, Lakhinder Walia <[email protected]> wrote:

Hi. I have a FreeBSD 10.0-STABLE #0 r264414 based system with 32 GB RAM running ZFS/Geli. I would like to understand whether the response pattern to these steps shows a bug. Steps (details below):

A. Boot the system. Create a large file, 5x the RAM, using dd.
B. Run a 'find' on a ZFS directory tree with approx. 32 million files.
C. Run the 'find' command _back-to-back_ for a total of 4 times. The 'find' response time goes down from: 32 min -> 17 min -> 1+ min -> 17 seconds. (No other significant ZFS I/O is going on at this time.)

Question:

Q1. Why are the second and third back-to-back 'find' runs taking so long? The MFU should (AFAIK) have held on to the metadata after the first 'find' operation was over. Shouldn't the subsequent 'find' runs have executed faster? (There is no other I/O-intensive job running.
No NFS clients are present in the tests, even though NFS sharing is configured.)

Have you looked at how much I/O and CPU are used in each of the 4 runs? This would help identify whether the problem is that we are not caching the data in the ARC, vs. some other data structure being torn down and re-created (e.g. vnode_t, znode_t, dnode_t). I would guess that the 17-minute run is I/O-bound, but the 1-minute run might be CPU-bound.

If it is I/O-bound, the next step would be to determine why the data is not cached after the first run. You might start by looking at the ARC stats before and after each run. If it is CPU-bound, a flame graph might quickly show the problem area:
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

It's possible that after the first run we are not using enough of the ARC to cache metadata (despite the fact that you've raised the metadata limit high enough). I recently did some work to address this issue, which is not yet integrated into illumos or FreeBSD. You can see it here: http://reviews.csiden.org/r/50

Q2. I have another issue, though not detailed here. A Ruby script doing a traversal of all the files via its 'Find' API gets to be very slow, particularly after a new 'find' job kicks in in parallel, due to a default system security policy: /etc/periodic/security/110.neggrpperm. When I kill the second 'find', the Ruby script's 'Find'-based traversal _still_ continues at its very slow pace, for hours. A script that just lists files, and could have finished in two hours, instead takes approx. 10 hours, even if that is the only file I/O going on.

Are both 'find' commands looking at the same directory? If not, the new 'find' command could be displacing some of the cache entries needed by the first command. Steps to analyze this are similar to what I mentioned above: determine whether it is CPU-bound or I/O-bound, then what is causing that.

Q3. Could the two issues above possibly be related?

I'm not sure.
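(Editor's note: the before/after ARC comparison suggested above can be sketched as follows. The kstat sysctl names are the FreeBSD ones for the ARC hit/miss counters; the numeric values below are made-up stand-ins, so that the delta arithmetic is visible and runs anywhere. On the real system, each `before_*`/`after_*` assignment would instead be something like `$(sysctl -n kstat.zfs.misc.arcstats.hits)` captured around one 'find' run.)

```shell
# Snapshot the ARC hit/miss counters before a run...
# (real box: before_hits=$(sysctl -n kstat.zfs.misc.arcstats.hits), etc.)
before_hits=1000
before_misses=2000

# ... run the workload here, e.g.:  find /zdata_1/sr_2gig > /dev/null

# ... and snapshot them again afterwards (stub values for illustration):
after_hits=51000
after_misses=2500

# Deltas tell you how much of this run's metadata came from the ARC.
dh=$((after_hits - before_hits))
dm=$((after_misses - before_misses))
pct=$((100 * dh / (dh + dm)))   # integer hit percentage for the run
echo "hits=$dh misses=$dm hit_pct=${pct}%"
```

A run that is close to 100% hits but still slow would point away from I/O and toward a CPU or lock problem (flame-graph territory); a low hit percentage on a repeat run would point at ARC eviction.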
--matt

Thanks,
=Lakhinder

Steps:
==============================================

# Step A. Freshly booted system.
root@fbsd_10:/zdata_1/xx # dd if=/dev/zero of=file_2 bs=16M count=10240; date
10240+0 records in
10240+0 records out
171798691840 bytes transferred in 192.188421 secs (893907609 bytes/sec)

# Step B. cd over to another directory with approx. 32 million files
# in its various subdirectories.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 15:52:32 PDT 2014
32673014 32673014 2378434006
Sun Jun 15 16:24:22 PDT 2014
# time taken is approx. 32 minutes.

# Step C-1.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:24:22 PDT 2014
32673014 32673014 2378434006
Sun Jun 15 16:41:56 PDT 2014
# time taken is approx. 17 minutes.

# Step C-2.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:41:56 PDT 2014
32673014 32673014 2378434006
Sun Jun 15 16:43:07 PDT 2014
# time taken is approx. 1+ minute.

# Step C-3.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:43:07 PDT 2014
32673014 32673014 2378434006
Sun Jun 15 16:43:26 PDT 2014
# time taken is 19 seconds.

--------------- system details --------------------------------------

OS version: FreeBSD fbsd_10 10.0-STABLE FreeBSD 10.0-STABLE #0 r264414: Sun Apr 13 19:33:56 UTC 2014 [email protected]:/usr/obj/usr/src/sys/GENERIC amd64

cat /boot/loader.conf:
aesni_load="YES"
geli_load="YES"
zfs_load="YES"
vfs.zfs.arc_meta_limit=17179869184   # 16 GB limit for ARC metadata.
kern.maxvnodes=50000000

Here is 'top' after running a _recent_ 'find'.
last pid: 75231;  load averages: 0.48, 0.28, 0.33    up 0+20:20:26  11:58:50
37 processes: 1 running, 36 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 28M Active, 796K Inact, 30G Wired, 2620K Cache, 97M Buf, 911M Free
ARC: 24G Total, 4756M MFU, 18G MRU, 16K Anon, 385M Header, 1308M Other
Swap: 1638M Total, 41M Used, 1597M Free, 2% Inuse

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 2486 root        1  21    0 34980K  2236K nanslp  6   1:46  0.00% perl5.16.3
 2478 root        1  20    0 56300K 15720K select  3   1:14  0.00% tmux
 2483 root        1  20    0 19824K  1612K CPU5    5   0:47  0.00% top

****
# after the _recent_ refresh with 'find':
root@fbsd_10:/ # sysctl vfs.zfs
vfs.zfs.arc_max: 32373334016
vfs.zfs.arc_min: 8589934592
vfs.zfs.arc_meta_used: 17180938800
vfs.zfs.arc_meta_limit: 17179869184
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_norw: 1
vfs.zfs.anon_size: 16384
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_data_lsize: 0
vfs.zfs.mru_size: 18898130944
vfs.zfs.mru_metadata_lsize: 10115874816
vfs.zfs.mru_data_lsize: 8479965184
vfs.zfs.mru_ghost_size: 6764165120
vfs.zfs.mru_ghost_metadata_lsize: 6764165120
vfs.zfs.mru_ghost_data_lsize: 0
vfs.zfs.mfu_size: 4986851840
vfs.zfs.mfu_metadata_lsize: 4631966208
vfs.zfs.mfu_data_lsize: 0
vfs.zfs.mfu_ghost_size: 2492798976
vfs.zfs.mfu_ghost_metadata_lsize: 2492798976
vfs.zfs.mfu_ghost_data_lsize: 0
vfs.zfs.l2c_only_size: 0
vfs.zfs.dedup.prefetch: 1
vfs.zfs.nopwrite_enabled: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.prefetch_disable: 0
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.top_maxinflight: 32
vfs.zfs.resilver_delay: 2
vfs.zfs.scrub_delay: 4
vfs.zfs.scan_idle: 50
vfs.zfs.scan_min_time_ms: 1000
vfs.zfs.free_min_time_ms: 1000
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.no_scrub_io: 0
vfs.zfs.no_scrub_prefetch: 0
vfs.zfs.metaslab.gang_bang: 131073
vfs.zfs.metaslab.debug_load: 0
vfs.zfs.metaslab.debug_unload: 0
vfs.zfs.metaslab.df_alloc_threshold: 131072
vfs.zfs.metaslab.df_free_pct: 4
vfs.zfs.metaslab.min_alloc_size: 10485760
vfs.zfs.metaslab.load_pct: 50
vfs.zfs.metaslab.unload_delay: 8
vfs.zfs.metaslab.preload_limit: 3
vfs.zfs.metaslab.preload_enabled: 1
vfs.zfs.metaslab.weight_factor_enable: 0
vfs.zfs.condense_pct: 200
vfs.zfs.mg_alloc_failures: 12
vfs.zfs.mg_noalloc_threshold: 0
vfs.zfs.write_to_degraded: 0
vfs.zfs.check_hostid: 1
vfs.zfs.recover: 0
vfs.zfs.deadman_synctime_ms: 1000000
vfs.zfs.deadman_checktime_ms: 5000
vfs.zfs.deadman_enabled: 0
vfs.zfs.spa_asize_inflation: 24
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.max_active: 1000
vfs.zfs.vdev.sync_read_min_active: 10
vfs.zfs.vdev.sync_read_max_active: 10
vfs.zfs.vdev.sync_write_min_active: 10
vfs.zfs.vdev.sync_write_max_active: 10
vfs.zfs.vdev.async_read_min_active: 1
vfs.zfs.vdev.async_read_max_active: 3
vfs.zfs.vdev.async_write_min_active: 1
vfs.zfs.vdev.async_write_max_active: 10
vfs.zfs.vdev.scrub_min_active: 1
vfs.zfs.vdev.scrub_max_active: 2
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.max_auto_ashift: 13
vfs.zfs.zil_replay_disable: 0
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zio.use_uma: 1
vfs.zfs.zio.exclude_metadata: 0
vfs.zfs.sync_pass_deferred_free: 2
vfs.zfs.sync_pass_dont_compress: 5
vfs.zfs.sync_pass_rewrite: 2
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.super_owner: 0
vfs.zfs.debug: 0
vfs.zfs.version.ioctl: 3
vfs.zfs.version.acl: 1
vfs.zfs.version.spa: 5000
vfs.zfs.version.zpl: 5
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.max_interval: 1

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
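(Editor's note: the thread mentions raising the ARC metadata limit from 16 GB to 24 GB. As a small sketch, the /boot/loader.conf value is just 24 * 2^30 bytes; loader.conf tunables on FreeBSD take effect at the next boot. Per Lakhinder's own test above, a larger limit alone did not stop the metadata eviction, so treat this as the knob being discussed, not a fix.)

```shell
# Compute the byte value for a 24 GB ARC metadata limit and print the
# loader.conf line. 16 GB was 17179869184 in the posted config.
gib=24
bytes=$((gib * 1024 * 1024 * 1024))
echo "vfs.zfs.arc_meta_limit=$bytes   # ${gib} GB limit for ARC metadata"
```

This prints `vfs.zfs.arc_meta_limit=25769803776`, which would replace the existing 16 GB line in /boot/loader.conf.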
