Matt,

Thanks for your help.

There is not much CPU utilized in any of these runs; the CPUs are essentially 95% 
idle!

The I/O is not busy either -- nowhere near the sustained write throughput of 
900 MB/sec seen in step A.

You are right: the subsequent 1-minute runs are basically a 100% ARC hit, and 
the arcstats.pl 'read' column hits 50K+. Compared to this, the initial 
30-minute 'find' run was going at a sub-1K reads/sec rate!

I can throw a larger metadata cache allocation at this problem, but that won't 
solve the issue. For example, I upped the metadata cache allocation from 16 GB 
to 24 GB, with the total of 32 GB RAM unchanged. If I now run a _parallel_ 'dd' 
write operation, a (fully metadata-cached) 'find' degrades within seconds, from 
its 1-minute cycle to some verrry slow cycle. That tells me most of the 
metadata has been evicted; the fully cached 'find' is back to its very, very 
slow performance. Therefore, I think your code patch should help. I will try 
it soon.
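
For reference, the degradation experiment is roughly the following (the scratch 
file name file_3 is illustrative; paths match my setup in the steps below):

# warm the metadata cache, then confirm the ~1 minute cached cycle
root@fbsd_10:/zdata_1/sr_2gig # find . > /dev/null
root@fbsd_10:/zdata_1/sr_2gig # time find . > /dev/null

# kick off a large sequential write in parallel...
root@fbsd_10:/zdata_1/xx # dd if=/dev/zero of=file_3 bs=16M count=10240 &

# ...and re-time the 'find'; it degrades within seconds
root@fbsd_10:/zdata_1/sr_2gig # time find . > /dev/null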

Thanks for sharing your code.

=Lakhinder



Date: Sun, 22 Jun 2014 20:14:01 -0700
Subject: Re: [OpenZFS Developer] zfs metadata caching and back-to-back 'find' 
commands
From: [email protected]
To: [email protected]
CC: [email protected]

On Mon, Jun 16, 2014 at 12:42 PM, Lakhinder Walia <[email protected]> wrote:




Hi.

I have a FreeBSD 10.0-STABLE #0 r264414 based system with 32 GB RAM running 
ZFS/Geli.

I would like to understand whether the response pattern in these steps shows a bug.


Steps (details below): 

A. Boot the system. Create a large file, 5x the RAM, using dd.

B. Run a 'find' on a ZFS directory tree with approx. 32 million files.

C. Run the 'find' command _back-to-back_, for a total of 4 runs.


'find' response time drops from:
 32 min -> 17 min -> 1+ min -> 19 seconds.
(No other significant ZFS I/O is going on at this time.)

Question: 

Q1. Why are the second and third 'find' back-to-back steps taking so long? 

MFU should have (AFAIK) held on to the metadata after the first 'find' 
operation was over. Shouldn't the subsequent 'find' runs have executed faster!?
(There is no other I/O-intensive job running. No NFS clients are present in 
the tests, even though NFS sharing is configured.)

Have you looked at how much I/O and CPU are used in each of the 4 runs?  This 
would help identify if the problem is that we are not caching the data in the 
ARC, vs. if some other data structure is being torn down and re-created (e.g. 
vnode_t, znode_t, dnode_t).  I would guess that the 17 minute run is I/O bound, 
but the 1-minute run might be CPU-bound.
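
For example, something along these lines (I'm guessing 'zdata_1' is also the 
pool name, based on your mount points; adjust if not):

# per-run CPU time and block-I/O counts, via FreeBSD time(1) with -l (rusage)
root@fbsd_10:/zdata_1/sr_2gig # /usr/bin/time -l find . > /dev/null

# in another terminal, pool-level I/O for the duration of the run
root@fbsd_10:/ # zpool iostat zdata_1 5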

If it is I/O bound, the next step would be to determine why the data is not 
cached after the first run.  You might start by looking at the ARC stats before 
and after each run.
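
On FreeBSD the ARC counters live under the kstat.zfs.misc.arcstats sysctl 
tree, so a rough before/after diff per run would be:

root@fbsd_10:/ # sysctl kstat.zfs.misc.arcstats > /tmp/arc.before
root@fbsd_10:/zdata_1/sr_2gig # find . > /dev/null
root@fbsd_10:/ # sysctl kstat.zfs.misc.arcstats > /tmp/arc.after
root@fbsd_10:/ # diff /tmp/arc.before /tmp/arc.after | egrep 'hits|misses|evict'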

If it is CPU bound, a flame graph might quickly show the problem area.  
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
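
On FreeBSD, a kernel flame graph can be generated with DTrace plus the 
stackcollapse.pl and flamegraph.pl scripts from that page's repository, roughly:

# sample kernel stacks (/arg0/ filters to kernel PCs) for 60s while 'find' runs
root@fbsd_10:/ # dtrace -x stackframes=100 \
    -n 'profile-199 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' \
    -o out.kern_stacks
root@fbsd_10:/ # stackcollapse.pl out.kern_stacks > out.kern_folded
root@fbsd_10:/ # flamegraph.pl out.kern_folded > kernel.svg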

It's possible that after the first run we are not using enough of the ARC to 
cache metadata (despite the fact that you've raised the metadata limit large 
enough).  I recently did some work to address this issue, which is not yet 
integrated to illumos or FreeBSD.  You can see it here:

http://reviews.csiden.org/r/50 
 
Q2. I have another issue, though not detailed here. A Ruby script doing a 
traversal of all the files via its 'Find' API becomes very, very slow, 
particularly after a second 'find' job kicks in in parallel, launched by a 
default system security policy (/etc/periodic/security/110.neggrpperm). When I 
kill that second 'find', the Ruby script's 'Find'-based traversal _still_ 
continues at its very slow pace, for hours. A script that just lists files, 
and could have finished in two hours, instead takes approx. 10 hours, even 
when it is the only file I/O going on.

Are both 'find' commands looking at the same directory?  If not, the new 'find' 
command could be displacing some of the cache entries needed by the first 
command.

Steps to analyze this are similar to what I mentioned above -- determine if it 
is CPU or I/O bound, then what is causing that.
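
For the already-running Ruby process, FreeBSD top's io mode versus the default 
cpu mode gives a quick read on which it is:

# per-process I/O counters; a process stalled on disk shows reads here
root@fbsd_10:/ # top -m io -o total

# if it is spinning instead, it will show up under WCPU in cpu mode
root@fbsd_10:/ # top -o cpu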


Q3. Could the two issues above possibly be related?

I'm not sure.
--matt 

Thanks,

=Lakhinder
 


Steps: ==============================================>

#  Step A.  Freshly booted system.
root@fbsd_10:/zdata_1/xx # dd if=/dev/zero of=file_2 bs=16M count=10240; date
10240+0 records in
10240+0 records out
171798691840 bytes transferred in 192.188421 secs (893907609 bytes/sec)

# Step B.
# cd over to another directory with approx. 32 million files
# in its various subdirectories.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 15:52:32 PDT 2014
 32673014 32673014 2378434006
Sun Jun 15 16:24:22 PDT 2014
# time taken is approx. 32 minutes.


# Step C-1.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:24:22 PDT 2014
 32673014 32673014 2378434006
Sun Jun 15 16:41:56 PDT 2014
# time taken is approx. 17 minutes.

# Step C-2.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:41:56 PDT 2014
 32673014 32673014 2378434006
Sun Jun 15 16:43:07 PDT 2014
# time taken is approx 1+ minute.

# Step C-3.
root@fbsd_10:/zdata_1/sr_2gig # date; find . | wc; date
Sun Jun 15 16:43:07 PDT 2014
 32673014 32673014 2378434006
Sun Jun 15 16:43:26 PDT 2014
# time taken is 19 seconds.

--------------- system details --------------------------------------

OS version:

FreeBSD fbsd_10 10.0-STABLE FreeBSD 10.0-STABLE #0 r264414: 
Sun Apr 13 19:33:56 UTC 2014     
[email protected]:/usr/obj/usr/src/sys/GENERIC  amd64

cat /boot/loader.conf:
aesni_load="YES"
geli_load="YES"
zfs_load="YES"
vfs.zfs.arc_meta_limit=17179869184
# 16 GB limit for ARC Meta.
kern.maxvnodes=50000000



Here is 'top' after running a _recent_ 'find'. 

last pid: 75231;  load averages:  0.48,  0.28,  0.33    up 0+20:20:26  11:58:50
37 processes:  1 running, 36 sleeping
CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Mem: 28M Active, 796K Inact, 30G Wired, 2620K Cache, 97M Buf, 911M Free
ARC: 24G Total, 4756M MFU, 18G MRU, 16K Anon, 385M Header, 1308M Other
Swap: 1638M Total, 41M Used, 1597M Free, 2% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 2486 root          1  21    0 34980K  2236K nanslp  6   1:46   0.00% perl5.16.3
 2478 root          1  20    0 56300K 15720K select  3   1:14   0.00% tmux
 2483 root          1  20    0 19824K  1612K CPU5    5   0:47   0.00% top

 ****

# after the _recent_ refresh with 'find':
root@fbsd_10:/ # sysctl vfs.zfs
vfs.zfs.arc_max: 32373334016
vfs.zfs.arc_min: 8589934592
vfs.zfs.arc_meta_used: 17180938800
vfs.zfs.arc_meta_limit: 17179869184
vfs.zfs.l2arc_write_max: 8388608
vfs.zfs.l2arc_write_boost: 8388608
vfs.zfs.l2arc_headroom: 2
vfs.zfs.l2arc_feed_secs: 1
vfs.zfs.l2arc_feed_min_ms: 200
vfs.zfs.l2arc_noprefetch: 1
vfs.zfs.l2arc_feed_again: 1
vfs.zfs.l2arc_norw: 1
vfs.zfs.anon_size: 16384
vfs.zfs.anon_metadata_lsize: 0
vfs.zfs.anon_data_lsize: 0
vfs.zfs.mru_size: 18898130944
vfs.zfs.mru_metadata_lsize: 10115874816
vfs.zfs.mru_data_lsize: 8479965184
vfs.zfs.mru_ghost_size: 6764165120
vfs.zfs.mru_ghost_metadata_lsize: 6764165120
vfs.zfs.mru_ghost_data_lsize: 0
vfs.zfs.mfu_size: 4986851840
vfs.zfs.mfu_metadata_lsize: 4631966208
vfs.zfs.mfu_data_lsize: 0
vfs.zfs.mfu_ghost_size: 2492798976
vfs.zfs.mfu_ghost_metadata_lsize: 2492798976
vfs.zfs.mfu_ghost_data_lsize: 0
vfs.zfs.l2c_only_size: 0
vfs.zfs.dedup.prefetch: 1
vfs.zfs.nopwrite_enabled: 1
vfs.zfs.mdcomp_disable: 0
vfs.zfs.prefetch_disable: 0
vfs.zfs.zfetch.max_streams: 8
vfs.zfs.zfetch.min_sec_reap: 2
vfs.zfs.zfetch.block_cap: 256
vfs.zfs.zfetch.array_rd_sz: 1048576
vfs.zfs.top_maxinflight: 32
vfs.zfs.resilver_delay: 2
vfs.zfs.scrub_delay: 4
vfs.zfs.scan_idle: 50
vfs.zfs.scan_min_time_ms: 1000
vfs.zfs.free_min_time_ms: 1000
vfs.zfs.resilver_min_time_ms: 3000
vfs.zfs.no_scrub_io: 0
vfs.zfs.no_scrub_prefetch: 0
vfs.zfs.metaslab.gang_bang: 131073
vfs.zfs.metaslab.debug_load: 0
vfs.zfs.metaslab.debug_unload: 0
vfs.zfs.metaslab.df_alloc_threshold: 131072
vfs.zfs.metaslab.df_free_pct: 4
vfs.zfs.metaslab.min_alloc_size: 10485760
vfs.zfs.metaslab.load_pct: 50
vfs.zfs.metaslab.unload_delay: 8
vfs.zfs.metaslab.preload_limit: 3
vfs.zfs.metaslab.preload_enabled: 1
vfs.zfs.metaslab.weight_factor_enable: 0
vfs.zfs.condense_pct: 200
vfs.zfs.mg_alloc_failures: 12
vfs.zfs.mg_noalloc_threshold: 0
vfs.zfs.write_to_degraded: 0
vfs.zfs.check_hostid: 1
vfs.zfs.recover: 0
vfs.zfs.deadman_synctime_ms: 1000000
vfs.zfs.deadman_checktime_ms: 5000
vfs.zfs.deadman_enabled: 0
vfs.zfs.spa_asize_inflation: 24
vfs.zfs.txg.timeout: 5
vfs.zfs.vdev.cache.max: 16384
vfs.zfs.vdev.cache.size: 0
vfs.zfs.vdev.cache.bshift: 16
vfs.zfs.vdev.trim_on_init: 1
vfs.zfs.vdev.max_active: 1000
vfs.zfs.vdev.sync_read_min_active: 10
vfs.zfs.vdev.sync_read_max_active: 10
vfs.zfs.vdev.sync_write_min_active: 10
vfs.zfs.vdev.sync_write_max_active: 10
vfs.zfs.vdev.async_read_min_active: 1
vfs.zfs.vdev.async_read_max_active: 3
vfs.zfs.vdev.async_write_min_active: 1
vfs.zfs.vdev.async_write_max_active: 10
vfs.zfs.vdev.scrub_min_active: 1
vfs.zfs.vdev.scrub_max_active: 2
vfs.zfs.vdev.aggregation_limit: 131072
vfs.zfs.vdev.read_gap_limit: 32768
vfs.zfs.vdev.write_gap_limit: 4096
vfs.zfs.vdev.bio_flush_disable: 0
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.trim_max_bytes: 2147483648
vfs.zfs.vdev.trim_max_pending: 64
vfs.zfs.max_auto_ashift: 13
vfs.zfs.zil_replay_disable: 0
vfs.zfs.cache_flush_disable: 0
vfs.zfs.zio.use_uma: 1
vfs.zfs.zio.exclude_metadata: 0
vfs.zfs.sync_pass_deferred_free: 2
vfs.zfs.sync_pass_dont_compress: 5
vfs.zfs.sync_pass_rewrite: 2
vfs.zfs.snapshot_list_prefetch: 0
vfs.zfs.super_owner: 0
vfs.zfs.debug: 0
vfs.zfs.version.ioctl: 3
vfs.zfs.version.acl: 1
vfs.zfs.version.spa: 5000
vfs.zfs.version.zpl: 5
vfs.zfs.trim.enabled: 1
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.max_interval: 1

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
