Your message dated Thu, 22 Feb 2018 17:16:39 +0100
with message-id <1519316199.2617.234.ca...@decadent.org.uk>
and subject line Re: Increased ext4_inode_cache size wastes RAM under default SLAB allocator
has caused the Debian Bug report #861964,
regarding Increased ext4_inode_cache size wastes RAM under default SLAB allocator
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)

--
861964: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861964
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
--- Begin Message ---
Package: linux-image-amd64
Version: 4.9+80

Debian's use of the SLAB allocator, combined with ongoing kernel changes, means the ext4 inode cache wastes ~21% of the space allocated to it on recent amd64 kernels, a regression from the ~2% waste in jessie.

SLAB enforces a first-order allocation (i.e. 4KB on x86[-64]) for slabs containing VFS-reclaimable objects such as ext4_inode_info:
http://elixir.free-electrons.com/linux/v4.9.25/source/mm/slab.c#L1827

In jessie's Linux 3.16 kernel, an ext4_inode_cache entry is ~1000 bytes, so four fit nicely in a slab. Additions to this structure and its members have increased it to ~1072 bytes in 4.9.25 (on a machine with 32 logical cores):

# grep ext4_inode_cache /proc/slabinfo
name             <active_objs> <num_objs> <objsize> <objperslab>
ext4_inode_cache 956           987        1072      3 …

…leaving 880 bytes wasted per slab in Debian stretch (and jessie-backports).

Having 3 objects vs. 4 per slab may reduce internal fragmentation, but inodes can't linger for as long, and creating them evicts data, leading to increased disk activity. Slab cache allocation takes time, and if the slabs were denser, more inodes (or other content) could fit in CPU cache.

By comparison, mainline's default SLUB allocator (used by Ubuntu) appears to use a 4-page/16KB or 8-page/32KB slab size, which fits 15/30 ext4_inode_cache objects. This has also decreased since 3.16, but it is not as wasteful.

The inode cache is initially small, but may grow to ~50% of RAM under heavy workloads, e.g. a fileserver rsync.

== Possible workarounds/resolutions ==

A custom-compiled kernel with the right options reduces the ext4_inode_cache object size below 1000 bytes - for me, it cut ~160MB from slab_cache on an active 32GB web app/file server with nightly rsync. (It may also reduce CPU and disk utilization, but the load in question is not constant enough to benchmark.)
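The arithmetic behind these figures can be checked with a short sketch (my illustration, not part of the original report; it assumes a 4 KB first-order SLAB slab and an 8-page SLUB slab, and ignores per-slab management overhead, which SLAB keeps off-slab for objects of this size):

```python
PAGE = 4096  # x86-64 page size; SLAB uses first-order slabs for reclaimable caches

def slab_waste(slab_bytes, obj_bytes):
    """Return (objects per slab, wasted bytes, waste fraction) for one slab."""
    objs = slab_bytes // obj_bytes
    wasted = slab_bytes - objs * obj_bytes
    return objs, wasted, wasted / slab_bytes

# jessie (3.16): ~1000-byte inodes under SLAB
print(slab_waste(PAGE, 1000))      # 4 objects, 96 B wasted (~2%)
# stretch (4.9.25): 1072-byte inodes under SLAB
print(slab_waste(PAGE, 1072))      # 3 objects, 880 B wasted (~21%)
# SLUB with an 8-page (32 KB) slab, same 1072-byte object
print(slab_waste(8 * PAGE, 1072))  # 30 objects, 608 B wasted (~2%)
```

This reproduces the numbers above: 3 × 1072 = 3216 bytes used of 4096, hence 880 bytes (~21%) wasted per slab, versus ~2% both in jessie and under SLUB's larger slabs.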
Some flags have a big impact on ext4_inode_info (and subsidiary structs such as rw_semaphore):
http://elixir.free-electrons.com/linux/v4.9.25/source/fs/ext4/ext4.h#L937

The precise sizes change with kernel version and CPU configuration. For jessie-backports' Linux 4.7.8, disabling both

* ext4 encryption (CONFIG_EXT4_FS_ENCRYPTION)

_and_ either:

a) VFS quota (CONFIG_QUOTA; OCFS2 must be disabled first), or
b) optimistic rw_semaphore spinning (CONFIG_RWSEM_SPIN_ON_OWNER)

reduced ext4_inode_cache objects to 1008-1016 bytes - sufficient to fit four inodes in a slab. This worked on 4.8.7 as well, reducing the size to exactly 1024 bytes.

But custom compilation is time-consuming and workload-dependent. Dropping ext4 encryption and quota is fine for our purposes, but Debian may not want to. Disabling optimistic semaphore owner spinning - perhaps below a certain number of cores? - may be part of a general solution; there's no menu option for CONFIG_RWSEM_SPIN_ON_OWNER, so it has to be set in the build config, or possibly on the command line.

https://lkml.org/lkml/2014/8/3/120 suggests optimistic spinning improves some contention-heavy workloads - or at least benchmarks thereof - but it may not be worth the trade-off by default. Incidentally, I found zero documentation that it may negatively impact memory usage.
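A quick arithmetic check of the threshold at work here (my sketch, again assuming a 4 KB first-order SLAB slab and ignoring management overhead): four objects fit per slab exactly when each is at most 4096/4 = 1024 bytes, which is why 1008-1016 bytes (and exactly 1024 on 4.8.7) suffice while 1072 leaves room for only three:

```python
PAGE = 4096  # first-order SLAB slab on x86-64

def objects_per_slab(obj_bytes, slab_bytes=PAGE):
    """How many whole objects fit in one slab, ignoring management overhead."""
    return slab_bytes // obj_bytes

for size in (1008, 1016, 1024, 1072):
    print(size, objects_per_slab(size))  # 1008-1024 -> 4 per slab, 1072 -> 3
```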
Getting into more significant code changes: Ted Ts'o shrank ext4_inode_info by 8% six years ago:
http://linux-ext4.vger.kernel.narkive.com/D3sK9Flg/patch-0-6-shrinking-the-size-of-ext4-inode-info

…but it has since grown ~22%, due to features such as ext4 encryption, project-based quota, and the aforementioned optimistic spinning on the three read-write semaphores in the struct:
https://github.com/torvalds/linux/commit/4fc828e24cd9c385d3a44e1b499ec7fc70239d8a
https://github.com/torvalds/linux/commit/ce069fc920e5734558b3d9cbef1ab06cf01ee793
https://lwn.net/Articles/697603/

Ted mentioned that "it would be possible to further slim down the ext4_inode_cache by another 100 bytes or so, by breaking the ext4_inode_info into the portion of the inode required [when] a file is opened for writing, and everything else." This might be worth it, given that we're on the borderline, and particularly if rw_semaphore is included; there are attempts to make those even bigger:
http://lists-archives.com/linux-kernel/28643980-locking-rwsem-enable-count-based-spinning-on-reader.html

Adding a define to configure out project quota (kprojid_t i_projid) may cut a few bytes - or maybe more, given alignment. I don't know whether this would have a negative impact on filesystems that use project quotas, other than the feature not working; at least it would give another knob to tweak.

Adjusting struct alignment may also be beneficial, either in all cases or based on the presence/absence of flags, as in:
https://patchwork.ozlabs.org/patch/62051/

ext4_inode_info appears to contain a copy of the 256-byte on-disk format. Maybe it's feasible to use some of this in-place rather than duplicating it and writing it back later? Or it could be separated into its own object; it's a nice round size.
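The alignment point can be illustrated with a toy sketch (field names are hypothetical, not taken from ext4_inode_info): interleaving 4-byte and 8-byte members forces the compiler to insert padding that grouping same-sized members avoids. On x86-64:

```python
import ctypes

# Hypothetical struct with 4-byte scalars interleaved between 8-byte members:
# each 8-byte member must start on an 8-byte boundary, costing 4 bytes of
# padding after each preceding 4-byte field.
class Interleaved(ctypes.Structure):
    _fields_ = [
        ("flags", ctypes.c_uint),       # 4 bytes at offset 0
        ("block", ctypes.c_ulonglong),  # 8 bytes at offset 8 (4 bytes padding)
        ("state", ctypes.c_uint),       # 4 bytes at offset 16
        ("size",  ctypes.c_ulonglong),  # 8 bytes at offset 24 (4 bytes padding)
    ]

# Same fields, 8-byte members first: no interior padding at all.
class Grouped(ctypes.Structure):
    _fields_ = [
        ("block", ctypes.c_ulonglong),
        ("size",  ctypes.c_ulonglong),
        ("flags", ctypes.c_uint),
        ("state", ctypes.c_uint),
    ]

print(ctypes.sizeof(Interleaved))  # 32 on x86-64
print(ctypes.sizeof(Grouped))      # 24 on x86-64
```

Scaled up to a struct with dozens of members, reordering along these lines is where the "few bytes - or maybe more, given alignment" could come from.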
(In-place use of the on-disk copy may violate style guidelines, if nothing else…)

Lastly, 32-bit and uniprocessor kernels have a far smaller ext4_inode_cache - I got one down to 560 bytes (7 obj/slab) - and may remain beneficial where RAM is strictly limited (VMs in particular).

== SLAB vs. SLUB ==

Debian's use of SLAB allocation (vs. SLUB) might also be reconsidered, but I'm not sure this is as useful as simply reducing the inode size. Both allocators appear to have improved over time (e.g. SLAB gained 1-byte freelist entries); if anything, SLAB has had more work recently.

The view in 2012 appeared to be that SLUB was less suitable for multiprocessor systems than SLAB:
https://lists.debian.org/debian-kernel/2012/03/msg00944.html

And while Linus seems to want to get rid of SLAB:
http://marc.info/?l=linux-mm&m=147423350524545&w=2

…it seems SuSE also still uses it:
http://marc.info/?l=linux-mm&m=147426644529856&w=2

In fact, the problem discussed there might have been avoided with SLAB, because it would have soaked up 4K blocks:
http://marc.info/?l=linux-mm&m=147422898523307&w=2

Reducing structure size would benefit every allocator, so that should probably be the focus.

--
Laurence "GreenReaper" Parry - Inkbunny administrator
greenreaper.co.uk - wikifur.com - flayrah.com - inkbunny.net
"Eternity lies ahead of us, and behind. Have you drunk your fill?"
--- End Message ---
--- Begin Message ---
Version: 4.15.4-1

We've switched to using SLUB in unstable.

Ben.

--
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had
thought. ... I realized that a large part of my life from then on was
going to be spent in finding mistakes in my own programs.
- Maurice Wilkes, 1949
--- End Message ---