Quoting Christian Brauner ([email protected]):
> We have quite a few use cases where users already run into the current limit
> for {g,u}id mappings. Consider a user requesting us to map everything but 999
> and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>
> some-user:100000:1000000000
> some-user:999:1
> some-user:1000:1
> some-user:1001:1
> some-user:1002:1
>
> This translates to:
>
> MAPPING-TYPE  CONTAINER     HOST      RANGE
> uid                 999      999          1
> uid                1001     1001          1
> uid                   0  1000000        999
> uid                1000  1001000          1
> uid                1002  1001002  999998998
>
> gid                 999      999          1
> gid                1001     1001          1
> gid                   0  1000000        999
> gid                1000  1001000          1
> gid                1002  1001002  999998998
>
> which is already at the current limit.
>
> Design Notes:
> As discussed at LPC, simply bumping the number of limits is not going to work
> since this would mean that struct uid_gid_map won't fit into a single cache
> line anymore, thereby regressing performance for the base cases. The same
> problem seems to arise when using a single pointer. So the idea is to keep
> the base cases (0-3 mappings) directly in struct uid_gid_map so they fit into
> a single cache line of 64 bytes. In place of the two removed mappings we put
> three pointers in the struct that mimic the behavior of traditional
> filesystems:
> 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
> 2. an indirect pointer to a 64-byte array of direct pointers, each pointing
>    to a block of 5 struct uid_gid_extent mappings (60 bytes each)
> 3. a double indirect pointer to a 64-byte array of indirect pointers, each
>    pointing to a 64-byte array of direct pointers (and so on)
> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
> 5)) = 368 mappings, which should really be enough. The idea of this approach
> is to always have each block of idmap extents (struct uid_gid_extent) be 60
> bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
> This should only incur a linear performance impact caused by iterating
> through the idmappings in a for-loop. Note that the base cases shouldn't see
> any performance degradation, which is the most important part.
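To make the layout concrete, here is a rough sketch of what a structure along
those lines could look like. The names, field order, and helper below are my
own guesses for illustration only, not taken from the actual patch:

/*
 * Illustrative sketch only -- not the actual patch. It mirrors the layout
 * described above: three extents stay inline so the common 0-3 mapping
 * cases keep struct uid_gid_map within a single 64-byte cache line, and
 * larger maps spill into filesystem-style direct/indirect/double-indirect
 * blocks of five extents (60 bytes) each.
 */
#include <stdint.h>

struct uid_gid_extent {
	uint32_t first;        /* id inside the namespace */
	uint32_t lower_first;  /* id in the parent namespace */
	uint32_t count;
};                             /* 12 bytes; a block of 5 is 60 bytes */

#define MAP_INLINE_EXTENTS 3
#define MAP_BLOCK_EXTENTS  5
#define MAP_PTRS_PER_BLOCK 8   /* 64 bytes / 8-byte pointer */

struct uid_gid_extent_block {
	struct uid_gid_extent extent[MAP_BLOCK_EXTENTS]; /* 60 bytes */
};

struct uid_gid_map {
	uint32_t nr_extents;
	struct uid_gid_extent inline_extent[MAP_INLINE_EXTENTS]; /* 36 bytes */
	struct uid_gid_extent_block *direct;      /* +5 extents             */
	struct uid_gid_extent_block **indirect;   /* 8 blocks  -> +40       */
	struct uid_gid_extent_block ***dindirect; /* 8 arrays  -> +320      */
};
/*
 * 4 + 36 + 3*8 = 64 bytes, assuming 8-byte pointers and no padding.
 * Capacity: 3 + 5 + 40 + 320 = 368 mappings.
 */

/*
 * Linear lookup over the inline extents (the spilled blocks would be
 * walked the same way, which is where the linear cost for >3 mappings
 * comes from).
 */
static uint32_t map_id_down_sketch(const struct uid_gid_map *map, uint32_t id)
{
	uint32_t n = map->nr_extents < MAP_INLINE_EXTENTS ?
		     map->nr_extents : MAP_INLINE_EXTENTS;

	for (uint32_t i = 0; i < n; i++) {
		const struct uid_gid_extent *e = &map->inline_extent[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return (uint32_t)-1; /* no mapping: caller falls back to overflowuid */
}

The point is just that the hot path for 0-3 mappings never leaves the 64-byte
structure, while larger maps pay one extra pointer chase per level plus the
linear scan.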
Sounds like a good plan.

> Performance Testing:
> When Eric introduced the extent-based struct uid_gid_map approach he measured
> the performance impact of his idmap changes:
>
> > My benchmark consisted of going to single user mode where nothing else was
> > running. On an ext4 filesystem opening 1,000,000 files and looping through
> > all of the files 1000 times and calling fstat on the individual files. This
> > was to ensure I was benchmarking stat times where the inodes were in the
> > kernel's cache, but the inode values were not in the processor's cache.
> > My results:
> >
> > v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support
> >                    disabled)
> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and
> >                    user namespace support disabled)
> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and
> >                    user namespace support enabled)
>
> I used an identical approach on my laptop. Here's a thorough description of
> what I did. I built three kernels and used an additional "control" kernel:
>
> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits
>    patch)

^ you mean *with* your patch but with CONFIG_USER_NS=n ?

> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)

^ Just curious, why did you include this? To show that other factors have a
much larger impact? This does not include your patch, right?

> I booted into single user mode (systemd rescue target in newspeak) and used
> an ext4 filesystem to open/create 1,000,000 files. Then I looped through all
> of the files calling fstat() on each of them 1000 times and calculated the
> mean fstat() time for a single file. (The test program can be found below.)
>
> For the kernels v4.14-rc2-vanilla and v4.12.0-12-generic I tested the
> following cases:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
>
> For the kernel v4.14-rc2-userns+ I tested:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
> 10 mappings
> 50 mappings
> 100 mappings
> 200 mappings
> 300 mappings
>
> Here are the results:
>
> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
>   # no unshare:                   312 ns
>   unshare -U # write 0 mappings:  307 ns
>   unshare -U # write 1 mappings:  328 ns
>   unshare -U # write 2 mappings:  328 ns
>   unshare -U # write 3 mappings:  328 ns
>   unshare -U # write 5 mappings:  338 ns
>
> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>   # no unshare:                    158 ns
>   unshare -U # write 0 mappings:   158 ns
>   unshare -U # write 1 mappings:   164 ns
>   unshare -U # write 2 mappings:   170 ns
>   unshare -U # write 3 mappings:   175 ns
>   unshare -U # write 5 mappings:   187 ns
>   unshare -U # write 10 mappings:  218 ns
>   unshare -U # write 50 mappings:  528 ns
>   unshare -U # write 100 mappings: 980 ns
>   unshare -U # write 200 mappings: 1880 ns
>   unshare -U # write 300 mappings: 2760 ns
>
> - v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits
>   patch)
>   # no unshare:                   161 ns
>
> - v4.12.0-12-generic Ubuntu kernel:
>   # no unshare:                   328 ns
>   unshare -U # write 0 mappings:  327 ns
>   unshare -U # write 1 mappings:  328 ns
>   unshare -U # write 2 mappings:  328 ns
>   unshare -U # write 3 mappings:  328 ns
>   unshare -U # write 5 mappings:  338 ns

^ This is really weird. Why does the Ubuntu kernel have near-constant
(horrible) times?

> I've tested this multiple times and the numbers hold up.
> All v4.14-rc2 kernels were built on the same machine with the same .config,
> the same options, and a simple call to make -j 11 bindeb-pkg. The 4.12 kernel
> was simply installed from the Ubuntu archives.
>
> The most important part, it seems to me, is that my idmap patches don't
> regress performance for the base cases. I'd actually only consider 0 and 1
> mapping to be the proper base cases.

Agreed.  Now personally I probably would have kept 4 direct pointers and made
the 5+ case hurt more, but I'm not saying that's the right thing.  (I haven't
looked at the patch itself yet.)

thanks,
-serge
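P.S.: the test program referenced above isn't reproduced in the quoted part.
For anyone wanting to replicate the numbers, a minimal stand-in for the kind
of fstat() loop described could look roughly like the following -- the file
count, names, and output format are made up here; the real run used 1,000,000
files on ext4 in single user mode:

/*
 * Rough stand-in for the fstat() benchmark described above -- not the
 * author's actual test program. Creates NR_FILES files, then calls
 * fstat() on each open fd NR_ROUNDS times and prints the mean time per
 * fstat() call in nanoseconds.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define NR_FILES  1000   /* the quoted run used 1,000,000; scaled down here */
#define NR_ROUNDS 1000

int main(void)
{
	static int fds[NR_FILES];
	struct stat st;
	struct timespec start, end;
	char name[64];

	for (int i = 0; i < NR_FILES; i++) {
		snprintf(name, sizeof(name), "bench-%d", i);
		fds[i] = open(name, O_CREAT | O_RDONLY, 0644);
		if (fds[i] < 0) {
			perror("open");
			return 1;
		}
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int r = 0; r < NR_ROUNDS; r++)
		for (int i = 0; i < NR_FILES; i++)
			if (fstat(fds[i], &st) < 0) {
				perror("fstat");
				return 1;
			}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double total_ns = (end.tv_sec - start.tv_sec) * 1e9 +
			  (end.tv_nsec - start.tv_nsec);
	printf("mean fstat(): %.0f ns\n",
	       total_ns / ((double)NR_FILES * NR_ROUNDS));
	return 0;
}

Run once per kernel and mapping case (optionally under unshare -U after
writing the desired number of mappings) and compare the reported means.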

