For very large IBM Power mainframe systems with hundreds of CPUs and TBs
of RAM booting can take a very long time.

Initial reports showed that booting a configuration of several hundred
CPUs and 64TB of RAM would take more than 30 minutes and require kernel
parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600
to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit
challenging but two main issues appeared to be: a large number of path
lookups for non-existent files, and very high lock contention in the VFS
during path walks particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs
memory objects, 100,000+ for a 64TB memory configuration as the hardware
divides the memory into 256MB logical blocks. This is believed to be due
to either IBM Power hardware design or a requirement of the mainframe
software used to create logical partitions (LPARs, that are used to
install an operating system to provide services), since these can be made
up of a wide range of resources, CPU, Memory, disks, etc.

It's unclear yet whether the creation of syfs nodes for these memory
devices can be postponed or spread out over a larger amount of time.
That's because the high overhead looks to be due to notifications received
by udev which invokes a systemd program for them and attempts by systemd
folks to improve this have not focused on changing the handling of these
notifications, possibly because of difficulties with doing so. This
remains an avenue of investigation.

Kernel traces show there are many path walks with a fairly large portion
of those for non-existent paths. However, looking at the systemd code
invoked by the udev action it appears there's only one additional lookup
for each invocation so the large number of negative lookups is most likely
due to the large number of notifications rather than a fault with the
systemd program.

The series here tries to reduce the locking needed during path walks
based on the assumption that there are many path walks with a fairly
large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (non-existent
paths) to avoid continual alloc/free cycle of dentries and a read/write
semaphore introduced to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048
and systemd.default_timeout_start_sec=300 for the fastest boot times of
under 5 minutes.

There may be opportunities for further improvements but the series here
has seen a fair amount of testing and thinking about what else these could
be. Discussing it with Rick Lindsay, I suspect improvements will get more
difficult to implement for somewhat less improvement so I think what we
have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock


 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian

Reply via email to