Hi Erik,

It's great to hear positive feedback! Thanks for taking the time to send this email. It means a lot to us :)
On Thu, Apr 9, 2020 at 10:55 AM Strahil Nikolov <hunter86...@yahoo.com> wrote:
> On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson <erik.jacob...@hpe.com> wrote:
> >I wanted to share some positive news with the group here.
> >
> >Summary: Using sharding and squashfs image files instead of expanded
> >directory trees for RO NFS OS images has led to impressive boot times
> >for 2k-node diskless clusters using 12 servers for gluster+tftp+etc+etc.
> >
> >Details:
> >
> >As you may have seen in some of my other posts, we have been using
> >gluster to boot giant clusters, some of which are in the top500 list of
> >HPC resources. The compute nodes are diskless.
> >
> >Up until now, we have done this by pushing an operating system from our
> >head node to the storage cluster, which is made up of one or more
> >3-server/(3-brick) subvolumes in a distributed/replicate configuration.
> >The servers are also PXE-boot and tftpboot servers and also serve the
> >"miniroot" (basically a fat initrd with a cluster manager toolchain).
> >We also locate other management functions there unrelated to boot and
> >root.
> >
> >This copy of the operating system is simply a directory tree
> >representing the whole operating system image. You could 'chroot' into
> >it, for example.
> >
> >So this operating system is a read-only NFS mount point used by all
> >compute nodes as the base of their root filesystem.
> >
> >This has been working well, getting us boot times (not including BIOS
> >startup) of between 10 and 15 minutes for a 2,000-node cluster.
> >Typically a cluster like this would have 12 gluster/NFS servers in 3
> >subvolumes. On simple RHEL8 images without much customization, I tend
> >to get 10 minutes.
> >
> >We have observed some slowdowns with certain job launch workloads for
> >customers whose job launch is very metadata intensive. The metadata
> >load of such an operation is very heavy, with giant loads being
> >observed on the gluster servers.
> >
> >We recently started supporting RW NFS, as opposed to TMPFS, for the
> >writable components of root in this solution. Our customers tend to
> >prefer to keep every byte of memory for jobs. We came up with a
> >solution that hosts sparse files, each containing an XFS filesystem,
> >in a writable gluster area exported over NFS. This makes the RW NFS
> >solution very fast because it reduces per-node RW NFS metadata
> >traffic. Boot times didn't go up significantly (but our first attempt,
> >which just used a directory tree, was a slow disaster, hitting the
> >worst-case lots-of-small-file-writes plus lots-of-metadata workload).
> >So we solved that problem with XFS filesystem images on RW NFS.
> >
> >Building on that idea, we have, in our development branch, a version
> >of the solution that changes the RO NFS image to a squashfs file on a
> >sharding volume. That is, instead of each operating system being many
> >thousands of files that are (slowly) synced to the gluster servers,
> >the head node makes a squashfs file out of the image and pushes that.
> >Then all the compute nodes mount the squashfs image from the NFS mount
> >(mount the RO NFS export, then loop-mount the squashfs image).
> >
> >On a 2,000-node cluster I had access to for a time, our prototype got
> >us boot times of 5 minutes -- including RO NFS with squashfs and the
> >RW NFS for writable areas like /etc, /var, etc. (on an XFS image file).
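For readers who want to picture the squashfs flow described above, here is a minimal sketch. All paths, volume names, and option values are illustrative assumptions rather than details from Erik's setup; features.shard and features.shard-block-size are the standard gluster options for sharding.

  # One-time setup: enable sharding on the (hypothetical) "images" volume
  # so a single large squashfs file is split into fixed-size shards spread
  # across the bricks instead of landing whole on one brick.
  gluster volume set images features.shard on
  gluster volume set images features.shard-block-size 64MB

  # On the head node: pack the expanded OS image tree into one squashfs
  # file and place it on the gluster volume the leaders export read-only
  # over NFS.
  mksquashfs /var/lib/images/rhel8-compute \
      /mnt/gl-images/rhel8-compute.squashfs -comp xz -noappend

  # On each compute node (from the miniroot/initrd): mount the RO NFS
  # export, then loop-mount the squashfs image to use as the root base.
  mount -o ro,nolock leader1:/images /mnt/images
  mount -o loop,ro /mnt/images/rhel8-compute.squashfs /mnt/root-ro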
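Similarly, a rough sketch of the writable side, one sparse XFS image file per node on a RW NFS export backed by gluster; sizes and paths are again only placeholders:

  # On the head node or a leader: create a sparse backing file per node
  # and format it as XFS, so each node's writable data is one file on the
  # RW NFS export rather than thousands of small files.
  truncate -s 20G /mnt/gl-rw/nodes/n0001.img
  mkfs.xfs -q /mnt/gl-rw/nodes/n0001.img

  # On the compute node: mount the RW NFS export, then loop-mount the
  # node's own XFS image for writable areas such as /etc and /var.
  mount -o rw,nolock leader1:/rw /mnt/rw
  mount -o loop /mnt/rw/nodes/n0001.img /mnt/root-rw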
> > * We also tried RW NFS with OVERLAY and no problem there
> >
> >I expect that, for people who prefer the non-expanded squashfs format,
> >we can reduce the leader-per-compute density.
> >
> >Now, not all customers will want squashfs. Some want to be able to
> >edit a file and see it instantly on all nodes. However, customers
> >looking for fast boot times, or who are suffering slowness on
> >metadata-intensive job launch workloads, will have a new, fast option.
> >
> >Therefore, it's very important we still solve the bug we're working on
> >in another thread. But I wanted to share something positive.
> >
> >So now I've said something positive instead of only asking for help :)
> >:)
> >
> >Erik
>
> Good Job Erik!
>
> Best Regards,
> Strahil Nikolov
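As for the OVERLAY variant Erik mentions above, one plausible arrangement (again only a sketch with hypothetical paths) is the loop-mounted squashfs as the read-only lower layer and a directory inside the per-node XFS image as the writable upper layer:

  # Overlay a writable upper directory (kept inside the loop-mounted XFS
  # image, so it sits on a local filesystem rather than directly on NFS)
  # on top of the read-only squashfs lower layer.
  mkdir -p /mnt/root-rw/upper /mnt/root-rw/work
  mount -t overlay overlay \
      -o lowerdir=/mnt/root-ro,upperdir=/mnt/root-rw/upper,workdir=/mnt/root-rw/work \
      /sysroot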
--
Regards,
Hari Gowtham.
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users