Hi All, adding my two cents: we have two kinds of storage, the first based on SSD (600TB) and the second on spinning disks (20PB).
For the SSD storage I tried GlusterFS and BeeGFS and compared them with Lustre. BeeGFS failed because of ln (hard link) issues at that time (2019). GlusterFS was very promising, but Lustre on ZFS showed better performance. Our 20PB storage runs Lustre on ZFS and we are happy with it. I am trying to write a whitepaper on the different parameters to tweak Lustre performance, from the disks, to ZFS, to Lustre itself. This could help the HPC community, especially in life science.

/Zee
Section head of IT Infrastructure, Centre for Genomic Medicine, KFSHRC, Riyadh

On Mon, Apr 5, 2021 at 6:54 PM Olaf Buitelaar <olaf.buitel...@gmail.com> wrote:
> Hi Erik,
>
> Thanks for sharing your unique use-case and solution. It was very
> interesting to read your write-up.
>
> I agree with the last point under your use case #1: "* Growing volumes,
> replacing bricks, and replacing servers work. However, the process is
> very involved and quirky for us. I have....."
>
> I seem to suffer from similar issues, where glusterd just doesn't want
> to start up correctly the first time; see
> https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
> for one possible cause of not starting up correctly.
> And secondly, for a "molten to the floor" server it would indeed be
> great if gluster, instead of the current replace-brick commands, had
> something to replace a complete host: it would recreate all the bricks
> it expects on that host (or better yet, all missing bricks, in case
> some survived on another RAID/disk) and heal them all.
> Right now this process is quite involved, and sometimes feels a bit
> like performing black magic.
>
> Best Olaf
>
> On Fri, 19 Mar 2021 at 16:10, Erik Jacobson <erik.jacob...@hpe.com> wrote:
>
>> A while back I was asked to make a blog or something similar to discuss
>> the use cases of the team I work on (HPCM cluster management) at HPE.
>>
>> If you are not interested in reading about what I'm up to, just delete
>> this and move on.
>>
>> I really don't have a public blogging mechanism, so I'll just describe
>> what we're up to here. Some of this was posted in some form in the past.
>> Since this contains the raw materials, I could make a wiki-ized version
>> if there were a public place to put it.
>>
>> We currently use gluster in two parts of cluster management.
>>
>> In fact, gluster in our management node infrastructure is helping us
>> provide scaling and consistency to some of the largest clusters in the
>> world, clusters in the TOP100 list. While I can get into trouble by
>> sharing too much, I will just say that trends are continuing and the
>> future may have some exciting announcements on where on the TOP100
>> certain new giant systems may end up in the coming 1-2 years.
>>
>> At HPE, HPCM is the "traditional cluster manager." There is another team
>> that develops a more cloud-like solution; I am not discussing that
>> solution here.
>>
>>
>> Use Case #1: Leader Nodes and Scale Out
>> ------------------------------------------------------------------------------
>> - Why?
>>   * Scale out
>>   * Redundancy (combined with CTDB, any leader can fail)
>>   * Consistency (all servers and compute nodes agree on what the content is)
>>
>> - The cluster manager has an admin or head node and zero or more leader
>>   nodes.
>>
>> - Leader nodes are provisioned in groups of 3 to use distributed
>>   replica-3 volumes (although at least one customer has interest in
>>   replica-5).
>>
>> - We configure a few different volumes for different use cases.
>>
>> - We still use Gluster NFS because, over a year ago, Ganesha was not
>>   working with our workload and we haven't had time to re-test and
>>   engage with the community. No blame - we would also owe making sure
>>   our settings are right.
>>
>> - We use CTDB for a measure of HA and for IP alias management. We use it
>>   instead of pacemaker to reduce complexity.
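>> A leader-group volume of the kind described above can be sketched with
>> the stock gluster CLI. This is only a rough sketch: the hostnames,
>> brick paths, volume name, and shard size below are hypothetical
>> examples, not our production settings.

```shell
# Hypothetical example: one replica-3 leader group (leader1..leader3).
# Create the volume that will hold the squashfs image objects.
gluster volume create images replica 3 \
    leader1:/data/brick1/images \
    leader2:/data/brick1/images \
    leader3:/data/brick1/images

# Shard the volume so large squashfs files spread across bricks
# (the block size here is an example, not a tuned value).
gluster volume set images features.shard on
gluster volume set images features.shard-block-size 64MB

# Keep Gluster NFS (gnfs) enabled, since we export over NFS rather
# than Ganesha (requires a gluster build with gnfs support).
gluster volume set images nfs.disable off

gluster volume start images
```

>> Growing by another group of 3 leaders is then an add-brick of three
>> more bricks to the same volume.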
>>
>> - The volume use cases are:
>>   * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
>>     -> Normally squashfs image files for speed/efficiency, exported
>>        over NFS
>>     -> Expanded ("chrootable") traditional NFS trees for people who
>>        prefer that, but they don't scale as well and are slower to boot
>>     -> Squashfs images sit on a sharded volume, while a traditional
>>        volume is used for the expanded trees
>>   * TFTP/HTTP for network boot/PXE, including miniroot
>>     -> Spread across leaders so that one node is not saturated with
>>        PXE/DHCP requests
>>     -> Miniroot is a "fatter initrd" that has our CM toolchain
>>   * Logs/consoles
>>     -> For traditional logs and consoles (HPCM also uses
>>        elasticsearch/kafka/friends, but we don't put that in gluster)
>>     -> A separate volume so it can have more non-cache-friendly settings
>>   * 4 total volumes (one sharded, one heavily optimized for caching,
>>     one for the ctdb lock, and one traditional for logging/etc.)
>>
>> - Leader setup
>>   * The admin node installs the leaders like any other compute nodes.
>>   * A setup tool then configures the gluster volumes and CTDB.
>>   * When ready, an admin/head node can be engaged with the leaders.
>>   * At that point, certain paths on the admin node become gluster fuse
>>     mounts and bind mounts to gluster fuse mounts.
>>
>> - How images are deployed (squashfs mode)
>>   * A user creates an image using image creation tools that make a
>>     chrootable tree-style image on the admin/head node.
>>   * mksquashfs generates a squashfs image file onto a shared-storage
>>     gluster mount.
>>   * Nodes mount the filesystem holding the squashfs images and then
>>     loop-mount the squashfs as part of the boot process.
>>
>> - How compute nodes are tied to leaders
>>   * We simply have a variable in our database where a human or automated
>>     discovery tools can assign a given node to a given IP alias.
>>     This works better for us than trying to play routing tricks or
>>     load-balancing tricks.
>>   * When compute nodes PXE boot, the DHCP response includes next-server,
>>     and the compute node uses the leader IP alias for the tftp/http
>>     fetch of the boot loader. DHCP config files are on shared storage
>>     to facilitate future scaling of DHCP services.
>>   * ipxe or grub2 network config files then fetch the kernel and initrd.
>>   * The initrd has a small update to load a miniroot (install
>>     environment) which has more tooling.
>>   * The node is installed (for nodes with root disks) or does a network
>>     boot cycle.
>>
>> - Gluster sizing
>>   * We typically state compute nodes per leader, but this is not for
>>     gluster per se. Squashfs image objects are very efficient and would
>>     probably be fine at 2k nodes per leader. Leader nodes provide other
>>     services too, including console logs, system logs, and monitoring.
>>   * Our biggest deployment at a customer site right now has 24 leader
>>     nodes. Bigger systems are coming.
>>
>> - Startup scripts: getting all the gluster mounts and the many bind
>>   mounts used in the solution, and ensuring the gluster mounts and ctdb
>>   lock are available before ctdb starts, was too painful for my brain.
>>   So we have systemd startup scripts that sanity-test and start things
>>   gracefully.
>>
>> - Future: the team is starting to test what 96 leaders (96 gluster
>>   servers) might look like for future exascale systems.
>>
>> - Future: some customers have interest in replica-5 instead of
>>   replica-3. We want to test the performance implications.
>>
>> - Future: move to Ganesha; work with the community if needed.
>>
>> - Future: work with the Gluster community to make gluster fuse mounts
>>   efficient enough to use instead of NFS (this may be easier with image
>>   objects than it was the old way with fully expanded trees for images!)
>>
>> - Challenges:
>>   * Every month someone tells me gluster is going to die because of Red
>>     Hat vs. IBM and I have to justify things. It is getting harder.
>>   * Giant squashfs images fail - mksquashfs reports an error - at
>>     around 90GB on sles15sp2 and sles15sp3; rhel8 does not suffer. We
>>     don't have the bandwidth to dig in right now, but we have one upset
>>     customer. Workarounds were provided to move that customer's
>>     development environment out of the operating system image.
>>   * Since we have our own build and special use cases, we're on our own
>>     for support (by "on our own" I mean no paid support, community help
>>     only). Our complex situations can produce cases that you guys
>>     don't see, and debugging them can take a month or more with the
>>     volunteer nature of the community. Paying for support is harder,
>>     even if it were politically viable, since we support 6 distros and
>>     3 distro providers. Of course, paying for support is never the
>>     preference of management. It might be an interesting thing to
>>     investigate.
>>   * Any gluster update is terror. We don't update much because finding
>>     a gluster version that is stable for all our use cases PLUS being
>>     able to test at scale - which means thousands of nodes - is hard.
>>     We made some internal improvements here where we emulate a
>>     2500-node cluster using virtual machines on a much smaller cluster;
>>     however, it isn't ideal. So we start lagging the community over
>>     time until some problem forces us to refresh, and then we tip-toe
>>     into the update. We most recently updated to gluster79, which
>>     solved several problems related to use case #2 below.
>>   * Due to lack of HW, testing the SU Leader solution is hard because
>>     of the limited number of internal clusters. However, I recently
>>     moved my primary development to my beefy desktop, where I have a
>>     full cluster stack including 3 leader nodes with gluster running in
>>     virtual machines. So we have eliminated an excuse preventing
>>     internal people from playing with the solution stack.
>>   * Growing volumes, replacing bricks, and replacing servers work.
>>     However, the process is very involved and quirky for us.
>>     I have complicated code that has to do more than I'd really like it
>>     to, simply to wheel in a complete leader replacement for a failed
>>     one. Even with our tools, we often end up with some glusterd's not
>>     starting right and have to restart a few times to get a node or
>>     brick replacement going. I wish the process were as simple as
>>     running a single command or set of commands and having it do the
>>     right thing.
>>     -> Finding two leaders to get a complete set of peer files
>>     -> Wedging in the replacement node with the UUID the node in that
>>        position had before
>>     -> Not accidentally giving a gluster peer its own peer file
>>     -> Then going through an involved replace-brick procedure
>>     -> "replace" vs. "reset"
>>     -> A complicated dance with xattrs that I still don't understand
>>     -> Ensuring indices and .glusterfs pre-exist with the right
>>        permissions
>>     My hope is that I just misunderstand; I will bring this up in a
>>     future discussion.
>>
>>
>> Use Case #2: Admin/Head Node High Availability
>> ------------------------------------------------------------------------------
>> - We used to have an admin node HA solution that consisted of two
>>   servers and an external storage device. A VM provided by the two
>>   servers was used for the "real admin node".
>>
>> - This solution was expensive due to the external storage.
>>
>> - This solution was less optimal due to not having true quorum.
>>
>> - Building on our gluster experience, we removed the external storage
>>   and added a 3rd server. (Due to some previous experience with DRBD,
>>   we elected to stay with gluster here.)
>>
>> - We build a gluster volume shared across the 3 servers, sharded, which
>>   primarily holds a virtual machine image file used by the admin node VM.
>>
>> - The physical nodes use bonding for network redundancy and bridges to
>>   feed the network into the VM.
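>> A rough sketch of that admin-VM storage setup (the hostnames, volume
>> name, mount point, and image size below are hypothetical examples):

```shell
# Hypothetical example: sharded replica-3 volume across the 3 servers.
gluster volume create adminvm replica 3 \
    host1:/data/brick1/adminvm \
    host2:/data/brick1/adminvm \
    host3:/data/brick1/adminvm
gluster volume set adminvm features.shard on
gluster volume start adminvm

# FUSE-mount the volume and create the single large VM image on it
# (we use the fuse mount rather than rebuilding qemu with libgfapi).
mkdir -p /adminvm
mount -t glusterfs host1:/adminvm /adminvm
qemu-img create -f raw /adminvm/admin.img 3T
```

>> The libvirt VM definition then simply points at the image file on the
>> fuse mount.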
>>
>> - We use pacemaker in this case to manage the setup.
>>
>> - It's pretty simple: pacemaker rules manage a VirtualDomain instance,
>>   and a simple ssh monitor makes sure we can get into it.
>>
>> - We typically have a 3-12T single image sitting on the sharded gluster
>>   shared storage, used by the virtual machine that forms the true admin
>>   node.
>>
>> - We set this up with manual instructions, but tooling is coming soon
>>   to aid in automated setup of this solution.
>>
>> - This solution is actively in use in at least 4 very large
>>   supercomputers.
>>
>> - I am impressed by the speed of the solution on the sharded volume. VM
>>   image creation using libvirt against the image file hosted on a
>>   sharded gluster volume works slick. (We use the fuse mount because we
>>   don't want to build our own qemu/libvirt, which would be needed at
>>   least for SLES, and maybe RHEL too, since we have our own gluster
>>   build.)
>>
>> - Challenges:
>>   * Suddenly not being able to boot was a terror for us (the VM would,
>>     at random times, only see the disk size as the size of a single
>>     shard).
>>     -> Thankfully the community helped guide us to gluster79, which
>>        resolved it.
>>   * People keep asking for a 2-node version, but I push back.
>>     Especially with gluster, but honestly with other solutions too:
>>     don't cheap out on arbitration is what I try to tell people.
>>
>>
>> Network Utilization
>> ------------------------------------------------------------------------------
>> - We typically have 2x 10G bonds on leaders.
>> - For simplicity, we co-mingle gluster and compute node traffic.
>> - It is very rare that we approach full 20G utilization, even in very
>>   intensive operations.
>> - Newer solutions may increase the speed of the bonds, but not due to a
>>   current pressing need.
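>> The bond-plus-bridge layout described above might look roughly like
>> this with NetworkManager (the connection names, interface names, and
>> LACP bond mode are hypothetical examples; actual NICs and switch
>> support will differ):

```shell
# Hypothetical example: LACP bond of two 10G ports, enslaved to a
# bridge that feeds the network into the VM.
nmcli con add type bridge con-name br0 ifname br0
nmcli con add type bond con-name bond0 ifname bond0 \
    bond.options "mode=802.3ad,miimon=100" master br0
nmcli con add type ethernet con-name bond0-p1 ifname ens1f0 master bond0
nmcli con add type ethernet con-name bond0-p2 ifname ens1f1 master bond0
nmcli con up br0
```

>> The VM's virtual NIC then attaches to br0.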
>> - Locality challenge:
>>   * For future exascale systems, we may need to become concerned about
>>     locality.
>>   * Certain compute nodes are far closer to some gluster servers than
>>     others.
>>   * The gluster servers themselves need to talk among themselves, and
>>     could be stretched across the topology.
>>   * We have tools to monitor switch link utilization and so far have
>>     not hit a scary zone.
>>   * This is somewhat complicated by fault tolerance: it would be sad to
>>     place the leaders such that one bad PDU loses quorum because all 3
>>     leaders of the same replica-3 were on the same PDU.
>>   * But spreading them out may have locality implications.
>>   * This is a future area of study for us. We have various ideas in
>>     mind if a problem strikes.
>> ________
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://meet.google.com/cpu-eiue-hvk
>> Gluster-users mailing list
>> Gluster-users@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users