Just to add on: we are using Gluster alongside our main storage (Lustre) for our k8s cluster.
On Wed, Mar 24, 2021 at 4:33 AM Ewen Chan <alpha754...@hotmail.com> wrote:

> Erik:
>
> I just want to say that I really appreciate you sharing this information with us.
>
> I don't think that my personal home lab micro-cluster environment will get complicated enough to need a virtualized testing/Gluster development setup like yours, but on the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what I am trying to do with Gluster is quite different from how most people deploy Gluster for production systems.
>
> If I wanted to splurge, I'd get a second set of IB cables so that the high-speed interconnect layer can be split: jobs would run on one layer of the InfiniBand fabric whilst storage/Gluster runs on another layer.
>
> But for that, I'll have to revamp my entire microcluster, so there are no plans to do that just yet.
>
> Thank you.
>
> Sincerely,
> Ewen
>
> ------------------------------
> *From:* gluster-users-boun...@gluster.org <gluster-users-boun...@gluster.org> on behalf of Erik Jacobson <erik.jacob...@hpe.com>
> *Sent:* March 23, 2021 10:43 AM
> *To:* Diego Zuccato <diego.zucc...@unibo.it>
> *Cc:* gluster-users@gluster.org <gluster-users@gluster.org>
> *Subject:* Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
>
> > I still have to grasp the "leader node" concept.
> > Weren't gluster nodes "peers"? Or by "leader" do you mean that it's
> > mentioned in the fstab entry like
> >    /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> > while the peer list includes l1,l2,l3 and a bunch of other nodes?
>
> Right, it's a list of 24 peers. The 24 peers are split into a 3x24 replicated/distributed setup for the volumes. They also have entries for themselves as clients in /etc/fstab. I'll dump some volume info at the end of this.
>
> > > So we would have 24 leader nodes, each leader would have a disk serving
> > > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > > one is for logs, and one is heavily optimized for non-object expanded
> > > tree NFS). The term "disk" is loose.
> > That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> > 36 bricks per node).
>
> I have one dedicated "disk" (could be a disk, a RAID LUN, or a single SSD) and 4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just for the lock and has a single file.
>
> > > Specs of a leader node at a customer site:
> > > * 256G RAM
> > Glip! 256G for 4 bricks... No wonder I have had trouble running 26
> > bricks in 64GB RAM... :)
>
> I'm not an expert in memory pools or how they would be impacted by more peers. I had to do a little research, and I think what you're after is whether I can run "gluster volume status cm_shared mem" on a real cluster that has a decent node count. I will see if I can do that.
>
> TEST ENV INFO for those who care
> --------------------------------
> Here is some info on my own test environment, which you can skip.
>
> I have the environment duplicated on my desktop using virtual machines and it runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache from the optimized volumes, but other than that it is fine. In my development environment, the gluster disk is a 40G qcow2 image.
>
> Cache sizes were changed from 8G to 100M to fit in the VMs.
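For anyone who wants to make the same kind of adjustment on their own test volumes, here is a minimal sketch of the commands involved (the volume name cm_shared is simply the one mentioned earlier in this thread; substitute your own, and these are not the exact commands Erik ran):

    # check per-brick memory pool usage for a volume (run on any peer)
    gluster volume status cm_shared mem

    # shrink the io-cache so the bricks fit in small test VMs, then confirm
    gluster volume set cm_shared performance.cache-size 100MB
    gluster volume get cm_shared performance.cache-size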
> XML snips for memory, cpus:
>    <domain type='kvm' id='24'>
>      <name>cm-leader1</name>
>      <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
>      <memory unit='KiB'>3268608</memory>
>      <currentMemory unit='KiB'>3268608</currentMemory>
>      <vcpu placement='static'>2</vcpu>
>      <resource>
>    ......
>
> I have 1 admin (head) node VM, 3 leader-node VMs like the above, and one test compute node in my development environment.
>
> My desktop where I test this cluster stack is a beefy but not brand-new desktop:
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> Address sizes:         46 bits physical, 48 bits virtual
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 79
> Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> Stepping:              1
> CPU MHz:               2594.333
> CPU max MHz:           3000.0000
> CPU min MHz:           1200.0000
> BogoMIPS:              4190.22
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              20480K
> NUMA node0 CPU(s):     0-15
> <SNIP>
>
> (Not that it matters, but this is an HP Z640 Workstation.)
>
> 128G of memory (a lot for a desktop, I know, but I think 64G would work since I also run a Windows 10 VM environment for unrelated reasons).
>
> I was able to find a MegaRAID in the lab a few years ago, so I have 4 drives in a MegaRAID and carve off a separate volume for the VM disk images. It has a cache, so that's also more beefy than a normal desktop. (On the other hand, I have no SSDs. I may experiment with that some day, but things work so well now that I'm tempted to leave it until something croaks. :)
>
> I keep all VMs for the test cluster in "Unsafe cache mode" since there is no true data to worry about and it makes the test cases faster.
>
> So I am able to test a complete cluster management stack, including the 3 gluster leader servers, an admin node, and a compute node, all on my desktop using virtual machines and shared networks within libvirt/qemu.
>
> It is so much easier to do development when you don't have to reserve scarce test clusters and compete with people. I can do 90% of my cluster development work this way. Things fall over when I need to care about BMCs/iLOs or need to do performance testing, of course. Then I move to real hardware and play the hunger-games-of-internal-test-resources :) :)
>
> I mention all this just to show that beefy servers are not needed, nor is the memory usage high. I'm not continually swapping or anything like that.
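As an aside on the "Unsafe cache mode" mentioned above: in the libvirt domain XML it is usually expressed as cache='unsafe' on the disk driver element. A minimal sketch, matching the XML snips above in style (the image path and target device here are made-up examples, not taken from Erik's setup):

   <disk type='file' device='disk'>
     <!-- cache='unsafe' skips host-side flushes; only suitable for throwaway test VMs -->
     <driver name='qemu' type='qcow2' cache='unsafe'/>
     <source file='/var/lib/libvirt/images/cm-leader1.qcow2'/>
     <target dev='vda' bus='virtio'/>
   </disk>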
> Configuration Info from Real Machine
> ------------------------------------
>
> Some info on an active 3x3 cluster. 2738 compute nodes.
>
> The most active volume here is "cm_obj_sharded". It is where the image objects live, and this cluster uses image objects for the compute nodes' root filesystems. I changed the IP addresses by hand (in case I made an error doing that).
>
> Memory status for volume : cm_obj_sharded
> (mallinfo per brick; every brick path is <IP>:/data/brick_cm_obj_sharded)
>
> Brick      Arena     Ordblks  Smblks  Hblks  Hblkhd    Usmblks  Fsmblks  Uordblks  Fordblks  Keepcost
> 10.1.0.5   20676608  2077     518     17     17350656  0        53728    5223376   15453232  127616
> 10.1.0.6   21409792  2424     604     17     17350656  0        62304    5468096   15941696  127616
> 10.1.0.7   24240128  2471     563     17     17350656  0        58832    5565360   18674768  127616
> 10.1.0.8   22454272  2575     528     17     17350656  0        53920    5583712   16870560  127616
> 10.1.0.9   22835200  2493     570     17     17350656  0        59728    5424992   17410208  127616
> 10.1.0.10  23085056  2717     697     17     17350656  0        74016    5631520   17453536  127616
> 10.1.0.11  26537984  3044     985     17     17350656  0        103056   5702592   20835392  127616
> 10.1.0.12  23556096  2658     735     17     17350656  0        78720    5568736   17987360  127616
> 10.1.0.13  26050560  3064     926     17     17350656  0        96816    5807312   20243248  127616
>
> Volume configuration details for this one:
>
> Volume Name: cm_obj_sharded
> Type: Distributed-Replicate
> Volume ID: 76c30b65-7194-4af2-80f7-bf876f426e5a
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 3 = 9
> Transport-type: tcp
> Bricks:
> Brick1: 10.1.0.5:/data/brick_cm_obj_sharded
> Brick2: 10.1.0.6:/data/brick_cm_obj_sharded
> Brick3: 10.1.0.7:/data/brick_cm_obj_sharded
> Brick4: 10.1.0.8:/data/brick_cm_obj_sharded
> Brick5: 10.1.0.9:/data/brick_cm_obj_sharded
> Brick6: 10.1.0.10:/data/brick_cm_obj_sharded
> Brick7: 10.1.0.11:/data/brick_cm_obj_sharded
> Brick8: 10.1.0.12:/data/brick_cm_obj_sharded
> Brick9: 10.1.0.13:/data/brick_cm_obj_sharded
> Options Reconfigured:
> nfs.rpc-auth-allow: 10.1.*
> auth.allow: 10.1.*
> performance.client-io-threads: on
> nfs.disable: off
> storage.fips-mode-rchecksum: on
> transport.address-family: inet
> performance.cache-size: 8GB
> performance.flush-behind: on
> performance.cache-refresh-timeout: 60
> performance.nfs.io-cache: on
> nfs.nlm: off
> nfs.export-volumes: on
> nfs.export-dirs: on
> nfs.exports-auth-enable: on
> transport.listen-backlog: 16384
> nfs.mount-rmtab: /-
> performance.io-thread-count: 32
> server.event-threads: 32
> nfs.auth-refresh-interval-sec: 360
> nfs.auth-cache-ttl-sec: 360
> features.shard: on
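For readers who want to reproduce a layout of that shape, a rough sketch of the CLI follows. It uses the hosts and brick paths from the listing above, assumes the peers are already probed and the brick filesystems exist, and is not necessarily the exact sequence used on that cluster:

    # 9 bricks with replica 3 -> a 3 x 3 distributed-replicate volume
    # (consecutive bricks on the command line form each replica set)
    gluster volume create cm_obj_sharded replica 3 \
        10.1.0.5:/data/brick_cm_obj_sharded \
        10.1.0.6:/data/brick_cm_obj_sharded \
        10.1.0.7:/data/brick_cm_obj_sharded \
        10.1.0.8:/data/brick_cm_obj_sharded \
        10.1.0.9:/data/brick_cm_obj_sharded \
        10.1.0.10:/data/brick_cm_obj_sharded \
        10.1.0.11:/data/brick_cm_obj_sharded \
        10.1.0.12:/data/brick_cm_obj_sharded \
        10.1.0.13:/data/brick_cm_obj_sharded

    # a couple of the options from the listing, applied after creation
    gluster volume set cm_obj_sharded features.shard on
    gluster volume set cm_obj_sharded performance.cache-size 8GB
    gluster volume start cm_obj_sharded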
> There are 3 other volumes (this is the only sharded one). I can provide more info if desired.
>
> Typical boot times for 3k nodes and 9 leaders, ignoring BIOS setup time, are 2-5 minutes. The power of the image objects is what makes that fast. An expanded-tree (traditional) NFS export, where the whole directory tree is exported and used file by file, would be more like 9-12 minutes.
>
> Erik
________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users