On Sat, 30 Mar 2019 at 08:06, Vijay Bellur <vbel...@redhat.com> wrote:
> On Fri, Mar 29, 2019 at 6:42 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>
>> All,
>>
>> As many of you already know, the design with which GlusterD (referred to from here on as GD1) was implemented has some fundamental scalability bottlenecks, especially around the way it handshakes configuration metadata and replicates it across all the peers. The initial design assumed that GD1 would only ever have to deal with a few tens of nodes/peers and volumes, so the magnitude of the scaling bottleneck this design could introduce was never realized or estimated.
>>
>> Ever since Gluster was adopted as one of the storage backends in the container storage world, the business needs have changed: the requirements have grown from tens of volumes to hundreds, and now to thousands. We introduced brick multiplexing, which gave some relief by providing better control over the memory footprint when a node hosts a large number of bricks/volumes, but this wasn't enough. In one of our (I represent Red Hat) customers' deployments, on a 3-node cluster, we saw that whenever the number of volumes goes beyond ~1500 and one of the storage pods gets rebooted for some reason, the handshaking process takes so long (not only the n x n peer handshaking, but also iterating over every volume, building up the dictionary and sending it over the wire) that the hard timeout of an RPC request, which is 10 minutes, expires, and the cluster goes into a state where none of the CLI commands go through; they all get stuck.
>>
>> With this problem around and growing demand for volume scalability, we started looking into these areas of GD1 to improve (a) volume scalability and (b) node scalability. While (b) is a separate topic for another day, today we're going to focus on (a).
>>
>> Taking a deep dive into this volume scalability problem, we realized that most of the delay in the friend handshake and in exchanging handshake packets between peers came from iterating over the in-memory data structures of the volumes and putting them into the dictionary sequentially. With ~2k volumes, the function glusterd_add_volumes_to_export_dict() was quite costly and the most time consuming. In pstack output captured while a glusterd instance was being restarted in one of the pods, we could always see control iterating in this function. In our testing on a 3-node cluster (16 vCPU, 32 GB RAM), this function alone took almost *7.5 minutes*. The bottleneck is primarily the sequential iteration over volumes, sequentially updating the dictionary with a lot of (un)necessary keys.
>>
>> So what we first tried was moving this loop to a worker-thread model, so that multiple threads could each process a range of the volume list rather than all of it, giving us more parallelism within glusterd. With that alone we still didn't see any improvement, and the primary reason was that our dictionary APIs need locking. So the next idea was to make the threads work on separate dictionaries and, once all the volumes have been iterated, merge those dictionaries into a single one.
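>> To make the shape of that idea concrete, here is a rough, self-contained sketch (not the actual patch: kv_table and build_slice below are simplified stand-ins for glusterd's dict_t and volume iteration, and the thread count is arbitrary). Each worker fills a private table for its own slice of the volumes, so the workers never contend on a shared dictionary lock, and the tables are merged only once at the end:
>>
>> /* Fan-out/fan-in sketch: each worker fills a private table for its
>>  * slice of the volume list (so there is no lock contention), and the
>>  * tables are merged into one at the end. kv_table is an illustrative
>>  * stand-in, not glusterd's real dict_t API. Build with: cc -pthread */
>> #include <pthread.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> #define NTHREADS 4
>> #define NVOLUMES 2000
>>
>> typedef struct { char **keys, **vals; size_t count; } kv_table;
>> typedef struct { size_t start, end; kv_table tbl; } worker_arg;
>>
>> static void *build_slice(void *p)
>> {
>>     worker_arg *w = p;
>>     char key[64], val[64];
>>
>>     for (size_t i = w->start; i < w->end; i++) {
>>         /* In GD1 this step would serialize one volinfo (options,
>>          * bricks, snaps, ...) into the handshake dictionary. */
>>         snprintf(key, sizeof(key), "volume%zu.name", i);
>>         snprintf(val, sizeof(val), "vol-%zu", i);
>>         w->tbl.keys[w->tbl.count] = strdup(key);
>>         w->tbl.vals[w->tbl.count] = strdup(val);
>>         w->tbl.count++;
>>     }
>>     return NULL;
>> }
>>
>> int main(void)
>> {
>>     pthread_t tid[NTHREADS];
>>     worker_arg arg[NTHREADS];
>>     size_t per = NVOLUMES / NTHREADS;
>>
>>     /* Fan out: one private table and one slice of volumes per worker. */
>>     for (int t = 0; t < NTHREADS; t++) {
>>         arg[t].start = t * per;
>>         arg[t].end = (t == NTHREADS - 1) ? NVOLUMES : (size_t)(t + 1) * per;
>>         size_t cap = arg[t].end - arg[t].start;
>>         arg[t].tbl.keys = malloc(cap * sizeof(char *));
>>         arg[t].tbl.vals = malloc(cap * sizeof(char *));
>>         arg[t].tbl.count = 0;
>>         pthread_create(&tid[t], NULL, build_slice, &arg[t]);
>>     }
>>
>>     /* Fan in: join the workers, then merge the private tables into
>>      * the single dictionary that would go out on the wire. */
>>     kv_table merged;
>>     merged.keys = malloc(NVOLUMES * sizeof(char *));
>>     merged.vals = malloc(NVOLUMES * sizeof(char *));
>>     merged.count = 0;
>>     for (int t = 0; t < NTHREADS; t++) {
>>         pthread_join(tid[t], NULL);
>>         for (size_t i = 0; i < arg[t].tbl.count; i++) {
>>             merged.keys[merged.count] = arg[t].tbl.keys[i];
>>             merged.vals[merged.count] = arg[t].tbl.vals[i];
>>             merged.count++;
>>         }
>>     }
>>
>>     printf("merged %zu keys from %d worker tables\n", merged.count, NTHREADS);
>>     return 0;
>> }
>>
>> The point of the per-thread tables is that the expensive part (iterating the volumes and building the keys) runs without any shared-dictionary locking; the serialized merge happens only once, at the end.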
>> Along with these changes there are a few other improvements: skipping the comparison of snapshots when no snapshot is available, and excluding tiering keys when the volume type is not tier. With this enhancement [1], the overall time taken to build up the dictionary from the in-memory structures comes down to *2 minutes 18 seconds*, close to a *~3x* improvement. We firmly believe that with this improvement we should be able to scale up to 2000 volumes on a 3-node cluster, which would benefit our users by supporting more PVCs/volumes.
>>
>> Patch [1] is still in testing and might undergo a few minor changes, but you are welcome to review and comment on it. We plan to get this work completed, tested and released in glusterfs-7.
>>
>> Last but not least, I'd like to give a shout-out to Mohit Agrawal (in cc) for all the work done on this over the last few days. Thank you Mohit!
>
> This sounds good! Thank you for the update on this work.
>
> Did you ever consider using etcd with GD1 (as it is used with GD2)?

Honestly, I have thought about it a few times, but the primary reason for not going in that direction was bandwidth: such an improvement is neither a short-term nor a tiny task, and we had the GD2 tasks on our plate too. If any other contributors are willing to take this up, I am more than happy to collaborate and provide guidance.

> Having etcd as a backing store for configuration could remove the expensive handshaking as well as the need to persist configuration on every node. I am interested in understanding whether you are aware of any drawbacks with that approach. If there haven't been any thoughts in that direction, it might be a fun experiment to try.
>
> Thanks,
> Vijay

--
- Atin (atinm)