On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <[email protected]> wrote:
> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
> <[email protected]> wrote:
> > hi,
> > Quite a few commands to monitor gluster at the moment take almost a
> > second to give output.
>
> Is this at the (most) minimum recommended cluster size?

Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.

> > Some categories of these commands:
> >
> > 1) Any command that needs to do some sort of mount/glfs_init.
> >    Examples: 1) the heal info family of commands, 2) statfs to find
> >    space availability, etc. (On my laptop, on a replica 3 volume with
> >    all local bricks, glfs_init takes 0.3 seconds on average.)
> >
> > 2) glusterd commands that need to wait for the previous command to
> >    unlock. If the previous command is something related to an lvm
> >    snapshot, which takes quite a few seconds, it would be even more
> >    time consuming.
> >
> > Nowadays container workloads have hundreds of volumes, if not
> > thousands. If we want to serve any monitoring solution at this scale
> > (I have seen customers use up to 600 volumes at a time, and it will
> > only get bigger), and let's say collecting metrics takes 2 seconds
> > per volume (taking the worst example, which has all major features
> > enabled, like snapshot/geo-rep/quota etc.), that will mean it takes
> > 20 minutes to collect the metrics of a cluster with 600 volumes.
> > What are the ways in which we can make this number more manageable?
> > I was initially thinking maybe it is possible to get gd2 to execute
> > commands in parallel on different volumes, so potentially we could
> > get this done in ~2 seconds. But quite a few of the metrics need a
> > mount or the equivalent of a mount (glfs_init) to collect different
> > information like statfs, number of pending heals, quota usage, etc.
> > This may lead to high memory usage, as the size of the mounts tends
> > to be high.
>
> I am not sure if starting from the "worst example" (it certainly is
> not) is a good place to start from.

I didn't understand your statement. Are you saying 600 volumes is a worst
example?

> That said, for any environment with that number of disposable volumes,
> what kind of metrics do actually make any sense/impact?

The same metrics you track for long-running volumes. It is just that the
way the metrics are interpreted will be different. On a long-running
volume, you would look at the metrics to find out why the volume has not
been performing as expected over the last hour. Whereas in this case, you
would look at the metrics to find out why volumes that were created and
deleted in the last hour didn't perform as expected.

> > I wanted to seek suggestions from others on how to come to a
> > conclusion about which path to take and what problems to solve.
> >
> > I will be happy to raise github issues based on our conclusions on
> > this mail thread.
> >
> > --
> > Pranith
>
> --
> sankarshan mukhopadhyay
> <https://about.me/sankarshan.mukhopadhyay>

--
Pranith
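For reference on the glfs_init() number above, the latency is easy to
measure directly with libgfapi. A minimal sketch, assuming a placeholder
volume "testvol" served by a glusterd on localhost (build with:
gcc time_init.c -lgfapi):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    struct timespec start, end;

    glfs_t *fs = glfs_new("testvol");   /* placeholder volume name */
    if (!fs) {
        perror("glfs_new");
        return EXIT_FAILURE;
    }

    /* Fetch the volfile from a (placeholder) glusterd on localhost. */
    glfs_set_volfile_server(fs, "tcp", "localhost", 24007);

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (glfs_init(fs) != 0) {           /* the ~0.3 s step in question */
        perror("glfs_init");
        glfs_fini(fs);
        return EXIT_FAILURE;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("glfs_init took %.3f s\n",
           (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec) / 1e9);

    glfs_fini(fs);
    return EXIT_SUCCESS;
}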
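And a sketch of the parallel-collection idea with a bounded worker pool,
assuming a hypothetical collect_volume_metrics() that stands in for the
~2-second per-volume work (glfs_init + statfs + heal counts, etc.). With
16 workers, 600 volumes take roughly (600 / 16) * 2 s = 75 s instead of
20 minutes, and at most 16 mounts are alive at once, which bounds the
memory cost (build with: gcc collect.c -pthread):

#include <pthread.h>
#include <stdio.h>

#define NUM_VOLUMES 600
#define NUM_WORKERS 16          /* at most 16 concurrent "mounts" */

/* Hypothetical per-volume collector standing in for the real work. */
static void collect_volume_metrics(int volume_id)
{
    printf("collected metrics for volume %d\n", volume_id);
}

static int next_volume;         /* shared work-queue cursor */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int v = next_volume++;
        pthread_mutex_unlock(&lock);
        if (v >= NUM_VOLUMES)
            return NULL;
        collect_volume_metrics(v);
    }
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}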
_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel
