On Wed, Jul 25, 2018 at 8:17 PM, John Strunk <[email protected]> wrote:
> To add an additional data point... The operator will need to regularly reconcile the true state of the gluster cluster with the desired state stored in kubernetes. This task will be required frequently (i.e., operator-framework defaults to every 5s even if there are no config changes).
>
> The actual amount of data we will need to query from the cluster is currently TBD and likely significantly affected by the Heketi/GD1 vs. GD2 choice.

Do we have any partial list of data we will gather? Just want to understand what this might entail already... (A couple of rough sketches on what this implies at scale are below the quoted thread.)

> -John
>
> On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <[email protected]> wrote:
>
>> On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <[email protected]> wrote:
>>
>>> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri <[email protected]> wrote:
>>> > hi,
>>> > Quite a few commands to monitor gluster at the moment take almost a second to give output.
>>>
>>> Is this at the (most) minimum recommended cluster size?
>>
>> Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.
>>
>>> > Some categories of these commands:
>>> > 1) Any command that needs to do some sort of mount/glfs_init. Examples: the heal info family of commands, and statfs to find space availability. (On my laptop, on a replica 3 volume with all local bricks, glfs_init takes 0.3 seconds on average.)
>>> > 2) glusterd commands that need to wait for the previous command to unlock. If the previous command is something related to an lvm snapshot, which takes quite a few seconds, it would be even more time consuming.
>>> >
>>> > Nowadays container workloads have hundreds of volumes, if not thousands. If we want to serve any monitoring solution at this scale (I have seen customers use up to 600 volumes at a time, and it will only get bigger), and let's say collecting metrics takes 2 seconds per volume (taking the worst example, which has all major features enabled, like snapshot/geo-rep/quota etc.), that will mean it takes 20 minutes to collect the metrics of a cluster with 600 volumes. What are the ways in which we can make this number more manageable? I was initially thinking it may be possible to get gd2 to execute commands in parallel on different volumes, so potentially we could get this done in ~2 seconds. But quite a few of the metrics need a mount, or the equivalent of a mount (glfs_init), to collect information like statfs, the number of pending heals, quota usage, etc. This may lead to high memory usage, as the size of these mounts tends to be high.
>>>
>>> I am not sure if starting from the "worst example" (it certainly is not) is a good place to start from.
>>
>> I didn't understand your statement. Are you saying 600 volumes is a worst example?
>>
>>> That said, for any environment with that number of disposable volumes, what kind of metrics do actually make any sense/impact?
>>
>> Same metrics you track for long-running volumes. It is just that the way the metrics are interpreted will be different. On a long-running volume, you would look at the metrics and try to find out why the volume has not been giving the expected performance in the last hour. Whereas in this case, you would look at the metrics and find out why volumes that were created and deleted in the last hour didn't give the expected performance.
>>
>>> > I wanted to seek suggestions from others on how to come to a conclusion about which path to take and what problems to solve.
>>> >
>>> > I will be happy to raise github issues based on our conclusions on this mail thread.
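To make the reconcile cadence John mentions concrete, here is a rough, illustrative loop (this is not the actual operator or operator-framework code; the function names and the shape of the state are made up). The point is that the actual-state query runs every cycle, config change or not, so its per-volume cost is what limits how far this scales:

package main

import (
    "context"
    "fmt"
    "time"
)

// Placeholders for the real work: reading the desired state from the custom
// resource and querying the actual state from glusterd/GD2 (volume list,
// heal counts, capacity, ...).
func fetchDesiredState(ctx context.Context) (string, error) { return "desired", nil }
func fetchActualState(ctx context.Context) (string, error)  { return "actual", nil }
func converge(desired, actual string) error                 { return nil }

// reconcileLoop re-checks the cluster every `period`, whether or not the
// configuration changed.
func reconcileLoop(ctx context.Context, period time.Duration) {
    ticker := time.NewTicker(period)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            desired, _ := fetchDesiredState(ctx)
            // This is the expensive step at scale: it has to finish well
            // inside `period` even with hundreds of volumes.
            actual, _ := fetchActualState(ctx)
            if err := converge(desired, actual); err != nil {
                fmt.Println("reconcile error:", err)
            }
        }
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    reconcileLoop(ctx, 5*time.Second) // 5s cadence, per the default John mentions
}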
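On the scale arithmetic from the quoted thread: collected serially at ~2 seconds per volume, 600 volumes is 600 x 2 s = 1,200 s, i.e. the 20 minutes mentioned above. Below is a rough sketch (again illustrative, not gd2 or any existing exporter code) of bounding how many per-volume collections run at once; this is the trade-off between wall-clock time and how many glfs_init-style mounts are held open simultaneously:

package main

import (
    "fmt"
    "sync"
    "time"
)

// collectVolumeMetrics stands in for the expensive per-volume work
// (a glfs_init-style mount plus statfs, pending-heal counts, quota usage, ...).
// The 2-second sleep models the worst-case figure from the thread.
func collectVolumeMetrics(volume string) string {
    time.Sleep(2 * time.Second)
    return volume + ": ok"
}

// collectAll gathers metrics for all volumes, with at most maxInFlight
// collections (and hence mount-like contexts) active at any one time.
func collectAll(volumes []string, maxInFlight int) []string {
    sem := make(chan struct{}, maxInFlight)
    results := make([]string, len(volumes))
    var wg sync.WaitGroup
    for i, v := range volumes {
        wg.Add(1)
        go func(i int, v string) {
            defer wg.Done()
            sem <- struct{}{}        // take a collection slot
            defer func() { <-sem }() // free it when done
            results[i] = collectVolumeMetrics(v)
        }(i, v)
    }
    wg.Wait()
    return results
}

func main() {
    volumes := make([]string, 600)
    for i := range volumes {
        volumes[i] = fmt.Sprintf("vol-%03d", i)
    }
    start := time.Now()
    collectAll(volumes, 50) // 600 volumes / 50 in flight -> ~12 waves of ~2 s each
    fmt.Println("collected", len(volumes), "volumes in", time.Since(start))
}

With 50 collections in flight, the same 600 volumes finish in roughly 24 seconds instead of 20 minutes, and memory is capped at 50 concurrent mounts rather than 600. The right bound depends on how much memory each glfs_init context actually costs, which is something we would need to measure.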
--
Pranith

_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel
