Hi Amar,

I haven't had time to review the changes in the experimental branch yet, but here are some comments about these ideas...

On 01/09/17 07:27, Amar Tumballi wrote:
Disclaimer: This email is long and took significant time to write. Please take the time to read it, review it and give feedback, so we can get some metrics-related tasks done for Gluster 4.0.

---
** History **

To understand what is happening inside a GlusterFS process, over the years we have opened many bugs, coded a few things around statedump, and put some effort into the io-stats translator to improve Gluster's monitoring capabilities.

But surely more is required! Some glimpses of it are captured in [1], [2], [3] and [4]. I also sent an email to this group [5] about the possibilities of capturing this information.

** Current problem **

When we talk about metrics or monitoring, we have to consider handing this data to a tool that can preserve readings taken periodically; without a time series, no metrics will make sense! So the first challenge is how to get the metrics out. Should getting them out of each process require interaction with 'glusterd', or should we use signals? This leads us to *challenge #1*.

One problem I see here is that we will have multiple bricks and multiple clients (including FUSE and gfapi).

I assume we want to be able to monitor whole volume performance (aggregate values of all mount points), specific mount performance, and even specific brick performance.

The signal approach seems quite difficult to me here, especially for gfapi-based clients. Even for FUSE mounts and brick processes we would need to connect to each host where one of these processes runs and send the signal there, and some clients may not be prepared to be accessed remotely in an easy way.

Using glusterd, this problem could be minimized, but I'm not sure the interface would be easy to implement (basically because we would need some kind of filtering syntax to avoid huge outputs), and the output could be complex for other tools to parse, especially considering that the amount of data could be significant and will change as translators are added or changed.

I propose a third approach. It's based on a virtual directory similar to /sys and /proc on Linux. We already have /.meta in Gluster. We could extend it so that it exposes data from each mount point (FUSE or gfapi) and each brick. Then we could define an API that allows each xlator to publish information in that directory in a simple way.

Using this approach, monitoring tools can check only the data they are interested in by mounting the volume like any other client and reading the desired values.

To implement this, we could centralize all statistics capture in libglusterfs itself and create a new translator (or reuse meta) to gather this information from libglusterfs and publish it into the virtual directory (we would probably need a server-side and a client-side xlator to be able to combine data from all mounts and bricks).
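To make this a bit more concrete, here is a minimal standalone sketch (plain C, not actual Gluster code; names such as gf_metric_register() and the ".meta/metrics" layout are hypothetical) of a central registry in libglusterfs that a meta-like xlator could walk to expose one virtual file per metric:

```
/* Minimal sketch (not actual Gluster code) of a central metrics registry
 * in libglusterfs that a meta-like xlator could walk to publish one
 * virtual file per metric. All names here are hypothetical. */
#include <inttypes.h>
#include <stdio.h>
#include <pthread.h>

#define GF_METRIC_MAX 1024

typedef struct {
    char     xlator[64];  /* owning translator, e.g. "patchy-write-behind" */
    char     name[128];   /* becomes the virtual file name, e.g. "fop.write.count" */
    uint64_t value;       /* counters only, so dumping is a simple read */
} gf_metric_t;

static gf_metric_t gf_metrics[GF_METRIC_MAX];
static int gf_metric_count;
static pthread_mutex_t gf_metric_lock = PTHREAD_MUTEX_INITIALIZER;

/* An xlator registers a metric once at init time and gets back a slot
 * it can update afterwards (plain stores here for brevity). */
gf_metric_t *gf_metric_register(const char *xlator, const char *name)
{
    gf_metric_t *m = NULL;

    pthread_mutex_lock(&gf_metric_lock);
    if (gf_metric_count < GF_METRIC_MAX) {
        m = &gf_metrics[gf_metric_count++];
        snprintf(m->xlator, sizeof(m->xlator), "%s", xlator);
        snprintf(m->name, sizeof(m->name), "%s", name);
        m->value = 0;
    }
    pthread_mutex_unlock(&gf_metric_lock);
    return m;
}

/* The publishing xlator (meta or a new one) would iterate the registry
 * to build the virtual directory: one directory per xlator, one file
 * per metric, the file content being the current value. */
void gf_metric_publish_all(FILE *out)
{
    for (int i = 0; i < gf_metric_count; i++)
        fprintf(out, ".meta/metrics/%s/%s -> %" PRIu64 "\n",
                gf_metrics[i].xlator, gf_metrics[i].name,
                gf_metrics[i].value);
}

int main(void)
{
    gf_metric_t *writes = gf_metric_register("patchy-write-behind",
                                             "fop.write.count");
    writes->value++;                 /* updated from the fop path */
    gf_metric_publish_all(stdout);   /* walked by the meta-like xlator */
    return 0;
}
```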


Next: should we depend on io-stats to do the reporting? If yes, how do we get information from between any two layers? Should we place io-stats between all the nodes of the translator graph?

I wouldn't depend on io-stats for reporting all the data. Monitoring seems to me a deeper thing than what a single translator can do.

Using the virtual directory approach, io-stats can place its statistics there, but it doesn't need to be aware of all other possible statistics from other xlators because each one will report its own statistics independently.

Or should we utilize the STACK_WIND/UNWIND framework to get the details? This is our *challenge #2*.

I think that the Gluster core itself (basically libglusterfs) should keep its own details on global things like this. These details could also be published in the virtual directory. From my point of view, io-stats should be left to provide global timings for the fops, or be merged with the STACK_WIND/UNWIND framework and removed as an xlator.


Once the above decision is taken, the next question is: what about 'metrics' from other translators? Who gives them out (i.e., dumps them)? Why do we need something similar to statedump, and can't we read the info from statedump itself?

I think it would be better and easier to move the information from the statedump to the virtual directory instead of trying to use the statedump to report everything.

But when we say 'metrics', we should have a key and a number associated with it; statedump has a lot more, and no fixed format. If it is different from statedump, then what is our answer for how translator code gives out metrics? This is our *challenge #3*.

Using the virtual directory structure, our key would be a specific file name in some directory that represents the hierarchical structure of the volume (xlators), and the value would be its contents.

Using this approach we could even allow some virtual files to be writable to trigger some action inside the whole volume, a specific mount or a brick, but this doesn't need to be considered right now.
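As an illustration, assuming a hypothetical layout under .meta/metrics (the exact hierarchy is still to be defined), a monitoring tool would just be another client reading files:

```
/* Minimal sketch of how a monitoring tool could read one metric:
 * mount the volume like any client and read a virtual file. The path
 * below is hypothetical; the real layout would depend on how the
 * meta-like xlator organizes the hierarchy. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path =
        "/mnt/patchy/.meta/metrics/patchy-write-behind/fop.write.count";
    char buf[64];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f)) {
        /* the key is the file path itself, the value is the file content */
        printf("%s = %llu\n", path, strtoull(buf, NULL, 10));
    }
    fclose(f);
    return 0;
}
```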


If we get a solution to the above challenges, then I guess we are in decent shape for further development. Let's go through them one by one, in detail.

** Problems and proposed solutions **

*a) How to dump metrics data?*

Currently, I propose the signal handler approach, as it gives us control over which processes we capture information from, and will be much faster than communicating through another tool. Also, considering we need these metrics taken every 10 seconds or so, we need an efficient way to get them out.

Probably this is not enough. One clear example is multiplexed bricks: we only have a single process, so a signal will dump information about all of them. How will we be able to get information from only a single brick? We can process all the output, but that is unnecessary work when we only want a small piece of information.


But even there we have challenges, because we have already assigned both the USR1 and USR2 signal handlers: one for statedump and the other for toggling latency monitoring, respectively. It makes sense to continue having statedump use USR1, but toggling options should technically (and for correctness) be handled by glusterd volume-set options, and there should be a better way to handle it through our 'reconfigure()' framework during graph switch. A proposal was sent in GitHub issue #303 [6].

If we are good with the above proposal, then we can use USR2 for the metrics dump. The next issue is the format of the file itself, which we will discuss at the end of the email.
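A minimal sketch of what the USR2 path could look like (standalone C, not the actual glusterfsd code; gf_dump_metrics() and the dump path are hypothetical): the handler only sets a flag, and the dump itself happens outside signal context, since stdio is not async-signal-safe.

```
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t dump_requested;

static void metrics_signal_handler(int sig)
{
    (void)sig;
    dump_requested = 1;     /* real work happens outside the handler */
}

/* Hypothetical helper: writes the current counters in "key value" form. */
static void gf_dump_metrics(void)
{
    FILE *f = fopen("/var/run/gluster/metrics/dump.txt", "w");
    if (!f)
        return;
    fprintf(f, "# dumped on SIGUSR2\n");
    fprintf(f, "hypothetical.fop.write.count 42\n");
    fclose(f);
}

int main(void)
{
    struct sigaction sa = { .sa_handler = metrics_signal_handler };

    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);

    for (;;) {              /* stands in for the process main loop */
        if (dump_requested) {
            dump_requested = 0;
            gf_dump_metrics();
        }
        sleep(1);
    }
    return 0;
}
```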

NOTE: The above approach is already implemented in the 'experimental' branch, excluding the handling of [6].

*b) Where to measure the latency and fop counts?*

One possible way is to load io-stats between all the nodes, but it has its own limitations. Mainly: how do we configure options in each of these translator instances, and will having so many translators slow down operations? (i.e., it creates one extra 'frame' for every fop, and in a graph of 20 xlators that is 20 extra frame creations for a single fop).

As I said previously, I don't like this approach either.


I propose we handle this in the 'STACK_WIND/UNWIND' macros themselves, and provide a placeholder to store all this data in the translator structure itself. This will be cleaner, and no changes are required in the code base other than in 'stack.h' (and some in 'xlator.h').

I agree.
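A rough sketch of the idea (standalone C, not the real stack.h; the macro and field names are made up): wrap the wind call so it bumps a per-fop counter kept in the target xlator's structure, which the publishing side can later read.

```
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

enum { GF_FOP_WRITE, GF_FOP_READ, GF_FOP_MAX };

typedef struct xlator {
    const char *name;
    uint64_t    fop_count[GF_FOP_MAX];   /* the proposed placeholder */
} xlator_t;

/* In the real macro this would also record latency by stamping the
 * frame at wind time and diffing at unwind time. */
#define STACK_WIND_COUNTED(xl, fop_enum, call)                         \
    do {                                                               \
        __sync_fetch_and_add(&(xl)->fop_count[fop_enum], 1);           \
        call;                                                          \
    } while (0)

static void posix_writev(xlator_t *this)
{
    printf("%s: writev handled\n", this->name);
}

int main(void)
{
    xlator_t posix_xl = { .name = "patchy-posix" };

    STACK_WIND_COUNTED(&posix_xl, GF_FOP_WRITE, posix_writev(&posix_xl));
    STACK_WIND_COUNTED(&posix_xl, GF_FOP_WRITE, posix_writev(&posix_xl));

    printf("%s writes: %" PRIu64 "\n", posix_xl.name,
           posix_xl.fop_count[GF_FOP_WRITE]);
    return 0;
}
```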


Also, we can provide an 'option monitoring enable' (or disable) option as a default option for every translator, and handle it at xlator_init() time itself. (This is not a blocker for 4.0, but good to have.) The idea is proposed in GitHub issue #304 [7].

I'm not sure if this is really necessary. As I understand it, monitoring will be based exclusively on counters, and updating a counter is really fast. Adding an option to disable it will mean that the code needs to check whether the option is enabled before updating the counters, which is slower.

One thing we could do, however, is add options to the xlator that publishes the data, telling it which statistics to show in the virtual directory. This way we can globally ignore statistics reported by some xlator if we want, without needing to put specific code into each translator to enable or disable it.


NOTE: This approach is already working pretty well in the 'experimental' branch, excluding [7]. Depending on feedback, we can improve it further.

*c) Framework for xlators to provide private metrics*

One possible solution is to use the statedump functions. But to cause the least disruption to existing code, I propose two new xlator methods, 'dump_metrics()' and 'reset_metrics()', which can be dlopen()'d into the xlator structure.

If we create a framework for metrics, I would prefer that each xlator register its metrics with the framework. This way there's no need to add functions to each xlator; dump and reset will be done based on the registered metrics.


'dump_metrics()' dumps the private metrics in the expected format and will be called from the global dump-metrics framework; 'reset_metrics()' would be called from a CLI command when someone wants to restart metrics from 0 to check or validate a few things in a running cluster. This helps debuggability.
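For illustration only, here is a sketch of what the two methods could look like for some xlator's private counters (the names and signatures are not the final API; in the real thing they would take an xlator_t * and be resolved from the xlator's shared object like the other methods):

```
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

typedef struct my_private {
    uint64_t cache_hits;      /* example private counters */
    uint64_t cache_misses;
} my_private_t;

/* Called by the global dump-metrics framework: emit "key value" lines. */
static int dump_metrics(my_private_t *priv, FILE *out)
{
    fprintf(out, "xlator.example.cache.hits %" PRIu64 "\n", priv->cache_hits);
    fprintf(out, "xlator.example.cache.misses %" PRIu64 "\n", priv->cache_misses);
    return 0;
}

/* Called from a CLI command to restart metrics from zero. */
static int reset_metrics(my_private_t *priv)
{
    priv->cache_hits = 0;
    priv->cache_misses = 0;
    return 0;
}

int main(void)
{
    my_private_t priv = { .cache_hits = 10, .cache_misses = 2 };

    dump_metrics(&priv, stdout);
    reset_metrics(&priv);
    dump_metrics(&priv, stdout);
    return 0;
}
```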

Further feedback welcome.

NOTE: Sample code is already implemented in the 'experimental' branch, where the protocol/server xlator uses this framework to dump metrics from the RPC layer and client connections.

*d) Format of the 'metrics' file*

If you want plottable data on a graph, you need a key (a string) and a value (a number), collected over time. So this file should output data for monitoring systems, not for debuggability; we have 'statedump' for debuggability.

So I propose a plain-text file, where data would be dumped as shown below.

I agree. If necessary, we could trivially extract the values we want from the virtual directory and convert them to a plain-text file in the desired form.


```
# anything starting from # would be treated as comment.
<key><space><value>
# anything after the value would be ignored.
```
Any better solutions are welcome. Ideally, we should keep this friendly for external projects to consume, like Tendrl [8], Graphite, Prometheus, etc. Also note that once we agree on the format, it will be very hard to change, as external projects will depend on it.
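For example, a dump in this format could look like the following (the key names here are made up for illustration, not what the 'experimental' branch actually emits):

```
# glusterfs metrics, dumped on SIGUSR2
fop.write.count 120934
fop.write.latency.avg_usec 187
conn.client.active 42
```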

I would like to hear the feedback from people who are experienced with monitoring systems here.

NOTE: The above format works fine with the 'glustermetrics' project [9] and is working decently on the 'experimental' branch.

------

** Discussions **

Let me know how you all want to take the discussion forward.

Should we go to GitHub and discuss each issue there? Should I rebase and send the current patches from 'experimental' to the 'master' branch and discuss them in our review system? Or should we continue over email here?

Regards,
Amar

References:

[1] - https://github.com/gluster/glusterfs/issues/137
[2] - https://github.com/gluster/glusterfs/issues/141
[3] - https://github.com/gluster/glusterfs/issues/275
[4] - https://github.com/gluster/glusterfs/issues/168
[5] - http://lists.gluster.org/pipermail/maintainers/2017-August/002954.html (last email of the thread).
[6] - https://github.com/gluster/glusterfs/issues/303
[7] - https://github.com/gluster/glusterfs/issues/304
[8] - https://github.com/Tendrl
[9] - https://github.com/amarts/glustermetrics

--
Amar Tumballi (amarts)


_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

