Hi Amar,

I haven't had time to review the changes in the experimental branch yet, but here are some comments about these ideas...

On 01/09/17 07:27, Amar Tumballi wrote:
Disclaimer: This email is long and took significant time to write. Please take the time to read it, review it and give feedback, so we can get some metrics-related tasks done for Gluster 4.0.

---
** History **

To understand what is happening inside a GlusterFS process, over the years we have opened many bugs, coded a few things around statedump, and put some effort into the io-stats translator to improve Gluster's monitoring capabilities.

But surely more is required! Some glimpses of it are captured in [1], [2], [3] and [4]. I also sent an email to this group [5] about the possibilities of capturing this information.

** Current problem **

When we talk about metrics or monitoring, we have to consider handing this data to a tool that can preserve readings taken periodically; without a time series, no metrics will make sense! So the first challenge is how to get the metrics out. Should getting them out of each process require interaction with 'glusterd', or should we use signals? This leads us to *challenge #1*.

One problem I see here is that we will have multiple bricks and multiple clients (including FUSE and gfapi).

I assume we want to be able to monitor whole volume performance (aggregate values of all mount points), specific mount performance, and even specific brick performance.

The signal approach seems quite difficult to me here, especially for gfapi-based clients. Even for FUSE mounts and brick processes we would need to connect to each host where one of these processes runs and send the signal there, and some clients may not be prepared to be accessed remotely in an easy way.

Using glusterd, this problem could be minimized, but I'm not sure the interface would be easy to implement (basically because we would need some kind of filtering syntax to avoid huge outputs), and the output could be complex for other tools to parse, especially considering that the amount of data could be significant and will change as translators are added or changed.

I propose a third approach. It's based on a virtual directory similar to /sys and /proc on Linux. We already have /.meta in Gluster. We could extend it so that it exposes data from each mount point (FUSE or gfapi) and each brick. Then we could define an API that allows each xlator to publish information in that directory in a simple way.

Using this approach, monitoring tools can check only the data they are interested in by mounting the volume like any other client and reading the desired values.

To implement this, we could centralize all statistics capture in libglusterfs itself and create a new translator (or reuse meta) to gather this information from libglusterfs and publish it into the virtual directory (we would probably need a server-side and a client-side xlator to be able to combine data from all mounts and bricks).
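To make this a bit more concrete, here is a minimal standalone sketch (plain C, not actual Gluster code; names such as gf_metric_register() and the ".meta/metrics" layout are hypothetical) of a central registry in libglusterfs that a meta-like xlator could walk to expose one virtual file per metric:

```
/* Minimal sketch (not actual Gluster code) of a central metrics registry
 * in libglusterfs that a meta-like xlator could walk to publish one
 * virtual file per metric. All names here are hypothetical. */
#include <inttypes.h>
#include <stdio.h>
#include <pthread.h>

#define GF_METRIC_MAX 1024

typedef struct {
    char     xlator[64];  /* owning translator, e.g. "patchy-write-behind" */
    char     name[128];   /* becomes the virtual file name, e.g. "fop.write.count" */
    uint64_t value;       /* counters only, so dumping is a simple read */
} gf_metric_t;

static gf_metric_t gf_metrics[GF_METRIC_MAX];
static int gf_metric_count;
static pthread_mutex_t gf_metric_lock = PTHREAD_MUTEX_INITIALIZER;

/* An xlator registers a metric once at init time and gets back a slot
 * it can update afterwards (plain stores here for brevity). */
gf_metric_t *gf_metric_register(const char *xlator, const char *name)
{
    gf_metric_t *m = NULL;

    pthread_mutex_lock(&gf_metric_lock);
    if (gf_metric_count < GF_METRIC_MAX) {
        m = &gf_metrics[gf_metric_count++];
        snprintf(m->xlator, sizeof(m->xlator), "%s", xlator);
        snprintf(m->name, sizeof(m->name), "%s", name);
        m->value = 0;
    }
    pthread_mutex_unlock(&gf_metric_lock);
    return m;
}

/* The publishing xlator (meta or a new one) would iterate the registry
 * to build the virtual directory: one directory per xlator, one file
 * per metric, the file content being the current value. */
void gf_metric_publish_all(FILE *out)
{
    for (int i = 0; i < gf_metric_count; i++)
        fprintf(out, ".meta/metrics/%s/%s -> %" PRIu64 "\n",
                gf_metrics[i].xlator, gf_metrics[i].name,
                gf_metrics[i].value);
}

int main(void)
{
    gf_metric_t *writes = gf_metric_register("patchy-write-behind",
                                             "fop.write.count");
    writes->value++;                 /* updated from the fop path */
    gf_metric_publish_all(stdout);   /* walked by the meta-like xlator */
    return 0;
}
```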


Next: should we depend on io-stats to do the reporting? If yes, how do we get information from between any two layers? Should we place io-stats between all the nodes of the translator graph?

I wouldn't depend on io-stats for reporting all the data. Monitoring seems to me a deeper thing than what a single translator can do.

Using the virtual directory approach, io-stats can place its statistics there, but it doesn't need to be aware of all other possible statistics from other xlators because each one will report its own statistics independently.

Or should we utilize the STACK_WIND/UNWIND framework to get the details? This is our *challenge #2*.

I think that the Gluster core itself (basically libglusterfs) should keep its own details on global things like this. These details could also be published in the virtual directory. From my point of view, io-stats should be left to provide global timings for the fops, or be merged with the STACK_WIND/UNWIND framework and removed as an xlator.


Once the above decision is taken, the next question is: what about 'metrics' from other translators? Who gives them out (i.e., dumps them)? Why do we need something similar to statedump, and can't we read the info from statedump itself?

I think it would be better and easier to move the information from the statedump to the virtual directory instead of trying to use the statedump to report everything.

But when we say 'metrics', we should have a key and a number associated with it; statedump has a lot more, and no fixed format. If it is different from statedump, then what is our answer for how translator code gives out metrics? This is our *challenge #3*.

Using the virtual directory structure, our key would be a specific file name in some directory that represents the hierarchical structure of the volume (xlators), and the value would be its contents.

Using this approach we could even allow some virtual files to be writable to trigger some action inside the whole volume, a specific mount or a brick, but this doesn't need to be considered right now.
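As an illustration, assuming a hypothetical layout under .meta/metrics (the exact hierarchy is still to be defined), a monitoring tool would just be another client reading files:

```
/* Minimal sketch of how a monitoring tool could read one metric:
 * mount the volume like any client and read a virtual file. The path
 * below is hypothetical; the real layout would depend on how the
 * meta-like xlator organizes the hierarchy. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path =
        "/mnt/patchy/.meta/metrics/patchy-write-behind/fop.write.count";
    char buf[64];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fgets(buf, sizeof(buf), f)) {
        /* the key is the file path itself, the value is the file content */
        printf("%s = %llu\n", path, strtoull(buf, NULL, 10));
    }
    fclose(f);
    return 0;
}
```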


If we get a solution to the above challenges, then I guess we are in decent shape for further development. Let's go through them one by one, in detail.

** Problems and proposed solutions **

*a) How to dump metrics data?*

Currently, I propose the signal handler approach, as it gives us control over which processes we capture information from, and will be much faster than communicating through another tool. Also, considering we need these metrics taken every 10 seconds or so, we need an efficient way to get them out.

Probably this is not enough. One clear example is multiplexed bricks: we only have a single process, so a signal will dump information about all of them. How will we be able to get information from only a single brick? We can process all the output, but that is unnecessary work when we only want a small piece of information.


But even there we have challenges, because we have already assigned both the USR1 and USR2 signal handlers: one for statedump and the other for toggling latency monitoring, respectively. It makes sense to continue having statedump use USR1, but toggling options should technically (and for correctness) be handled by glusterd volume-set options, and there should be a better way to handle it through our 'reconfigure()' framework during graph switch. A proposal was sent in GitHub issue #303 [6].

If we are good with the above proposal, then we can use USR2 for the metrics dump. The next issue is the format of the file itself, which we will discuss at the end of the email.
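A minimal sketch of what the USR2 path could look like (standalone C, not the actual glusterfsd code; gf_dump_metrics() and the dump path are hypothetical): the handler only sets a flag, and the dump itself happens outside signal context, since stdio is not async-signal-safe.

```
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t dump_requested;

static void metrics_signal_handler(int sig)
{
    (void)sig;
    dump_requested = 1;     /* real work happens outside the handler */
}

/* Hypothetical helper: writes the current counters in "key value" form. */
static void gf_dump_metrics(void)
{
    FILE *f = fopen("/var/run/gluster/metrics/dump.txt", "w");
    if (!f)
        return;
    fprintf(f, "# dumped on SIGUSR2\n");
    fprintf(f, "hypothetical.fop.write.count 42\n");
    fclose(f);
}

int main(void)
{
    struct sigaction sa = { .sa_handler = metrics_signal_handler };

    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);

    for (;;) {              /* stands in for the process main loop */
        if (dump_requested) {
            dump_requested = 0;
            gf_dump_metrics();
        }
        sleep(1);
    }
    return 0;
}
```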

NOTE: The above approach is already implemented in the 'experimental' branch, excluding the handling of [6].

*b) Where to measure the latency and fop counts?*

One possible way is to load io-stats between all the nodes, but it has its own limitations. Mainly: how do we configure options in each of these translator instances, and will having so many translators slow down operations? (i.e., it creates one extra 'frame' for every fop, and in a graph of 20 xlators that is 20 extra frame creations for a single fop).

As I said previously, I don't like this approach either.


I propose we handle this in the 'STACK_WIND/UNWIND' macros themselves, and provide a placeholder to store all this data in the translator structure itself. This will be cleaner, and no changes are required in the code base other than in 'stack.h' (and some in 'xlator.h').

I agree.
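A rough sketch of the idea (standalone C, not the real stack.h; the macro and field names are made up): wrap the wind call so it bumps a per-fop counter kept in the target xlator's structure, which the publishing side can later read.

```
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

enum { GF_FOP_WRITE, GF_FOP_READ, GF_FOP_MAX };

typedef struct xlator {
    const char *name;
    uint64_t    fop_count[GF_FOP_MAX];   /* the proposed placeholder */
} xlator_t;

/* In the real macro this would also record latency by stamping the
 * frame at wind time and diffing at unwind time. */
#define STACK_WIND_COUNTED(xl, fop_enum, call)                         \
    do {                                                               \
        __sync_fetch_and_add(&(xl)->fop_count[fop_enum], 1);           \
        call;                                                          \
    } while (0)

static void posix_writev(xlator_t *this)
{
    printf("%s: writev handled\n", this->name);
}

int main(void)
{
    xlator_t posix_xl = { .name = "patchy-posix" };

    STACK_WIND_COUNTED(&posix_xl, GF_FOP_WRITE, posix_writev(&posix_xl));
    STACK_WIND_COUNTED(&posix_xl, GF_FOP_WRITE, posix_writev(&posix_xl));

    printf("%s writes: %" PRIu64 "\n", posix_xl.name,
           posix_xl.fop_count[GF_FOP_WRITE]);
    return 0;
}
```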


Also, we can provide an 'option monitoring enable' (or disable) option as a default option for every translator, and handle it at xlator_init() time itself. (This is not a blocker for 4.0, but good to have.) The idea is proposed in GitHub issue #304 [7].

I'm not sure if this is really necessary. As I understand it, monitoring will be based exclusively on counters, and updating a counter is really fast. Adding an option to disable it will mean that the code needs to check whether the option is enabled before updating the counters, which is slower.

One thing we could do, however, is add options to the xlator that publishes the data, telling it which statistics to show in the virtual directory. This way we can globally ignore statistics reported by some xlator if we want, without needing to put specific code into each translator to enable or disable it.


NOTE: This approach is already working pretty well in the 'experimental' branch, excluding [7]. Depending on feedback, we can improve it further.

*c) Framework for xlators to provide private metrics*

One possible solution is to use the statedump functions. But to cause the least disruption to existing code, I propose two new xlator methods, 'dump_metrics()' and 'reset_metrics()', which can be dlopen()'d into the xlator structure.

If we create a framework for metrics, I would prefer that each xlator register its metrics with the framework. This way there's no need to add functions to each xlator; dump and reset will be done based on the registered metrics.


'dump_metrics()' dumps the private metrics in the expected format and will be called from the global dump-metrics framework; 'reset_metrics()' would be called from a CLI command when someone wants to restart metrics from 0 to check or validate a few things in a running cluster. This helps debuggability.
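For illustration only, here is a sketch of what the two methods could look like for some xlator's private counters (the names and signatures are not the final API; in the real thing they would take an xlator_t * and be resolved from the xlator's shared object like the other methods):

```
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

typedef struct my_private {
    uint64_t cache_hits;      /* example private counters */
    uint64_t cache_misses;
} my_private_t;

/* Called by the global dump-metrics framework: emit "key value" lines. */
static int dump_metrics(my_private_t *priv, FILE *out)
{
    fprintf(out, "xlator.example.cache.hits %" PRIu64 "\n", priv->cache_hits);
    fprintf(out, "xlator.example.cache.misses %" PRIu64 "\n", priv->cache_misses);
    return 0;
}

/* Called from a CLI command to restart metrics from zero. */
static int reset_metrics(my_private_t *priv)
{
    priv->cache_hits = 0;
    priv->cache_misses = 0;
    return 0;
}

int main(void)
{
    my_private_t priv = { .cache_hits = 10, .cache_misses = 2 };

    dump_metrics(&priv, stdout);
    reset_metrics(&priv);
    dump_metrics(&priv, stdout);
    return 0;
}
```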

Further feedback welcome.

NOTE: Sample code is already implemented in the 'experimental' branch, where the protocol/server xlator uses this framework to dump metrics from the RPC layer and client connections.

*d) Format of the 'metrics' file*

If you want plottable data on a graph, you need a key (a string) and a value (a number), collected over time. So this file should output data for monitoring systems, not for debuggability; we have 'statedump' for debuggability.

So I propose a plain-text file, where data would be dumped as shown below.

I agree. If necessary, we could trivially extract the values we want from the virtual directory and convert them to a plain-text file in the desired form.


```
# anything starting from # would be treated as comment.
<key><space><value>
# anything after the value would be ignored.
```
Any better solutions are welcome. Ideally, we should keep this friendly for external projects to consume, like Tendrl [8], Graphite, Prometheus, etc. Also note that once we agree on the format, it will be very hard to change, as external projects will depend on it.
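For example, a dump in this format could look like the following (the key names here are made up for illustration, not what the 'experimental' branch actually emits):

```
# glusterfs metrics, dumped on SIGUSR2
fop.write.count 120934
fop.write.latency.avg_usec 187
conn.client.active 42
```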

I would like to hear the feedback from people who are experienced with monitoring systems here.

NOTE: The above format works fine with the 'glustermetrics' project [9] and is working decently on the 'experimental' branch.

------

** Discussions **

Let me know how you all want to take the discussion forward.

Should we go to GitHub and discuss each issue there? Should I rebase and send the current patches from 'experimental' to the 'master' branch and discuss them in our review system? Or should we continue over email here?

Regards,
Amar

References:

[1] - https://github.com/gluster/glusterfs/issues/137
[2] - https://github.com/gluster/glusterfs/issues/141
[3] - https://github.com/gluster/glusterfs/issues/275
[4] - https://github.com/gluster/glusterfs/issues/168
[5] - http://lists.gluster.org/pipermail/maintainers/2017-August/002954.html (last email of the thread).
[6] - https://github.com/gluster/glusterfs/issues/303
[7] - https://github.com/gluster/glusterfs/issues/304
[8] - https://github.com/Tendrl
[9] - https://github.com/amarts/glustermetrics

--
Amar Tumballi (amarts)


_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

