wu-sheng opened a new issue #2773: [Proposal] OAP Federation Mode
URL: https://github.com/apache/skywalking/issues/2773

# Federation

## Definition
Federation is a new mode in which the SkyWalking OAP server can run. It is designed to support monitoring and metrics aggregation (no topology) across clouds and regions.

## Typical Scenario
Service A could be deployed in both Cloud C and Cloud N. The OPS team should set up two OAP clusters, one per cloud, to get topology, metrics, traces, and alarms in each cloud. At the same time, the OPS team wants to see the overview metrics of Service A across both clouds and set alarms based on them. This is where Federation comes in.

## Core
Basically, Federation is just a particular mode of the OAP server cluster, so it shares most of the current OAP codebase, such as modularization, receivers, OAL, and pluggable storage. The new parts of a Federation OAP are as follows.

### `federation-forward` module and provider
A new module and provider need to be added. SkyWalking recently added the `exporter` module and a gRPC implementation, but that is not the same as what Federation needs.
1. The existing `exporter` reports the total value of each metric.
1. `Federation` requires reporting the increment of each metric together with its details (such as the latency matrix behind p99), not just the value, because the upstream Federation has to do the aggregation.

Also, I recommend placing the `federation-forward` worker at `MetricsPersistentWorker#L102`, before the worker queries the database and combines the result with the stored data. Here we need a cloned copy of the metrics data, to avoid concurrent modification.

### New Federation Receiver and Federation Protocol
In Federation mode, the downstream SkyWalking OAP cluster talks to the upstream SkyWalking cluster, so we need a new protocol (gRPC preferred) to report metrics with details. The protocol should also be extensible, because the metrics are generated from OAL function definitions, such as `CPMMetrics`.
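To illustrate why the protocol has to carry details rather than final values, here is a minimal sketch in Java. It is not SkyWalking code: the class `FederationIncrementSketch`, the `LatencyIncrement` type, its fields, and the sample metric name and time bucket are all assumptions made up for this example. Each downstream cluster reports a latency histogram (bucket upper bound in ms mapped to a request count), and the upstream side sums the buckets before computing a percentile.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not part of SkyWalking. */
final class FederationIncrementSketch {

    /** One increment from a downstream cluster: latency bucket bound (ms) -> request count. */
    static final class LatencyIncrement {
        final String metricsName;              // illustrative metric name
        final long timeBucket;                 // illustrative minute-level time bucket
        final TreeMap<Integer, Long> buckets;

        LatencyIncrement(String metricsName, long timeBucket, Map<Integer, Long> buckets) {
            this.metricsName = metricsName;
            this.timeBucket = timeBucket;
            this.buckets = new TreeMap<>(buckets); // copy, so the caller's map is never shared
        }
    }

    /** Upstream side: add the bucket counts of one increment into the running aggregate. */
    static void merge(TreeMap<Integer, Long> aggregate, LatencyIncrement increment) {
        for (Map.Entry<Integer, Long> e : increment.buckets.entrySet()) {
            aggregate.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    /** Smallest bucket bound whose cumulative count reaches ceil(total * p / 100). */
    static int percentile(TreeMap<Integer, Long> histogram, int p) {
        long total = 0;
        for (long count : histogram.values()) {
            total += count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long rank = (total * p + 99) / 100;
        long seen = 0;
        for (Map.Entry<Integer, Long> e : histogram.entrySet()) {
            seen += e.getValue();
            if (seen >= rank) {
                return e.getKey();
            }
        }
        return histogram.lastKey();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Long> cloudC = new TreeMap<>();
        cloudC.put(100, 99L);
        cloudC.put(500, 1L);
        TreeMap<Integer, Long> cloudN = new TreeMap<>();
        cloudN.put(100, 97L);
        cloudN.put(500, 1L);
        cloudN.put(2000, 2L);

        TreeMap<Integer, Long> upstream = new TreeMap<>();
        merge(upstream, new LatencyIncrement("service_p99", 201906101200L, cloudC));
        merge(upstream, new LatencyIncrement("service_p99", 201906101200L, cloudN));

        System.out.println("cloud C p99 = " + percentile(cloudC, 99));
        System.out.println("cloud N p99 = " + percentile(cloudN, 99));
        System.out.println("merged  p99 = " + percentile(upstream, 99));
    }
}
```

With the sample data in `main`, the per-cluster p99 values are 100 and 2000, while the merged histogram gives 500. No combination of the two final values alone would recover that, which is the reason for sending the increment with its details.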
### New OAL source, scope, and function
Because the existing functions focus on aggregating detail data, new functions need to be added to aggregate metrics. For example, to aggregate `PercentMetrics`, we need a `PercentMetricsAdd` function.

## Future
Federation could be deployed at multiple levels, such as
1. Set up Federation for a region to support multiple clusters.
1. Set up a second-level Federation for a data center.
1. Set up a third-level Federation for the whole country.

## Robustness and Performance
Like other parts of the SkyWalking design, Federation forward just tries its best to deliver the metrics upstream, with no 100% guarantee. Federation forward could consider supporting an MQ or data file buffer to improve this. At stage 1, I prefer to do gRPC forwarding only, because even if some data is lost, only several seconds of metrics are lost. But it should set up extension points, such as SPI or the module provider mechanism, to make extension easier. At the least, there should be a DataCarrier (blocking) queue in `federation-forward` to make sure gRPC stream mode works.

## TODO fix
In `PersistenceTimer#L52`, the persistence execution interval is static; it needs to be made configurable.

____
I look forward to receiving feedback about this new concept.
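As a rough illustration of the best-effort forwarding buffer proposed above for `federation-forward`, the following Java sketch uses the JDK's `ArrayBlockingQueue` as a stand-in for DataCarrier. Everything here is an assumption for the example: the class `ForwardBufferSketch`, the `Sender` interface, and the use of plain strings as metrics payloads are hypothetical, and a real sender would write to a gRPC stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch, not part of SkyWalking; the JDK queue stands in for DataCarrier. */
final class ForwardBufferSketch {

    /** Hypothetical sender; a real one would push the batch to the upstream cluster. */
    interface Sender {
        void send(List<String> batch) throws Exception;
    }

    private final BlockingQueue<String> queue;
    private long dropped;

    ForwardBufferSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: blocks while the buffer is full; returns false if interrupted. */
    boolean enqueue(String metrics) {
        try {
            queue.put(metrics);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }

    /** Consumer side: drain the buffer and try to send once; on failure the batch is dropped. */
    int flush(Sender sender) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        if (batch.isEmpty()) {
            return 0;
        }
        try {
            sender.send(batch);
            return batch.size();
        } catch (Exception e) {
            dropped += batch.size(); // best effort: no retry, no delivery guarantee
            return 0;
        }
    }

    long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        ForwardBufferSketch buffer = new ForwardBufferSketch(16);
        buffer.enqueue("metrics-1");
        buffer.enqueue("metrics-2");
        System.out.println("sent = " + buffer.flush(batch -> { }));

        buffer.enqueue("metrics-3");
        int sent = buffer.flush(batch -> {
            throw new java.io.IOException("upstream unavailable");
        });
        System.out.println("sent = " + sent + ", dropped = " + buffer.dropped());
    }
}
```

Dropping a failed batch instead of retrying mirrors the stated stage 1 goal, where losing a few seconds of metrics is acceptable. An MQ or data file buffer could later replace the in-memory queue behind the same kind of extension point.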
