[skywalking] branch master updated: Add STAM paper to doc and Academy menu for listing important articles. (#8599)

wusheng Sat, 26 Feb 2022 06:43:24 -0800

This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking.git



The following commit(s) were added to refs/heads/master by this push:
     new 7e83ca9  Add STAM paper to doc and Academy menu for listing important 
articles. (#8599)
7e83ca9 is described below

commit 7e83ca9e59f7bf6ab8ad1ee7672800d636e6150a
Author: 吴晟 Wu Sheng <[email protected]>
AuthorDate: Sat Feb 26 22:43:07 2022 +0800

    Add STAM paper to doc and Academy menu for listing important articles. 
(#8599)
---
 CHANGES.md              |   8 ++-
 README.md               |   4 --
 docs/en/academy/list.md |  13 ++++
 docs/en/papers/stam.md  | 154 ++++++++++++++++++++++++++++++++++++++++++++++++
 docs/menu.yml           |   4 ++
 5 files changed, 176 insertions(+), 7 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 1c7f1dc..62940dd 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -72,12 +72,14 @@ Release Notes.
 
 * Remove unused jars (log4j-api.jar) in classpath.
 * Bump up netty version to fix CVE.
-* add Database Connection pool metric.
+* Add Database Connection pool metric.
 
 #### Documentation
 
-* update backend-alarm.md doc, support op "=" to "==".
-* update backend-meter.md doc .
+* Update backend-alarm.md doc, support op "=" to "==".
+* Update backend-meter.md doc .
+* Add <STAM: Enhancing Topology Auto Detection For A Highly Distributed and 
Large-Scale Application System> paper with CN version.
+* Add Academy menu for recommending articles.
 
 All issues and pull requests are 
[here](https://github.com/apache/skywalking/milestone/112?closed=1)
 
diff --git a/README.md b/README.md
index c89788c..f4e59bb 100644
--- a/README.md
+++ b/README.md
@@ -51,10 +51,6 @@ for better performance. Read [the paper of 
STAM](https://wu-sheng.github.io/STAM
 
 # Documentation
 - [Official documentation](https://skywalking.apache.org/docs/#SkyWalking)
-- [The paper of STAM](https://wu-sheng.github.io/STAM/), Streaming Topology 
Analysis Method.
-- 
[Blog](https://skywalking.apache.org/blog/2020-04-13-apache-skywalking-profiling/)
 about Use Profiling to Fix the Blind Spot of Distributed Tracing
-- 
[Blog](https://skywalking.apache.org/blog/2020-12-03-obs-service-mesh-with-sw-and-als/)
 about observing Istio + Envoy service mesh with ALS solution.
-- 
[Blog](https://skywalking.apache.org/blog/obs-service-mesh-vm-with-sw-and-als/) 
about observing Istio + Envoy service mesh with ALS Metadata-Exchange mechanism 
(in VMs and / or Kubernetes).
 
 NOTICE, SkyWalking 8.0+ uses [v3 protocols](docs/en/protocols/README.md). They 
are incompatible with previous releases.
 
diff --git a/docs/en/academy/list.md b/docs/en/academy/list.md
new file mode 100644
index 0000000..05d7cad
--- /dev/null
+++ b/docs/en/academy/list.md
@@ -0,0 +1,13 @@
+# Academy
+
+Academy is an article/video list recommended by the committer team.
+
+- [STAM Paper](../papers/stam.md) about the fundamental theory of SkyWalking 
tracing models.
+
+- 
[Blog](https://skywalking.apache.org/blog/2022-01-24-scaling-with-apache-skywalking/)
 about Scaling SkyWalking server automatically in kubernetes.
+
+- 
[Blog](https://skywalking.apache.org/blog/2020-04-13-apache-skywalking-profiling/)
 about Use Profiling to Fix the Blind Spot of Distributed Tracing.
+
+- 
[Blog](https://skywalking.apache.org/blog/2020-12-03-obs-service-mesh-with-sw-and-als/)
 about observing Istio + Envoy service mesh with ALS solution.
+
+- 
[Blog](https://skywalking.apache.org/blog/obs-service-mesh-vm-with-sw-and-als/) 
about observing Istio + Envoy service mesh with ALS Metadata-Exchange mechanism 
(in VMs and / or Kubernetes).
\ No newline at end of file
diff --git a/docs/en/papers/stam.md b/docs/en/papers/stam.md
new file mode 100644
index 0000000..67f907a
--- /dev/null
+++ b/docs/en/papers/stam.md
@@ -0,0 +1,154 @@
+# STAM: Enhancing Topology Auto Detection For A Highly Distributed and 
Large-Scale Application System
+
+- Sheng Wu 吴 晟
+- [email protected]
+
+### Editor's note
+This paper was written by Sheng Wu, project founder, in 2017, to describe the 
fundamental theory of all current
+agent core concepts.
+Readers can learn why SkyWalking agents are significantly different from 
other tracing systems and 
+the description in the Dapper[1] paper.
+
+# Abstract
+Monitoring, visualizing and troubleshooting a large-scale distributed system 
is a major challenge. One common tool used today is the distributed tracing 
system (e.g., Google Dapper)[1], and detecting topology and metrics based on 
the tracing data. One big limitation of today’s topology detection is that the 
analysis depends on aggregating the client-side and server-side tracing spans 
in a given time window to generate the dependency of services. This causes more 
latency and memory use, b [...]
+
+In this paper, we present the STAM, Streaming Topology Analysis Method. In 
STAM, we could use auto instrumentation or a manual instrumentation mechanism 
to intercept and manipulate RPC at both client-side and server-side. In the 
case of auto instrumentation, STAM manipulates application codes at runtime, 
such as Java agent. As such, this monitoring system doesn’t require any source 
code changes from the application development team or RPC framework development 
team. The STAM injects an R [...]
+
+The STAM has been implemented in the Apache SkyWalking[2], an open source APM 
(application performance monitoring system) project of the Apache Software 
Foundation, which is widely used in many big enterprises[3] including Alibaba, 
Huawei, Tencent, Didi, Xiaomi, China Mobile and other enterprises (airlines, 
financial institutions and others) to support their large-scale distributed 
systems in the production environment. It reduces the load and memory cost 
significantly, with better horiz [...]
+
+# Introduction
+Monitoring a highly distributed system, especially one with a micro-service 
architecture, is very complex. Many RPCs, including HTTP, gRPC, MQ, Cache, and 
Database accesses, are behind a single client-side request. Allowing the IT 
team to understand the dependency relationships among thousands of services is 
the key feature and first step for observability of a whole distributed system. 
A distributed tracing system is capable of collecting traces, including all 
distributed request paths.  [...]
+
+Strong timeliness is required to match the mutability of distributed 
application system dependency relationship, including service level and service 
instance level dependency.
+
+A Service is a logical group of instances which have the same functions or codes.
+
+A Service Instance is usually an OS level process, such as a JVM process. The 
relationships between services and instances are mutable, depending on the 
configuration, codes and network status. The dependency could change over time.
+
+<p align="center">
+<img src="https://skywalking.apache.org/papers/STAM/dapper-span.png"/>
+<br/>
+Figure 1, Generated spans in traditional Dapper based tracing system.
+</p>
+
+The span model in the Dapper paper and existing tracing systems, such as Zipkin 
instrumenting mode[9], just propagates the span id to the server side. Due to 
this model,
+dependency analysis requires a certain time window. The tracing spans are 
collected at both client- and server-sides, because the relationship is 
recorded. Due to that, the analysis process has to wait for the client and 
server spans to match in the same time window, in order to output the result, 
Service A depending on Service B. So, this time window must be over the 
duration of this RPC request; otherwise, the conclusion will be lost. This 
condition makes the analysis would not react [...]
+Also, because of the window-based design, if one side involves a long 
duration task, it can’t easily achieve consistent accuracy. In order to 
make the analysis as fast as possible, the analysis period is less than 5 
minutes, but some spans can’t match their parent or children if the analysis is 
incomplete or crosses two time windows. Even if we added a mechanism to 
process the spans left in the previous stages, still some would have to be 
abandoned to keep the dataset size and me [...]
+
+In the STAM, we introduce new span and context propagation models, with the 
new analysis method. These new models add the peer network address (IP or 
hostname) used at client side, client service instance name and client service 
name, into the context propagation model. Then it passes the RPC call from 
client to server, just as the original trace id and span id in the existing 
tracing system, and collects it in the server-side span. The new analysis 
method can easily generate the clien [...]
+
+# New Span Model and Context Model
+The traditional span of a tracing system includes the following fields 
[1][6][10].
+- A trace id to represent the whole trace.
+- A span id to represent the current span.
+- An operation name to describe what operation this span did.
+- A start timestamp.
+- A finish timestamp.
+- Service and Service Instance names of current span.
+- A set of zero or more key:value Span Tags.
+- A set of zero or more Span Logs, each of which is itself a key:value map 
paired with a timestamp.
+- References to zero or more causally related Spans. Reference includes the 
parent span id and trace id.
+
+In the new span model of STAM we add the following fields in the span.
+
+**Span type**. Enumeration, including exit, local and entry. Entry and Exit 
spans are used in a networking related library. Entry spans represent a 
server-side networking library, such as Apache Tomcat[7]. Exit spans represent 
the client-side networking library, such as Apache HttpComponents [8].
+
+**Peer Network Address**. Remote address, suitable for use in exit and entry 
spans. In Exit spans, the peer network address is the address used by the client 
library to access the server.
+
+These fields are usually optional in many tracing systems. But in 
STAM, we require them in all RPC cases.
+
+**Context Model** is used to propagate the client-side information to 
server-side carried by the original RPC call, usually in the header, such as 
HTTP header or MQ header. In the old design, it carries the trace id and span 
id of client-side span. In the STAM, we enhance this model, adding the parent 
service name, parent service instance name and peer of exit span. The names 
could be literal strings. All these extra fields will help to remove the block 
of streaming analysis. Compared to [...]
+
+The changes to these two models eliminate the time windows in the analysis 
process. Server-side span analysis enhances the context aware capability.
+
+# New Topology Analysis Method
+The new topology analysis method at the core of STAM is processing the span in 
stream mode.
+The analysis of the server-side span, also named entry span, includes the 
parent service name, parent service instance name and peer of exit span. So the 
analysis process could establish the following results.
+1.     Set the current service and instance names as the alias of the peer in 
the exit span used at the client side. `Peer network address <-> service name` and 
`peer network address <-> Service instance name` aliases are created. These two are 
synchronized to all analysis nodes and persisted in the storage, allowing more 
analysis processors to have this alias information.
+2.     Generate relationships of `parent service name -> current service name` 
and `parent service instance name -> current service instance name`, unless 
there is another different `Peer network address <-> Service Instance Name` 
mapping found. In that case, only generate relationships of `peer network 
address <-> service name` and `peer network address <-> Service instance name`.
+
+For analysis of the client-side span (exit span), there could be three 
possibilities.
+1.     The peer in the exit span already has the alias names established by 
server-side span analysis from step (1). Then use alias names to replace the 
peer, and generate traffic of `current service name -> alias service name` and 
`current service instance name -> alias service instance name`.
+2.     If the alias could not be found, then just simply generate traffic for 
`current service name -> peer` and `current service instance name -> peer`.
+3.     If multiple alias names of `peer network address <-> Service Instance 
Name` could be found, then keep generating traffic for `current service name -> 
peer network address` and `current service instance name -> peer network 
address`.
+
+<p align="center">
+<img 
src="https://skywalking.apache.org/papers/STAM/STAM-topo-in-apache-skywalking.png"/>
+<br/>
+Figure 2, Apache SkyWalking uses STAM to detect and visualize the topology of 
distributed systems.
+</p>
+
+# Evaluation
+In this section, we evaluate the new models and analysis method in the context 
of several typical cases in which the old method loses timeliness and 
consistent accuracy.
+
+- 1.**New Service Online or Auto Scale Out**
+
+New services could be added into the whole topology at any time by the developer 
team, or automatically by a container operation platform following a scale out 
policy, like Kubernetes [5]. The monitoring system could not be notified manually 
in every case. By using STAM, we could detect the new node automatically and 
also keep the analysis process unblocked and consistent with detected nodes.
+In this case, a new service and network address (could be IP, port or both) 
are used. As the peer network address <-> service mapping does not exist yet, the 
traffic of client service -> peer network address will be generated and 
persisted in the storage first. After mapping is generated, further traffic of 
client-service to server-service could be identified, generated and aggregated 
in the analysis platform. For filling the gap of a few traffic before the 
mapping generated, we require doing  [...]
+
+<p align="center">
+<img src="https://skywalking.apache.org/papers/STAM/STAM-span-analysis.png"/>
+<br/>
+Figure 3, Span analysis by using the new topology analysis method
+</p>
+
+- 2.**Existing Uninstrumented Nodes**
+
+Every topology detection method has to work in this case. In many cases, there 
are nodes in the production environment that can’t be instrumented. Causes for 
this might include: (1) Restriction of the technology. In some applications 
written in Golang or C++, there is no easy way, as there is in Java or .NET, to do 
auto instrumentation by the agent. So, the codes may not be instrumented 
automatically. (2) The middleware, such as MQ, database server, has not adopted 
the tracing system. This would make it d [...]
+
+The STAM works well even if the client or server side has no instrumentation. 
It still keeps the topology as accurate as possible.
+
+If the client side isn’t instrumented, the server-side span wouldn’t get any 
reference through RPC context, so it would simply use peer to generate 
traffic, as shown in Figure 4.
+
+<p align="center">
+<img 
src="https://skywalking.apache.org/papers/STAM/STAM-no-client-instrumentation.png"/>
+<br/>
+Figure 4, STAM traffic generation when no client-side instrumentation
+</p>
+
+As shown in Figure 5, in the other case, with no server-side instrumentation, 
the client span analysis doesn’t need to process this case. The STAM analysis 
core just simply keeps generating client service->peer network address traffic. 
As there is no mapping for peer network address generated, there is no merging.
+
+<p align="center">
+<img 
src="https://skywalking.apache.org/papers/STAM/STAM-no-server-instrumentation.png"/>
+<br/>
+Figure 5, STAM traffic generation when no server-side instrumentation
+</p>
+
+- 3.**Uninstrumented Node Having Header Forward Capability**
+
+Besides the cases we evaluated in (2) Uninstrumented Nodes, there is one 
complex and special case: the uninstrumented node has the capability to propagate 
the header from downstream to upstream, typically a proxy, such as 
Envoy[11], Nginx[12], Spring Cloud Gateway[13]. As a proxy, it has the capability 
to forward all headers from downstream to upstream to keep some of the information 
information, and routing information,  [...]
+
+In this case, the proxy address would be used at the client side and propagated 
through RPC context as peer network address, and the proxy forwards this to 
different upstream services. Then STAM could detect this case and generate the 
proxy as a conjectural node. In the STAM, more than one alias name for this 
network address should be generated. After those two are detected and 
synchronized to the analysis node, the analysis core knows there is at least 
one uninstrumented service standin [...]
+
+<p align="center">
+<img 
src="https://skywalking.apache.org/papers/STAM/STAM-uninstrumentation-proxy.png"/>
+<br/>
+Figure 6, STAM traffic generation when the proxy is uninstrumented
+</p>
+
+# Conclusion
+
+This paper described the STAM, which is to the best of our knowledge the best 
topology detection method for distributed tracing systems. It replaces the 
time-window based topology analysis method for tracing-based monitoring 
systems. It removes the resource cost of disk and memory for time-window based 
analysis permanently and totally, and the barriers of horizontal scale. One 
STAM implementation, Apache SkyWalking, is widely used for monitoring hundreds 
of applications in production. S [...]
+
+# Acknowledgments
+We thank all contributors of the Apache SkyWalking project for suggestions, code 
contributions to implement the STAM, and feedback from using the STAM and 
SkyWalking in their production environment.
+
+# License
+This paper and the STAM are licensed under the Apache License 2.0.
+
+# References
+
+1. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, 
https://research.google.com/pubs/pub36356.html?spm=5176.100239.blogcont60165.11.OXME9Z
+1. Apache SkyWalking, http://skywalking.apache.org/
+1. Apache Open Users, https://skywalking.apache.org/users/
+1. Zipkin, https://zipkin.io/
+1. Kubernetes, Production-Grade Container Orchestration. Automated container 
deployment, scaling, and management. https://kubernetes.io/
+1. OpenTracing Specification 
https://github.com/opentracing/specification/blob/master/specification.md
+1. Apache Tomcat, http://tomcat.apache.org/
+1. Apache HttpComponents, https://hc.apache.org/
+1. Zipkin doc, ‘Instrumenting a library’ section, ‘Communicating trace 
information’ paragraph. https://zipkin.io/pages/instrumenting
+1. Jaeger Tracing, https://jaegertracing.io/
+1. Envoy Proxy, http://envoyproxy.io/
+1. Nginx, http://nginx.org/
+1. Spring Cloud Gateway, https://spring.io/projects/spring-cloud-gateway
+1. Envoy Route Configuration, 
https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/rds.proto.html?highlight=request_headers_to_
diff --git a/docs/menu.yml b/docs/menu.yml
index 7873289..7cd6116 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -36,6 +36,8 @@ catalog:
             path: "/en/concepts-and-designs/manual-sdk"
           - name: "Service Mesh probe"
             path: "/en/concepts-and-designs/service-mesh-probe"
+          - name: "STAM Paper, Streaming Topology Analysis Method"
+            path: "/en/papers/stam"
       - name: "Backend"
         catalog:
           - name: "Overview"
@@ -187,6 +189,8 @@ catalog:
         path: "/en/protocols/readme"
       - name: "Query Protocol (GraphQL)"
         path: "/en/protocols/query-protocol"
+  - name: "Academy"
+    path: "en/academy/list"
   - name: "FAQs"
     path: "/en/FAQ/readme"
   - name: "Changelog"
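
The streaming analysis described in the "New Topology Analysis Method" section of docs/en/papers/stam.md in the diff above can be sketched roughly as follows. This is only an illustrative reading of the paper, not SkyWalking's actual implementation: the names `Span`, `TopologyAnalyzer`, `aliases` and `traffic` are hypothetical, only service-level traffic is generated (instance-level traffic would be analogous), and entry spans that arrive without any propagated context are skipped here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span; field names are assumptions.
    span_type: str                          # "entry" or "exit"
    service: str
    instance: str
    peer: Optional[str] = None              # exit span: address the client used
    parent_service: Optional[str] = None    # entry span: from propagated context
    parent_instance: Optional[str] = None   # entry span: from propagated context
    parent_peer: Optional[str] = None       # entry span: peer of the client's exit span


class TopologyAnalyzer:
    def __init__(self):
        self.aliases = {}   # peer address -> set of (service, instance)
        self.traffic = []   # service-level edges as (source, destination)

    def analyze(self, span):
        # Each span is processed on arrival; no time window is needed.
        if span.span_type == "entry":
            self._analyze_entry(span)
        elif span.span_type == "exit":
            self._analyze_exit(span)

    def _analyze_entry(self, span):
        if span.parent_peer is None:
            return  # no propagated context; not covered by this sketch
        # Step 1: register the current service/instance as alias of the peer.
        known = self.aliases.setdefault(span.parent_peer, set())
        known.add((span.service, span.instance))
        if len(known) == 1:
            # Step 2: a single mapping, so link parent service to current service.
            self.traffic.append((span.parent_service, span.service))
        else:
            # A different mapping exists for this address (e.g. a proxy),
            # so only relate the peer address to the current service.
            self.traffic.append((span.parent_peer, span.service))

    def _analyze_exit(self, span):
        known = self.aliases.get(span.peer, set())
        if len(known) == 1:
            # Exactly one alias: replace the peer with the alias service.
            alias_service, _ = next(iter(known))
            self.traffic.append((span.service, alias_service))
        else:
            # No alias yet, or several aliases: keep the raw peer address.
            self.traffic.append((span.service, span.peer))
```

For example, an exit span from service `A` to `10.0.0.2:8080` first yields traffic to the raw address; once an entry span of service `B` reports that address through its context, later exit spans from `A` are attributed to `B`. If a second service later reports the same address, the sketch falls back to the address again, which corresponds to the proxy case in the paper.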
