This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch profiling
in repository https://gitbox.apache.org/repos/asf/skywalking.git

commit 2ffc26d6d2f3794d71c0a5d43c7cf6ad0ce3689b
Author: Wu Sheng <[email protected]>
AuthorDate: Tue Nov 29 16:57:13 2022 +0800

    Add docs for profiling, and adjust menu items.
---
 docs/en/changes/changes.md                |  1 +
 docs/en/concepts-and-designs/profiling.md | 82 +++++++++++++++++++++++++++++++
 docs/menu.yml                             | 10 ++--
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md
index a615c064b7..412de8a128 100644
--- a/docs/en/changes/changes.md
+++ b/docs/en/changes/changes.md
@@ -197,5 +197,6 @@
 * Add new docs for `Report Span Attached Events` data collecting protocol.
 * Add new docs for `Record` query protocol
 * Update `Server Agents` and `Compatibility` for PHP agent.
+* Add docs for profiling.
 
 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/149?closed=1)
diff --git a/docs/en/concepts-and-designs/profiling.md b/docs/en/concepts-and-designs/profiling.md
new file mode 100644
index 0000000000..d198a372aa
--- /dev/null
+++ b/docs/en/concepts-and-designs/profiling.md
@@ -0,0 +1,82 @@
+# Profiling
+
+Profiling is an on-demand diagnostic method for locating the bottlenecks of services.
+The following typical scenarios are usually suitable for profiling with various profiling tools:
+
+1. Some methods slow down the API performance.
+2. Too many threads and/or high-frequency I/O per OS process reduce CPU efficiency.
+3. Massive RPC requests block the network and cause slow responses.
+4. Unexpected network requests caused by security issues or bugs in the code.
+
+In the SkyWalking landscape, we provide two ways to support profiling at a reasonable resource cost.
+
+1. In-process profiling is bundled with auto-instrument agents.
+2. Out-of-process profiling is powered by the eBPF agent.
+
+## In-process profiling
+
+In-process profiling is primarily provided by auto-instrument agents in VM-based runtimes.
+This feature resolves issue <1> by capturing snapshots of the thread stacks periodically.
+The OAP aggregates the thread stacks per RPC request and provides a hierarchy graph to indicate the slow methods
+based
+on continuous snapshots.
+
+The period is usually 10-100 milliseconds. A shorter period is not recommended, because each capture usually
+causes a classical stop-the-world pause for the VM, which impacts the performance of the whole process.
+
+Learn more technical details from the post, [**Use Profiling to Fix the Blind Spot of Distributed
+Tracing**](sdk-profiling.md).
+
+For now, Java and Python agents support this.
+
+## Out-of-process profiling
+
+Out-of-process profiling leverages [eBPF](https://ebpf.io/), a technology with origins in the Linux kernel.
+It provides a way to extend the capabilities of the kernel safely and efficiently.
+
+### On-CPU Profiling
+
+On-CPU profiling is suitable for analyzing thread stacks when service CPU usage is high.
+The more often a stack is dumped, the more CPU resources that thread stack occupies.
+
+This is similar to in-process profiling in resolving issue <1>, but it is done out-of-process and is based on
+Linux eBPF.
+It also covers languages without a VM mechanism, which in-process agents therefore cannot support, such as
+C/C++ and Rust. Golang is a special case: it exposes the metadata of the VM to eBPF, so it can be profiled.
+
+### Off-CPU Profiling
+
+Off-CPU profiling is suitable for performance issues that are not caused by high CPU usage but may still come with high CPU load.
+This profiling aims to resolve issue <2>.
+
+For example,
+
+1. When there are too many threads in one service, using off-CPU profiling could reveal which threads spend
+   more time context switching.
+2. Code that heavily relies on disk I/O or remote service performance slows down the whole process.
+
+Off-CPU profiling provides two perspectives:
+
+1. Thread switch count: The number of times a thread switches context. When the thread returns to the CPU, it completes
+   one context switch. A thread stack with a higher switch count spends more time context switching.
+2. Thread switch duration: The time it takes a thread to switch the context. A thread stack with a higher switch
+   duration spends more time off-CPU.
+
+Learn more technical details about on-CPU and off-CPU profiling from the post, [**Pinpoint Service Mesh Critical Performance Impact
+by using eBPF**](ebpf-cpu-profiling.md).
+
+### Network Profiling
+
+Network profiling captures network packets to analyze traffic at L4 (TCP) and L7 (HTTP) and to recognize network traffic
+from a specific process or a k8s pod. This traffic analysis helps locate the root causes of issues <3> and <4>.
+
+Network profiling provides:
+
+1. Network topology and process identification.
+2. TCP traffic metrics with TLS status.
+3. HTTP traffic metrics.
+4. Sampled raw HTTP request/response data within the tracing context.
+5. Time costs of local I/O on the OS, such as the time Linux takes to process an HTTP request/response.
+
+Learn more tech details from the post, [**Diagnose Service Mesh Network Performance with
+eBPF**](../academy/diagnose-service-mesh-network-performance-with-ebpf.md)
\ No newline at end of file
diff --git a/docs/menu.yml b/docs/menu.yml
index 2f85c091c4..170d98787d 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -34,16 +34,18 @@ catalog:
             path: "/en/concepts-and-designs/service-agent"
           - name: "Manual Instrument SDK"
             path: "/en/concepts-and-designs/manual-sdk"
-      - name: "Backend"
+      - name: "Observability Analysis Platform"
         catalog:
           - name: "Overview"
             path: "/en/concepts-and-designs/backend-overview"
-          - name: "Observability Analysis Language"
+          - name: "Analysis Streaming Traces and Mesh Traffic"
             path: "/en/concepts-and-designs/oal"
-          - name: "Meter Analysis Language"
+          - name: "Analysis Metrics and Meters"
             path: "/en/concepts-and-designs/mal"
-          - name: "Log Analysis Language"
+          - name: "Analysis Logs"
             path: "/en/concepts-and-designs/lal"
+          - name: "Profiling"
+            path: "/en/setup/backend/profiling"
           - name: "Query in OAP"
             path: "/en/protocols/readme#query-protocol"
       - name: "Event"
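
Reviewer note on the in-process profiling section of the new `profiling.md`: the idea of capturing thread-stack snapshots periodically and aggregating them into a hierarchy can be illustrated with a small sketch. This is not the SkyWalking agent or OAP code. It assumes CPython's `sys._current_frames()` as the snapshot source, and the names `sample_stacks`, `build_tree`, `handle_request`, and `slow_method` are hypothetical and chosen only for illustration.

```python
import sys
import threading
import time
from collections import Counter


def sample_stacks(thread_id, interval_s=0.01, duration_s=0.3):
    """Snapshot one thread's stack every interval_s and count identical stacks.

    Hypothetical helper: a stand-in for the periodic capture the doc describes.
    """
    counts = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # sys._current_frames() maps thread id -> that thread's current top frame.
        frame = sys._current_frames().get(thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            # Store the stack outermost-first so common prefixes can be merged.
            counts[tuple(reversed(stack))] += 1
        time.sleep(interval_s)
    return counts


def build_tree(counts):
    """Merge counted stacks into a nested dict: {method: [hits, children]}."""
    root = {}
    for stack, hits in counts.items():
        level = root
        for name in stack:
            node = level.setdefault(name, [0, {}])
            node[0] += hits
            level = node[1]
    return root


def slow_method():
    time.sleep(0.2)  # simulated slow call that the sampler should catch


def handle_request():
    slow_method()


worker = threading.Thread(target=handle_request)
worker.start()
snapshots = sample_stacks(worker.ident)  # 10 ms period, as in the 10-100 ms range
worker.join()
tree = build_tree(snapshots)
hottest = max(snapshots, key=snapshots.get)
print(" -> ".join(hottest), snapshots[hottest])
```

Methods that appear in many snapshots accumulate high hit counts in the tree, which is the signal a hierarchy graph would surface as a slow method. The sketch samples a single thread from a separate sampler thread; how the real agents pick threads and bound overhead is not covered here.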
