[GitHub] [skywalking-website] wu-sheng commented on a diff in pull request #614: Add blog for instruduce continuous profiling feature

via GitHub Sun, 25 Jun 2023 06:45:28 -0700


wu-sheng commented on code in PR #614:
URL: https://github.com/apache/skywalking-website/pull/614#discussion_r1241185580



##########
content/blog/2023-06-25-intruducing-continuous-profiling-skywalking-with-ebpf/index.md:
##########
@@ -0,0 +1,235 @@
+---
+title: "Activating Automatic Performance Analysis -- Continuous Profiling"
+date: 2023-06-25
+author: "Han Liu"
+description: "Introduce and demonstrate how SkyWalking implements eBPF-based 
process monitoring with little manual engagement. Profiling can be 
activated automatically, driven by preset conditions."
+tags:
+- eBPF
+- Profiling
+- Tracing
+---
+
+# Background
+
+In previous articles, we have discussed how to use SkyWalking and eBPF for 
performance problem detection within 
[processes](/blog/2022-07-05-pinpoint-service-mesh-critical-performance-impact-by-using-ebpf)
 and [networks](blog/diagnose-service-mesh-network-performance-with-ebpf). 
+However, there are still three outstanding issues:
+
+1. **The timing of the task initiation**: It's always challenging to identify 
the processes that require performance monitoring when problems occur.
+Typically, manual engagement is required to identify the processes and the 
types of performance analysis necessary, which costs extra time during crash 
recovery.
+Locating the root cause and recovering quickly often conflict with each other.
+In practice, rebooting is usually the first choice for recovery, but it also 
destroys the scene of the crash.
+2. **Resource consumption of tasks**: It is difficult to determine the 
profiling scope. Wider profiling consumes more resources than necessary.
+We need a method to manage resource consumption and understand which processes 
necessitate performance analysis.
+3. **Engineer capabilities**: On-call duty is usually shared by the whole 
team, which includes both junior and senior engineers. Even senior engineers 
have limits in their understanding of a complex distributed system;
+it is nearly impossible for a single person to understand the whole system.
+
+**Continuous Profiling** is a newly created mechanism to resolve the above 
issues.
+
+# Mechanism
+
+If profiling tasks consume a significant amount of system resources, can we 
find alternative ways to monitor processes that use fewer system resources? The 
answer is yes. 
+Currently, SkyWalking establishes policy rules for specified target services, 
which are then monitored by the eBPF Agent in a low-overhead manner. 
+When a policy match occurs, a profiling task is automatically triggered.
+
+## Policy
+
+Policy rules specify how to monitor target processes and determine the type of 
profiling task to initiate when certain threshold conditions are met.
+
+These policy rules primarily consist of the following configuration 
information:
+
+1. **Monitoring type**: This specifies what kind of monitoring should be 
implemented on the target process.
+2. **Threshold determination**: This defines how to determine whether the 
target process requires the initiation of a profiling task.
+3. **Trigger task**: This specifies what kind of performance analysis task 
should be initiated.
+
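The three parts of a policy rule can be illustrated with a minimal sketch. It is not SkyWalking's actual configuration schema: the `PolicyRule` class, its field names, and the sliding-window logic (`period`, `count`) are assumptions introduced here only to show how a monitoring type, a threshold, and a trigger task might fit together.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PolicyRule:
    """One continuous-profiling rule (all field names are hypothetical)."""
    monitor_type: str   # what to monitor, e.g. "process_cpu"
    threshold: float    # value a sample must reach to count as a match
    period: int         # size of the sliding window, in samples (assumed)
    count: int          # matches required inside the window (assumed)
    trigger_task: str   # profiling task to start, e.g. "on_cpu"


def should_trigger(rule: PolicyRule, samples: Sequence[float]) -> bool:
    """Return True when enough recent samples reach the threshold."""
    window = samples[-rule.period:]
    matches = sum(1 for value in window if value >= rule.threshold)
    return matches >= rule.count


rule = PolicyRule("process_cpu", threshold=80.0, period=5, count=3,
                  trigger_task="on_cpu")
cpu_samples = [35.0, 82.5, 91.0, 40.0, 88.0]
if should_trigger(rule, cpu_samples):
    print(f"start {rule.trigger_task} profiling")
```

Requiring several matches inside a window, rather than a single spike, is one possible way to avoid starting a profiling task on short-lived noise.
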
+### Monitoring type
+
+The monitoring type determines which data values of a specified process are 
observed to generate the corresponding metrics. 
+These metric values are then used in the subsequent threshold judgment. 
+In eBPF observation, we believe the following metrics most directly reflect 
the current performance of the program:
+
+| Monitor Type | Unit | Description |
+|--------------|------|-------------|
+| System Load | Load | System load average over a specified period. |
+| Process CPU | Percentage | The CPU usage of the process as a percentage. |
+| Process Thread Count | Count | The number of threads in the process. |
+| HTTP Error Rate | Percentage | The percentage of HTTP requests that result in error responses (e.g., 4xx or 5xx status codes). |
+| HTTP Avg Response Time | Millisecond | The average response time for HTTP requests. |
+
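As a rough illustration of the two HTTP metrics in the table, the sketch below derives an error rate and an average response time from a batch of observed requests. The `http_metrics` function and the `(status_code, duration_ms)` input shape are assumptions made for this example, not the eBPF Agent's real data model.

```python
from typing import Iterable, Tuple


def http_metrics(requests: Iterable[Tuple[int, float]]) -> Tuple[float, float]:
    """Return (error rate in percent, average response time in ms).

    Each request is a (status_code, duration_ms) pair. Following the table,
    4xx and 5xx responses are counted as errors.
    """
    total = 0
    errors = 0
    total_ms = 0.0
    for status, duration_ms in requests:
        total += 1
        total_ms += duration_ms
        if 400 <= status < 600:
            errors += 1
    if total == 0:
        # No traffic observed: report zeros instead of dividing by zero.
        return 0.0, 0.0
    return 100.0 * errors / total, total_ms / total


observed = [(200, 12.0), (500, 30.0), (404, 8.0), (200, 10.0)]
error_rate, avg_ms = http_metrics(observed)
print(f"error rate: {error_rate}%, avg response time: {avg_ms} ms")
```
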
+#### Network related monitoring
+
+Monitoring network-related metrics is not as simple as obtaining basic process 
information. 
+It requires starting eBPF programs and attaching them to the target 
process for observation. 
+This is similar in principle to the [network profiling task we introduced in 
the previous 
article](blog/diagnose-service-mesh-network-performance-with-ebpf), 
+except that we no longer collect the full content of the data packets. 
Instead, we only collect the content of messages that match specified HTTP 
prefixes.
+
+By using this method, we can significantly reduce the number of times the 
kernel sends data to the user space, 
+and the user-space program can parse the data content with less system 
resource usage. This ultimately helps in conserving system resources.
+
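The prefix-matching idea can be sketched as follows. In practice the check would run inside the eBPF program in the kernel; this Python version only illustrates the decision in user-space terms. The `HTTP_PREFIXES` list and the function names are assumptions for illustration, not the prefixes the agent actually uses.

```python
from typing import Iterable, List

# Assumed prefixes: common HTTP/1.x request methods plus the response line.
HTTP_PREFIXES = (
    b"GET ", b"POST ", b"PUT ", b"DELETE ",
    b"HEAD ", b"PATCH ", b"OPTIONS ", b"HTTP/1.",
)


def looks_like_http(payload: bytes) -> bool:
    """Return True when the payload starts with one of the known prefixes."""
    return payload.startswith(HTTP_PREFIXES)


def filter_payloads(payloads: Iterable[bytes]) -> List[bytes]:
    """Keep only payloads worth handing over for further parsing."""
    return [p for p in payloads if looks_like_http(p)]


captured = [
    b"GET /health HTTP/1.1\r\nHost: demo\r\n\r\n",
    b"\x16\x03\x01\x02\x00",            # not HTTP, e.g. a TLS record
    b"HTTP/1.1 500 Internal Server Error\r\n\r\n",
]
kept = filter_payloads(captured)
print(f"forwarded {len(kept)} of {len(captured)} payloads")
```

Dropping non-matching payloads before they are handed over is what reduces the volume of data that needs to be parsed.
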
+#### Metrics collector
+
+When the eBPF Agent is monitoring a target process, it reports the 
collected data to the SkyWalking backend in the form of metrics. 
+This allows users to understand the real-time execution status promptly.

Review Comment:
   ```suggestion
   The eBPF agent would report metrics of processes periodically as follows to 
indicate the process performance in time.
   ```


