nisiyong opened a new issue #6703:
URL: https://github.com/apache/skywalking/issues/6703


   Please answer these questions before submitting your issue.
   
   - Why do you submit this issue?
     - [ ] Question or discussion
     - [ ] Bug
     - [ ] Requirement
     - [x] Feature or performance improvement
   
   ___
   ### Requirement or improvement
   
   The SkyWalking Java Agent is a powerful instrumentation tool that makes it much easier for us to build our tracing system.
   
   We have run SkyWalking with our Java applications in production for several months, and it mostly runs fine. Recently, some applications started to suffer from frequent GC and some hit OOM. We dumped the heap and, using [Memory Analyzer (MAT)](https://www.eclipse.org/mat/), found a large number of `TraceSegmentRef` objects in it. Here are two cases:
   
   #### Case 1: Frequent GC
   
   In this case, the app has 1000 Dubbo handler threads, and each handler performs many RPCs and DB operations.
   - JVM Max Heap:  8g
   - Machine: 8 core 16g
   - SkyWalking Agent: 8.4.0, collect all traces
   
   
![image](https://user-images.githubusercontent.com/8198862/113844121-bfac7000-97c6-11eb-8db2-580863c87cca.png)
   
   
![image](https://user-images.githubusercontent.com/8198862/113843193-f5048e00-97c5-11eb-8cd5-75a1ad543cc7.png)
   
   
   #### Case 2: OOM
   
   In this case, the app has 20 RocketMQ consumer threads, and each consumer thread performs some RPCs and DB operations.
   - JVM Max Heap:  8g
   - Machine: 8 core 16g
   - SkyWalking Agent: 8.4.0, collect all traces
   
   
![image](https://user-images.githubusercontent.com/8198862/113843899-912e9500-97c6-11eb-93db-4a955b9129fa.png)
   
   
![image](https://user-images.githubusercontent.com/8198862/113843489-31d08500-97c6-11eb-9492-9bfa28284c81.png)
   
   ---
   
   On the application side, I think there are 3 causes:
   1. A sudden spike in throughput keeps all threads busy handling requests.
   2. Each request performs many RPCs and DB operations, which creates a lot of spans.
   3. Requests are handled slowly; some take 10s or even longer.
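
The three factors above multiply: every busy thread keeps all spans of its in-flight request alive until that request ends. A rough back-of-the-envelope sketch, where only the 1000 threads come from Case 1 and the spans-per-request and bytes-per-span figures are hypothetical placeholders, not measurements:

```java
public class RetainedSpanEstimate {
    /**
     * Rough upper bound on the bytes held by unfinished segments:
     * each busy thread retains every span of its current request.
     */
    static long retainedBytes(int busyThreads, int spansPerRequest, int bytesPerSpan) {
        return (long) busyThreads * spansPerRequest * bytesPerSpan;
    }

    public static void main(String[] args) {
        int busyThreads = 1000;      // Dubbo handler threads from Case 1
        int spansPerRequest = 2000;  // hypothetical placeholder
        int bytesPerSpan = 512;      // hypothetical placeholder
        long bytes = retainedBytes(busyThreads, spansPerRequest, bytesPerSpan);
        System.out.println(bytes / (1024 * 1024) + " MiB retained while requests are in flight");
    }
}
```

The longer each request runs, the longer this amount stays reachable, which would be consistent with the frequent GC we see.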
   
   On the agent side, I have read the source code and understand some of the design:
   - A `Segment`, in SkyWalking terms, is the object stored in the RingBuffer on the client side, and SkyWalking has a consumer thread that drains the RingBuffer and sends the data to the OAP.
   - Before a `Segment` is put in the RingBuffer, it has to be built. Each request creates some spans, which are kept in a stack; the `Segment` is finished only when the stack is empty, which means the request has finished in the application. That can take some time. Meanwhile, the data is held in a thread-local, so the garbage collector cannot reclaim it before the request finishes.
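
To make the design described above concrete, here is a minimal sketch of how I understand it. It is not the agent's actual code: `SegmentLifecycleSketch`, `Context`, `Segment`, and the bounded queue standing in for the RingBuffer are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SegmentLifecycleSketch {
    /** Stand-in for a segment: all spans collected for one request on one thread. */
    static final class Segment {
        final List<String> finishedSpans = new ArrayList<>();
    }

    /** Per-thread tracing state; everything in here stays reachable until the request ends. */
    static final class Context {
        final Deque<String> activeSpans = new ArrayDeque<>();
        final Segment segment = new Segment();
    }

    static final ThreadLocal<Context> CONTEXT = ThreadLocal.withInitial(Context::new);

    // Bounded queue standing in for the RingBuffer that a consumer thread drains.
    static final BlockingQueue<Segment> BUFFER = new ArrayBlockingQueue<>(1024);

    static void startSpan(String name) {
        CONTEXT.get().activeSpans.push(name);
    }

    static void stopSpan() {
        Context ctx = CONTEXT.get();
        ctx.segment.finishedSpans.add(ctx.activeSpans.pop());
        if (ctx.activeSpans.isEmpty()) {
            // Only now is the segment complete and handed over.
            // In this sketch, offer() silently drops it if the queue is full.
            BUFFER.offer(ctx.segment);
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        startSpan("entry");
        startSpan("rpc");
        stopSpan();
        startSpan("db");
        stopSpan();
        // The "rpc" and "db" spans are finished but still held by the thread-local.
        System.out.println("segments queued before request ends: " + BUFFER.size());
        stopSpan(); // entry span ends, stack is empty, segment is queued
        System.out.println("segments queued after request ends: " + BUFFER.size());
    }
}
```

In this sketch, finished child spans cannot leave the thread until the entry span stops, so a request that runs for 10s pins all of its spans for 10s.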
   
   I wonder why the segment, rather than the span, is put in the ring buffer. Could we put spans there instead? I am not familiar with the purpose of the Segment design.
   I know we should also improve our application, but in some scenarios people can tolerate slow request handling. So what can the SkyWalking Java Agent do in such extreme scenarios? Application availability is very important, and none of us want the APM tool to occupy a lot of memory.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

