Hi,

I have read your comments and revised the doc accordingly. If there are no
major problems, I think I can go on to write the next chapter.

Sheng Wu <[email protected]> wrote on Wed, Dec 18, 2019 at 10:53 AM:

> Hi
>
> I have finished the review. Please take a look. Most of it is good; I
> just have some suggestions.
> I think we are close to starting the separate code-level tasks.
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
> han liu <[email protected]> wrote on Sun, Dec 15, 2019 at 2:20 PM:
>
>> I have answered and resolved all your questions in the doc.
>>
>> As for your UI design proposal, I think it is cool, and I have also
>> modified the prototype based on my understanding of it.
>> To make the prototype easier to view, here are the two prototype links:
>>
>> 1. add task modal: https://bwd5l7.axshare.com/#id=61ddhs&p=add-task-modal
>> 2. thread monitor page:
>>    https://bwd5l7.axshare.com/#g=1&id=8sgf47&p=thread-monitor-page
>>
>> Sheng Wu <[email protected]> wrote on Sat, Dec 14, 2019 at 8:28 PM:
>>
>>> Please update your whole document; there seem to be a lot of
>>> mismatches between different sections.
>>>
>>> han liu <[email protected]> wrote on Sat, Dec 14, 2019 at 2:54 PM:
>>>
>>>> Okay, I have fixed these two issues.
>>>>
>>>> 1. New field called maxThreadThresold.
>>>>    Added a new input box to the "add-task-modal" tab in the prototype
>>>>    Added new conditions limiting the size of the field in doc chapter
>>>>    "Conditions that can create thread monitoring tasks"
>>>>    Added a description of this field in doc chapter "Thread
>>>>    monitoring table structures"
>>>>
>>>> 2. Changed endpoint id to endpoint name.
>>>>    This field was modified in doc chapter "Thread monitoring table
>>>>    structures"
>>>>    Modified the input box in the "add-task-modal" tab in the prototype
>>>>
>>>> I will continue to work through the other issues in the comments.
>>>>
>>>> Sheng Wu <[email protected]> wrote on Sat, Dec 14, 2019 at 9:46 AM:
>>>>
>>>>> han liu <[email protected]> wrote on Fri, Dec 13, 2019 at 10:40 PM:
>>>>>
>>>>>> I see what you mean. I think this is a good feature, and I will
>>>>>> summarize it in the design doc.
>>>>>>
>>>>>> If you have any good suggestions, please let me know.
>>>>>
>>>>> A simple way should be enough: set up the sampling rule based on the
>>>>> first span's operation name, rather than the endpoint id.
>>>>> In this case, plus #4056 [1], there will be no id for a local span or
>>>>> an exit span, but those two are used as the first span in the async
>>>>> scenario.
>>>>>
>>>>> [1] https://github.com/apache/skywalking/issues/4056
>>>>>
>>>>>> Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 10:24 PM:
>>>>>>
>>>>>>> Hi Han Liu
>>>>>>>
>>>>>>> One more reminder: in theory, a trace id in one instance could
>>>>>>> have multiple threads being sampled, for example in cross-thread
>>>>>>> scenarios. We should also set a threshold for this. Maybe a max of
>>>>>>> 3 threads for one trace id?
>>>>>>>
>>>>>>> Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 1:03 PM:
>>>>>>>
>>>>>>>> Hi Han Liu and everyone
>>>>>>>>
>>>>>>>> I have submitted a design draft to the doc. Please take a look;
>>>>>>>> if you have any issues, please let me know. We could set up an
>>>>>>>> online meeting too.
>>>>>>>>
>>>>>>>> Sheng Wu <[email protected]> wrote on Thu, Dec 12, 2019 at 8:49 PM:
>>>>>>>>
>>>>>>>>> Hi Han Liu
>>>>>>>>>
>>>>>>>>> I have replied to the design with the most important key points
>>>>>>>>> I expect.
>>>>>>>>> Let's discuss those. Once we are on the same page, we can
>>>>>>>>> continue with more details.
>>>>>>>>>
>>>>>>>>> han liu <[email protected]> wrote on Thu, Dec 12, 2019 at 2:26 PM:
>>>>>>>>>
>>>>>>>>>> Due to formatting issues with my previous mailbox, I have
>>>>>>>>>> switched to a new one.
>>>>>>>>>>
>>>>>>>>>> I have completed some of the features in the Google doc; you
>>>>>>>>>> can add your comments and improvements. I will continue to
>>>>>>>>>> improve the remaining features in the documentation.
>>>>>>>>>> The document is the same one you previously sent me. To avoid
>>>>>>>>>> confusion, I'll post the link again here.
>>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
>>>>>>>>>>
>>>>>>>>>> Sheng Wu <[email protected]> wrote on Tue, Dec 10, 2019 at 10:46 AM:
>>>>>>>>>>
>>>>>>>>>>> 741550557 <[email protected]> wrote on Mon, Dec 9, 2019 at 9:42 PM:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your reply; the issues you mentioned are very
>>>>>>>>>>>> critical and meaningful.
>>>>>>>>>>>> Here I will answer what you mentioned. Sorry, I'm not good at
>>>>>>>>>>>> comment mode, so I use different colors and a “ “ prefix for
>>>>>>>>>>>> the Q&A.
>>>>>>>>>>>>
>>>>>>>>>>>>   As we already have a limit mechanism designed at the backend
>>>>>>>>>>>>   and agent side (according to your design), and the number
>>>>>>>>>>>>   would not be big (10 most likely), we just need a list to
>>>>>>>>>>>>   store the trace-id(s).
>>>>>>>>>>>>
>>>>>>>>>>>> If we just need a list to store the trace-id(s), how can I map
>>>>>>>>>>>> them to the thread? I hoped to use a map to quickly find thread
>>>>>>>>>>>> info from a trace-id.
>>>>>>>>>>>> How can I get the thread-stack information your way? Could you
>>>>>>>>>>>> please elaborate?
>>>>>>>>>>>
>>>>>>>>>>> Why do you need to do that? You just save a list of thread ids
>>>>>>>>>>> that should be thread dumped, and remove a thread id from it
>>>>>>>>>>> when its trace id is finished.
>>>>>>>>>>> This is easy to do with a loop search over the list, right?
>>>>>>>>>>> The thread stacks are in the list: they are stored in an
>>>>>>>>>>> element, and they are in a list too.
>>>>>>>>>>>
>>>>>>>>>>> I think you were thinking of putting all the stacks in a single
>>>>>>>>>>> map? That would cause a very dangerous memory and GC risk.
>>>>>>>>>>>
>>>>>>>>>>>>   Could you explain (2)? What do you mean by `stop`? I think
>>>>>>>>>>>>   your sampling mechanism should include the sampling duration.
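
The list-based bookkeeping suggested above could look roughly like the following. This is only a minimal sketch under assumptions: the class and method names (`MonitoredTrace`, `MonitoredTraceList`, `dumpAll`) are hypothetical and are not taken from the SkyWalking agent; the capacity of about 10 comes from the estimate in the thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical element: one monitored trace, its thread, and the dumps collected so far. */
final class MonitoredTrace {
    final String traceId;
    final Thread thread;
    final List<StackTraceElement[]> snapshots = new ArrayList<>();

    MonitoredTrace(String traceId, Thread thread) {
        this.traceId = traceId;
        this.thread = thread;
    }
}

/** Small bounded list; lookups are a linear scan, which is cheap for about 10 entries. */
final class MonitoredTraceList {
    private final int capacity;
    private final List<MonitoredTrace> entries = new ArrayList<>();

    MonitoredTraceList(int capacity) {
        this.capacity = capacity;
    }

    /** Returns false when the limit is reached, so the caller simply skips monitoring. */
    synchronized boolean add(String traceId, Thread thread) {
        if (entries.size() >= capacity) {
            return false;
        }
        entries.add(new MonitoredTrace(traceId, thread));
        return true;
    }

    /** Loop search by trace id; returns null when the trace is not monitored. */
    synchronized MonitoredTrace find(String traceId) {
        for (MonitoredTrace entry : entries) {
            if (entry.traceId.equals(traceId)) {
                return entry;
            }
        }
        return null;
    }

    /** Called when the trace finishes; the removed entry still holds its dumps for sending. */
    synchronized MonitoredTrace remove(String traceId) {
        for (Iterator<MonitoredTrace> it = entries.iterator(); it.hasNext();) {
            MonitoredTrace entry = it.next();
            if (entry.traceId.equals(traceId)) {
                it.remove();
                return entry;
            }
        }
        return null;
    }

    /** Takes one stack snapshot for every monitored thread. */
    synchronized void dumpAll() {
        for (MonitoredTrace entry : entries) {
            entry.snapshots.add(entry.thread.getStackTrace());
        }
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Because each element owns its own snapshot list, removing a finished trace drops its stacks in one step, and there is no shared map holding every stack at once.
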
>>>>>>>>>>>>
>>>>>>>>>>>> As for the communication between the sniffer and the OAP
>>>>>>>>>>>> server, I hope the sniffer only needs to fetch the
>>>>>>>>>>>> thread-monitor tasks that need to be monitored at that time.
>>>>>>>>>>>> The termination can be triggered by the sniffer or by the OAP
>>>>>>>>>>>> server.
>>>>>>>>>>>> If it relies only on an OAP server notification, it may be more
>>>>>>>>>>>> complicated, because the OAP server needs to record that the
>>>>>>>>>>>> sniffer has received the current command, and the sniffer is
>>>>>>>>>>>> not stable: for example, if the sniffer has shut down when
>>>>>>>>>>>> receiving the command, no thread information will have been
>>>>>>>>>>>> collected.
>>>>>>>>>>>>
>>>>>>>>>>>> I think having the OAP server actively calculate the
>>>>>>>>>>>> termination makes the monitoring more controllable; of course,
>>>>>>>>>>>> the client can also actively report the end.
>>>>>>>>>>>> I think it's necessary to provide a protection mechanism for
>>>>>>>>>>>> the sniffer side, so that monitoring can be released quickly
>>>>>>>>>>>> during a business peak period or when the probe suddenly
>>>>>>>>>>>> occupies a lot of CPU / memory resources. Therefore, a
>>>>>>>>>>>> stop-monitoring function can be provided in the UI, so that the
>>>>>>>>>>>> sniffer can recover.
>>>>>>>>>>>> Sampling duration is required, but only as the default
>>>>>>>>>>>> termination condition for thread-monitor.
>>>>>>>>>>>
>>>>>>>>>>> But you should know that, in the real case, the thread dump
>>>>>>>>>>> monitor is a sampling mechanism; it is even hard for you to know
>>>>>>>>>>> where the dumps are happening. Then you have to send the stop
>>>>>>>>>>> notification to every instance.
>>>>>>>>>>> Even if you could send the notification, could you explain how
>>>>>>>>>>> you would know when to stop?
>>>>>>>>>>> The scenario is: you are facing an issue which traces and
>>>>>>>>>>> metrics can't explain, so you activate thread dump, right? And
>>>>>>>>>>> at the same time, you want to stop?
>>>>>>>>>>>
>>>>>>>>>>> CPU and memory resources should be guaranteed at the design
>>>>>>>>>>> level, such as
>>>>>>>>>>> 1. Limited thread dump tasks for one service.
>>>>>>>>>>> 2. Limited thread dump traces in a certain time window.
>>>>>>>>>>> For example, the OAP backend/UI would say you could only
>>>>>>>>>>> 1. Set 3 thread dump commands in the same time window.
>>>>>>>>>>> 2. Every command requires that the number of sampled thread dump
>>>>>>>>>>> traces be less than 5. At the same time, to make this sampling
>>>>>>>>>>> work, only activate sampling thread dump after the trace has
>>>>>>>>>>> executed for more than 200ms (the value is an example only).
>>>>>>>>>>> 3. Thread dumps could be sent to the backend during sampling to
>>>>>>>>>>> reduce the memory cache.
>>>>>>>>>>> 4. Thread dump period should not be less than 5ms; 20ms is
>>>>>>>>>>> recommended.
>>>>>>>>>>> 5. How deep the thread dump should go.
>>>>>>>>>>>
>>>>>>>>>>> We need a very detailed design. The above are just my thoughts,
>>>>>>>>>>> to share the idea: the safety of the agent should not depend on
>>>>>>>>>>> a UI button.
>>>>>>>>>>> Otherwise, your online system will be in great danger, which is
>>>>>>>>>>> not the design goal of SkyWalking.
>>>>>>>>>>>
>>>>>>>>>>>>   The sampling period depends on how you are going to
>>>>>>>>>>>>   visualize it.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, I agree. I hope to provide a select/input in the UI so
>>>>>>>>>>>> that the trace count and time window are configurable. Of
>>>>>>>>>>>> course, this is my current idea; if there are other plans, I
>>>>>>>>>>>> will adopt them.
>>>>>>>>>>>>
>>>>>>>>>>>>   Highly doubt this. Reduce the memory, maybe, but only if the
>>>>>>>>>>>>   code is running a loop or facing a lock issue. If it is
>>>>>>>>>>>>   neither of these two, the stacks are different.
>>>>>>>>>>>>   Also, please consider the CPU cost of comparing the stacks.
>>>>>>>>>>>>   You need a performance benchmark to verify this if you want
>>>>>>>>>>>>   it.
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't understand the first sentence.
>>>>>>>>>>>> In my personal experience, most of the cases are blocking on a
>>>>>>>>>>>> lock (socket/local) or running a loop. I can't think of any
>>>>>>>>>>>> other cases.
>>>>>>>>>>>> For the second sentence, I think I can add a
>>>>>>>>>>>> thread-stack-element field to store the top-level element of
>>>>>>>>>>>> the last stack information. The next time I get stack
>>>>>>>>>>>> information, I can compare whether the current top-level
>>>>>>>>>>>> element is the same as that field.
>>>>>>>>>>>> I do this mainly to keep duplicate thread-stack information
>>>>>>>>>>>> from taking up too much memory space; it is a way of optimizing
>>>>>>>>>>>> memory. We can consider removing it, or do you have a better
>>>>>>>>>>>> memory-saving solution? After all, memory and CPU resources are
>>>>>>>>>>>> very valuable in the sniffer.
>>>>>>>>>>>
>>>>>>>>>>> I know you mean to reduce the memory, but have you considered
>>>>>>>>>>> how much CPU a full thread dump comparison will cost? A thread
>>>>>>>>>>> dump could easily be hundreds of lines in Java.
>>>>>>>>>>> I mean this is a tradeoff, CPU or memory. If you are just using
>>>>>>>>>>> limited memory, you could send the earlier snapshots to the
>>>>>>>>>>> backend while collecting new ones, or even save them to disk (if
>>>>>>>>>>> really necessary).
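
To make the CPU-versus-memory tradeoff above concrete, here is a rough sketch of the cheaper variant proposed in the thread: compare only the top frame of the new dump with the previous one and bump a counter on a match, instead of comparing full dumps. The names (`StackSnapshot`, `SnapshotBuffer`, `repeatCount`) are hypothetical, and the top-frame check is only a heuristic: two dumps that share a top frame but differ further down would be merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stored dump: truncated frames plus how many consecutive times it was seen. */
final class StackSnapshot {
    final StackTraceElement[] frames;
    int repeatCount = 1;

    StackSnapshot(StackTraceElement[] frames) {
        this.frames = frames;
    }
}

/** Keeps dumps for one thread, collapsing consecutive dumps that share the same top frame. */
final class SnapshotBuffer {
    private final int maxDepth;
    private final List<StackSnapshot> snapshots = new ArrayList<>();
    private StackTraceElement lastTop;

    SnapshotBuffer(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    void record(StackTraceElement[] stack) {
        if (stack.length == 0) {
            return; // nothing to record
        }
        StackTraceElement top = stack[0];
        if (top.equals(lastTop) && !snapshots.isEmpty()) {
            // One equals() call instead of a full comparison of hundreds of frames.
            snapshots.get(snapshots.size() - 1).repeatCount++;
            return;
        }
        lastTop = top;
        // Limit the stored depth so one dump cannot grow without bound.
        int depth = Math.min(stack.length, maxDepth);
        snapshots.add(new StackSnapshot(Arrays.copyOf(stack, depth)));
    }

    List<StackSnapshot> snapshots() {
        return snapshots;
    }
}
```

Whether this saves enough memory to justify even the single comparison is exactly what the benchmark requested in the thread would need to show.
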
>>>>>>>>>>> In my experience, compression is always very high risk in the
>>>>>>>>>>> agent. If you want to do that, you need a benchmark test to
>>>>>>>>>>> prove that this CPU cost is small enough.
>>>>>>>>>>>
>>>>>>>>>>>>   The trace number and time window should be configurable;
>>>>>>>>>>>>   that is what I mean by more complex. In the current
>>>>>>>>>>>>   SamplingService, it is only n traces per 3 seconds. But
>>>>>>>>>>>>   here, it is a dynamic rule.
>>>>>>>>>>>>
>>>>>>>>>>>> I expect that a specific trace count and time window can be
>>>>>>>>>>>> configured at the UI level, as I said above.
>>>>>>>>>>>> As for SamplingService, my previous tech design was not
>>>>>>>>>>>> rigorous enough, and there were indeed problems.
>>>>>>>>>>>> Maybe we need to add a new SamplingService that builds a
>>>>>>>>>>>> mapping based on endpoint-id and AtomicInteger.
>>>>>>>>>>>> For `first 5 traces of this endpoint in the next 5 mins`, we
>>>>>>>>>>>> just need to increment it.
>>>>>>>>>>>> For sampling, maybe use another scheduled task to reset the
>>>>>>>>>>>> AtomicInteger value.
>>>>>>>>>>>
>>>>>>>>>>> You could avoid the map by using an ArrayList with
>>>>>>>>>>> RangeAtomicInteger (SkyWalking provides that) to let the trace
>>>>>>>>>>> context get a slot.
>>>>>>>>>>> Also, since you are considering `active sampling after trace
>>>>>>>>>>> execution time more than xxx ms`, you should add a removal
>>>>>>>>>>> mechanism at runtime.
>>>>>>>>>>> Anyway, try your best to avoid using a Map, especially as this
>>>>>>>>>>> map could be changed at runtime.
>>>>>>>>>>>
>>>>>>>>>>>>   I think there should at least be a new level-one page called
>>>>>>>>>>>>   configuration or command page, which could set up multiple
>>>>>>>>>>>>   sampling rules and visualize the existing tasks and related
>>>>>>>>>>>>   sampling data.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's necessary to add a new page for configuring
>>>>>>>>>>>> thread-monitor tasks, and I think the specific UI should be
>>>>>>>>>>>> designed in detail.
>>>>>>>>>>>> For example, what I expect is similar to the trace page. The
>>>>>>>>>>>> left side displays the configuration, and the right side
>>>>>>>>>>>> quickly displays the related trace list. When clicked, it links
>>>>>>>>>>>> to the trace page and displays the sidebox.
>>>>>>>>>>>> I'm not good at this. Do you have any good plans?
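
The advice earlier in this exchange to avoid a runtime-mutable Map and let the trace context claim a slot could be sketched as below. Note the assumption: this uses the plain JDK `AtomicReferenceArray` rather than the SkyWalking helper class named in the thread, and `ProfilingSlots`, `tryAcquire` and `release` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Fixed number of profiling slots; a trace claims one with CAS and frees it when it finishes. */
final class ProfilingSlots {
    private final AtomicReferenceArray<String> slots;

    ProfilingSlots(int size) {
        this.slots = new AtomicReferenceArray<>(size);
    }

    /** Claims the first free slot for the trace; returns its index, or -1 when all are taken. */
    int tryAcquire(String traceId) {
        for (int i = 0; i < slots.length(); i++) {
            if (slots.compareAndSet(i, null, traceId)) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Frees the slot if it is still owned by this trace. compareAndSet compares references,
     * so the caller must pass the same String instance it used in tryAcquire.
     */
    boolean release(int index, String traceId) {
        return slots.compareAndSet(index, traceId, null);
    }
}
```

The array never grows, so the slot count doubles as the hard limit on concurrently profiled traces, and the release call is the runtime removal mechanism asked for above.
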
>>>>>>>>>>>
>>>>>>>>>>> UI is hard to discuss in text, so I am pretty sure we need a
>>>>>>>>>>> demo (not necessarily code; I mean drawn with a tool).
>>>>>>>>>>> It is OK to show a trace with thread dumps on another page, even
>>>>>>>>>>> better linked to your task ID.
>>>>>>>>>>> But this kind of abstract description, with no details I mean,
>>>>>>>>>>> is hard to continue with.
>>>>>>>>>>>
>>>>>>>>>>>> And I feel that the two of us have a different understanding of
>>>>>>>>>>>> the configuration object. I think it is more of a task than a
>>>>>>>>>>>> command. I don't know which way is better?
>>>>>>>>>>>> I suddenly thought of a problem. Some real problems are often
>>>>>>>>>>>> triggered at a specific period, such as a fixed business peak
>>>>>>>>>>>> period, and we cannot guarantee that the user will be operating
>>>>>>>>>>>> the UI then.
>>>>>>>>>>>> So should a task mechanism be adopted to ensure that it can run
>>>>>>>>>>>> at a specific period?
>>>>>>>>>>>
>>>>>>>>>>> This makes sense to me, and it is just an enhancement. It is
>>>>>>>>>>> just a start-time sampling rule.
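
Putting the limits discussed so far together with a start time, a sampling rule might be modelled as follows. This is a sketch under assumptions, not the design from the doc: `ThreadDumpRule` and its fields are hypothetical, and the numbers in the usage note (5 traces, 5 minutes, 200ms, 20ms) are the example values quoted in the thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical rule: "first N traces of this endpoint within a window that starts at a given time". */
final class ThreadDumpRule {
    private final String endpointName;
    private final long startTimeMillis;
    private final long windowMillis;
    private final int maxTraces;
    private final long minTraceDurationMillis;
    private final long dumpPeriodMillis;
    private final AtomicInteger sampled = new AtomicInteger();

    ThreadDumpRule(String endpointName, long startTimeMillis, long windowMillis,
                   int maxTraces, long minTraceDurationMillis, long dumpPeriodMillis) {
        if (dumpPeriodMillis < 5) {
            // The thread suggests a 5ms floor for the dump period.
            throw new IllegalArgumentException("dump period must be at least 5ms");
        }
        this.endpointName = endpointName;
        this.startTimeMillis = startTimeMillis;
        this.windowMillis = windowMillis;
        this.maxTraces = maxTraces;
        this.minTraceDurationMillis = minTraceDurationMillis;
        this.dumpPeriodMillis = dumpPeriodMillis;
    }

    /** The rule only applies inside [start, start + window). */
    boolean isActive(long nowMillis) {
        return nowMillis >= startTimeMillis && nowMillis < startTimeMillis + windowMillis;
    }

    /** Called when a trace starts; admits at most maxTraces traces for the endpoint. */
    boolean trySample(String endpoint, long nowMillis) {
        if (!endpointName.equals(endpoint) || !isActive(nowMillis)) {
            return false;
        }
        int current;
        do {
            current = sampled.get();
            if (current >= maxTraces) {
                return false;
            }
        } while (!sampled.compareAndSet(current, current + 1));
        return true;
    }

    /** Dumping starts only once the trace has already run for the minimum duration. */
    boolean shouldStartDumping(long traceElapsedMillis) {
        return traceElapsedMillis >= minTraceDurationMillis;
    }

    long dumpPeriodMillis() {
        return dumpPeriodMillis;
    }
}
```

For instance, `new ThreadDumpRule("/orders", peakStart, 5 * 60_000L, 5, 200L, 20L)` (with `peakStart` a caller-supplied epoch timestamp) would describe a task scheduled for a business peak without anyone needing to click in the UI at that moment.
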
>>>>>>>>>>>
>>>>>>>>>>>>   We don't have a separate thread monitor view table. How
>>>>>>>>>>>>   about we add an icon in the segment list, and add an icon at
>>>>>>>>>>>>   the first span of this segment in the trace detail view?
>>>>>>>>>>>>   I think the latter should be an entrance to the thread view.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's a good idea. The link I mentioned in one of the
>>>>>>>>>>>> answers above is, I think, also a convenient entry point.
>>>>>>>>>>>> The switch button I mentioned earlier is only a data filter in
>>>>>>>>>>>> the trace list query and does not need a separate table UI.
>>>>>>>>>>>
>>>>>>>>>>> As you intend to have a separated page for thread sampling, it
>>>>>>>>>>> is OK to
>>>>>>>>>>>
>>>>>>>>>>>>   If you have some visualization idea, drawn by any tool you
>>>>>>>>>>>>   like that supports comments, we could discuss it there. In
>>>>>>>>>>>>   my mind, we should support visualizing the thread dump stack
>>>>>>>>>>>>   over the time window, and support aggregating them by
>>>>>>>>>>>>   choosing the continuous stack snapshots in the time window.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we should find someone who is better at front-end work
>>>>>>>>>>>> to discuss this with, because it depends on how the front-end
>>>>>>>>>>>> UI can display it.
>>>>>>>>>>>> BTW: I can provide code for the OAP server and sniffer, but the
>>>>>>>>>>>> frontend may need help from the community. I hope front-end
>>>>>>>>>>>> friends can join the discussion.
>>>>>>>>>>>
>>>>>>>>>>> Once you have the demo, I could loop our UI committers in for UI
>>>>>>>>>>> side development. But UI committers may not be familiar with the
>>>>>>>>>>> thread dump context story. We need to resolve that first.
>>>>>>>>>>> Let's start with a demo, such as some slides on Google Docs?
>>>>>>>>>>>
>>>>>>>>>>>> The above is my answer to all the questions, and I look forward
>>>>>>>>>>>> to your reply at any time. As more and more discussion takes
>>>>>>>>>>>> place, the details become more and more complete. This is good.
>>>>>>>>>>>> Everyone is welcome to join the discussion; if you have any
>>>>>>>>>>>> questions or good ideas, please let me know.
>>>>>>>>>>>
>>>>>>>>>>> I think we could move the discussion to the design doc as the
>>>>>>>>>>> next step.
>>>>>>>>>>>
>>>>>>>>>>> Please use this
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
>>>>>>>>>>>
>>>>>>>>>>> Write the design including
>>>>>>>>>>> 1. Key features
>>>>>>>>>>> 2. Protocol
>>>>>>>>>>> 3. Work mechanism
>>>>>>>>>>> 4. UI design, prototype
>>>>>>>>>>> and anything you think important before writing code.
>>>>>>>>>>>
>>>>>>>>>>> This is the SkyWalking CLI design doc; you could use it as a
>>>>>>>>>>> reference.
>>>>>>>>>>>
>>>>>>>>>>> https://docs.google.com/document/d/1WBnRNF0ABxaSdBZo6Gv2hMzCQzj04YAePUdOyLWHWew/edit#
>>>>>>>>>>>
>>>>>>>>>>>> Original message
>>>>>>>>>>>> From: Sheng [email protected]
>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>> Sent: Mon, Dec 9, 2019 10:50
>>>>>>>>>>>> Subject: Re: A proposal for Skywalking(thread monitor)
>>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for writing this proposal with a detailed design. My
>>>>>>>>>>>> comments are inline.
>>>>>>>>>>>>
>>>>>>>>>>>> 741550557 [email protected] wrote on Sun, Dec 8, 2019 at 11:22 PM:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your reply. I have carefully read the issues you
>>>>>>>>>>>>> mentioned, and they are very meaningful and critical. I will
>>>>>>>>>>>>> give you technical details about them below. I find these
>>>>>>>>>>>>> issues are related, so I will explain them in different
>>>>>>>>>>>>> dimensions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Use a different protocol to transmit trace and thread-stack:
>>>>>>>>>>>>> 1. Add a boolean field in the segment data to record whether
>>>>>>>>>>>>> it has been thread monitored. It is also good for filtering
>>>>>>>>>>>>> monitored traces in the UI.
>>>>>>>>>>>>> 2. Add a new BootService that stores a Map recording the
>>>>>>>>>>>>> relation between trace-id and thread-stack information.
>>>>>>>>>>>>
>>>>>>>>>>>> As we already have a limit mechanism designed at the backend
>>>>>>>>>>>> and agent side (according to your design), and the number would
>>>>>>>>>>>> not be big (10 most likely), we just need a list to store the
>>>>>>>>>>>> trace-id(s).
>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Listen to TracingContextListener#afterFinished: if the
>>>>>>>>>>>>> current segment has been thread monitored, mark that the
>>>>>>>>>>>>> current trace-id doesn't need to be monitored anymore.
>>>>>>>>>>>>> (Because removing while for-each iterating over the step 2
>>>>>>>>>>>>> map would fail and throw an exception.)
>>>>>>>>>>>>> 4. When the thread-monitor main thread runs, it iterates over
>>>>>>>>>>>>> the step 2 map and checks which entries don't need monitoring
>>>>>>>>>>>>> anymore; I will put their data into a new data carrier.
>>>>>>>>>>>>> 5. Generate a new thread-monitor gRPC protocol to send data
>>>>>>>>>>>>> from the data carrier.
>>>>>>>>>>>>
>>>>>>>>>>>> The agent side design seems pretty good.
>>>>>>>>>>>>
>>>>>>>>>>>>> The server receives thread-stack logic:
>>>>>>>>>>>>> 1. Store thread-stack information and trace-id/segment-id
>>>>>>>>>>>>> relations in a different table.
>>>>>>>>>>>>> 2. Check whether the thread-monitor needs to be stopped, on
>>>>>>>>>>>>> receiving data or on a schedule.
>>>>>>>>>>>>
>>>>>>>>>>>> Could you explain (2)? What do you mean by `stop`? I think your
>>>>>>>>>>>> sampling mechanism should include the sampling duration.
>>>>>>>>>>>>
>>>>>>>>>>>>> Reduce CPU and memory in the sniffer:
>>>>>>>>>>>>> 1. Through the thread monitoring configuration in the UI, you
>>>>>>>>>>>>> can configure the performance loss. For example, set the
>>>>>>>>>>>>> monitoring level: fast monitoring (100ms), medium speed
>>>>>>>>>>>>> monitoring (500ms), slow speed monitoring (1000ms).
>>>>>>>>>>>>
>>>>>>>>>>>> The sampling period depends on how you are going to visualize
>>>>>>>>>>>> it.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Add a new integer field per thread-stack: if the last
>>>>>>>>>>>>> element of the current thread-stack is the same as last time,
>>>>>>>>>>>>> it doesn't need to be stored, just increment the field. I
>>>>>>>>>>>>> think it will save a lot of memory space.
>>>>>>>>>>>>
>>>>>>>>>>>> Highly doubt this. Reduce the memory, maybe, but only if the
>>>>>>>>>>>> code is running a loop or facing a lock issue. If it is neither
>>>>>>>>>>>> of these two, the stacks are different. Also, please consider
>>>>>>>>>>>> the CPU cost of comparing the stacks. You need a performance
>>>>>>>>>>>> benchmark to verify this if you want it.
>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Create a new VM arg to set the thread-monitor pool size.
>>>>>>>>>>>>> It depends on the user; maybe default 3? (This can be
>>>>>>>>>>>>> discussed later.)
>>>>>>>>>>>>
>>>>>>>>>>>> I think the UI limit is enough. 3 seems good to me.
>>>>>>>>>>>>
>>>>>>>>>>>>> 4. Limit thread-stack-element size to 100; I think that
>>>>>>>>>>>>> already covers most scenarios. A new VM arg can also be
>>>>>>>>>>>>> created if needed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Multiple sampling methods to choose from (just my current
>>>>>>>>>>>>> thoughts, more can be added):
>>>>>>>>>>>>> 1. Based on the current client SamplingService, extract a new
>>>>>>>>>>>>> factor holder to increment, and reset it on a schedule.
>>>>>>>>>>>>
>>>>>>>>>>>> Yours may be a little more complex than the current
>>>>>>>>>>>> SamplingService, right? Based on the next rule.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. `first 5 traces of this endpoint in the next 5 mins` is a
>>>>>>>>>>>>> good idea. My understanding is that within a few minutes,
>>>>>>>>>>>>> each instance can send a specified number of traces.
>>>>>>>>>>>>
>>>>>>>>>>>> The trace number and time window should be configurable; that
>>>>>>>>>>>> is what I mean by more complex. In the current SamplingService,
>>>>>>>>>>>> it is only n traces per 3 seconds. But here, it is a dynamic
>>>>>>>>>>>> rule.
>>>>>>>>>>>>
>>>>>>>>>>>>> UI settings and sniffer perception:
>>>>>>>>>>>>> 1. Create a new button on the dashboard page that can create
>>>>>>>>>>>>> or stop a thread-monitor. It can dynamically load the
>>>>>>>>>>>>> thread-monitor status when an endpoint is reselected.
>>>>>>>>>>>>
>>>>>>>>>>>> I think there should at least be a new level-one page called
>>>>>>>>>>>> configuration or command page, which could set up multiple
>>>>>>>>>>>> sampling rules and visualize the existing tasks and related
>>>>>>>>>>>> sampling data.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. The sniffer creates a new scheduled task to check every 5
>>>>>>>>>>>>> seconds whether the current service has endpoints that need
>>>>>>>>>>>>> monitoring. (I see the current sniffer has command functions;
>>>>>>>>>>>>> I feel the principle is the same as the scheduler.)
>>>>>>>>>>>>
>>>>>>>>>>>> Seems reasonable.
>>>>>>>>>>>>
>>>>>>>>>>>>> Thread-monitor on the UI (these are just my initial thoughts;
>>>>>>>>>>>>> I think there will be a better way to show it):
>>>>>>>>>>>>> 1. When switching to the trace page, I think we need to add a
>>>>>>>>>>>>> new switch button to filter thread-monitor traces.
>>>>>>>>>>>>> 2. Put a new thread-monitor icon on the segment. It means the
>>>>>>>>>>>>> segment has thread-stack information.
>>>>>>>>>>>>
>>>>>>>>>>>> We don't have a separate thread monitor view table. How about
>>>>>>>>>>>> we add an icon in the segment list, and add an icon at the
>>>>>>>>>>>> first span of this segment in the trace detail view? I think
>>>>>>>>>>>> the latter should be an entrance to the thread view.
>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Show it in the information sidebox when the user clicks
>>>>>>>>>>>>> the thread-monitor segment (any span). Create a new tab, like
>>>>>>>>>>>>> the log tab.
If you have a visualization idea, draw it with any tool you like that supports comments, and we could discuss it there. In my mind, we should support visualizing the thread dump stacks across the time windows, and support aggregating them by choosing consecutive stack snapshots in the time window.

These are just a description of my current implementation details for thread-monitor. If they seem to work, I can do some time planning for these tasks.

Sorry, my English is not very good; I hope you can understand. There may be some problems in these, so any good ideas or suggestions are welcome.

I very much appreciate you leading this new direction. It is a long-term task but should be interesting. :) Good work, carry on.

Original message
From: Sheng [email protected]
To: [email protected]
Sent: December 8, 2019 (Sunday) 08:31
Subject: Re: A proposal for Skywalking(thread monitor)

First of all, thanks for your proposal. Thread monitoring is super important for application performance. So basically, I agree with this proposal. But for the tech details, I think we need more discussion on the following points.

1. Do you want to add thread status to the trace? If so, why not consider this as a UI-level join?
Because we could know the thread id in the trace when we create a span, right? Then, once we have all the thread dumps (if any), we could ask the UI to query a specific thread context based on timestamp and thread number(s).

2. For the thread dump, I don't know whether you have done a performance evaluation for this operation. From my experience, `get all need thread monitor segment every 100 milliseconds` has a very high cost in your application and agent. So, you may need to be careful about doing this.

3. An endpoint-related thread dump with some sampling mechanism makes more sense to me. And this should be activated by the UI. We should only provide a conditional thread dump sampling mechanism, such as `first 5 traces of this endpoint in the next 5 mins`.

Jian Tan, I think DaoCloud has also customized this feature in your internal SkyWalking. Could you share what you do?

Sheng Wu 吴晟
Twitter, wusheng1108

741550557 [email protected] wrote on Sunday, December 8, 2019 at 12:14 AM:

Hello everyone, I would like to share a new feature for SkyWalking, called "thread monitor".
Background

When our company started using SkyWalking for APM, we found that many traces did not have enough spans to fill them up. We suspected that there were either third-party frameworks that we didn't enhance, or API usage errors by programmers, such as a Java CountDown number of 3 with only 2 countdowns. So we decided to write a new feature that monitors the thread stack of an executing trace. Then we can get more information on the trace and quickly know what is happening on it.

Structure

Agent(thread monitor) - gRPC protocol - OAP Server(Storage) - Skywalking-Rocketbot-UI

More detail

OAP Server:
1. Store which traces need to be monitored (I suggest storing this on the endpoint, adding a new boolean field named needThreadMonitor).
2. Provide a GraphQL API to change the endpoint monitor status.
3. When parsing a monitored trace, store the thread stack if the segment has any thread info.

Skywalking-Rocketbot-UI:
1. Add a new switch button on the dashboard. It can read or modify the endpoint status.
2. It will show every thread stack when a trace detail is clicked.

Agent:
1.
Set up two new BootService instances:
   1) Find any endpoint in the current service that needs thread monitoring. It starts as a new scheduled task and runs every minute.
   2) Start a new scheduled task that gets all segments needing thread monitoring every 100 milliseconds, and puts a new thread dump task into a global thread pool (fixed size, default count 3).
2. Check whether the endpoint needs thread monitoring when creating an entry/local span (TracingContext#createEntry/LocalSpan). If it does, it will be marked and put into the thread monitor map.
3. When the TracingContext finishes, it will get the threads that have been monitored and send all thread stacks to the server.

Finally, I don't know whether this is a good way to get more information on a trace. If you have any good ideas or suggestions on this, please let me know.

Mrpro

--
Sheng Wu 吴晟

Apache SkyWalking
Apache Incubator
Apache ShardingSphere, ECharts, DolphinScheduler podlings
Zipkin
Twitter, wusheng1108
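
For reference, the conditional sampling rule discussed in this thread (`first 5 traces of this endpoint in the next 5 mins`) could be sketched roughly as below. This is only an illustration under assumptions: the class EndpointSamplingRule and its methods start and trySample are hypothetical names, not existing SkyWalking agent code, and the caller passes the clock value in so the rule is easy to test. The trace number and time window are constructor parameters, since the thread asks for both to be configurable.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not SkyWalking code: allow at most maxTraces
 * sampled traces per endpoint within windowMillis after a task starts.
 */
class EndpointSamplingRule {
    private final int maxTraces;
    private final long windowMillis;
    // endpoint name -> {task start time in millis, traces sampled so far}
    private final Map<String, long[]> tasks = new HashMap<>();

    EndpointSamplingRule(int maxTraces, long windowMillis) {
        this.maxTraces = maxTraces;
        this.windowMillis = windowMillis;
    }

    /** Called when a monitoring task is activated (e.g. from the UI). */
    synchronized void start(String endpointName, long nowMillis) {
        tasks.put(endpointName, new long[] {nowMillis, 0L});
    }

    /** Returns true if the trace starting now should be thread-sampled. */
    synchronized boolean trySample(String endpointName, long nowMillis) {
        long[] state = tasks.get(endpointName);
        if (state == null) {
            return false; // no active task for this endpoint
        }
        if (nowMillis - state[0] >= windowMillis) {
            tasks.remove(endpointName); // time window is over, drop the task
            return false;
        }
        if (state[1] >= maxTraces) {
            return false; // trace quota for this window is used up
        }
        state[1]++;
        return true;
    }
}
```

With new EndpointSamplingRule(5, 5 * 60 * 1000L), an agent would call start when the task arrives and trySample when the first span of a trace is created, which keeps the per-trace cost to one map lookup.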
