Re: A proposal for Skywalking(thread monitor)

han liu Fri, 13 Dec 2019 06:41:21 -0800

I see what you mean. I think this is a good feature, and I will summarize
this into the design doc.


If you have any good suggestions, please let me know.

Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 10:24 PM:

> Hi Han Liu
>
> One more reminder: in theory, one trace id in one instance could have
> multiple threads being sampled, for example in cross-thread scenarios. We
> should also set a threshold for this. Maybe a maximum of 3 threads per trace id?
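
A minimal Java sketch of such a per-trace threshold. The class `TraceThreadLimiter` and its method names are hypothetical and are not SkyWalking APIs; the limit of 3 is only the example value suggested in the message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: caps how many threads may be sampled for one trace id. */
final class TraceThreadLimiter {
    private final String traceId;
    private final int maxThreads;
    private final List<Long> threadIds = new ArrayList<>();

    TraceThreadLimiter(String traceId, int maxThreads) {
        this.traceId = traceId;
        this.maxThreads = maxThreads;
    }

    String traceId() {
        return traceId;
    }

    /** Returns true if the thread is accepted for sampling or was accepted earlier. */
    synchronized boolean tryAdd(long threadId) {
        if (threadIds.contains(threadId)) {
            return true;
        }
        if (threadIds.size() >= maxThreads) {
            return false; // threshold reached, do not sample further threads
        }
        threadIds.add(threadId);
        return true;
    }

    synchronized int size() {
        return threadIds.size();
    }
}
```

With a limit of 3, a fourth thread that joins the same trace would simply not be sampled, while the three threads already accepted keep being sampled.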
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
>
> Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 1:03 PM:
>
> > Hi Han Liu and everyone
> >
> > I have submitted a design draft to the doc. Please take a look; if you
> > have an issue, please let me know. We could set up an online meeting too.
> >
> > Sheng Wu <[email protected]> wrote on Thu, Dec 12, 2019 at 8:49 PM:
> >
> >> Hi Han Liu
> >>
> >> I have replied in the design doc with the key points I consider most
> >> important. Let's discuss those. Once we are on the same page, we can move
> >> on to more details.
> >>
> >> Sheng Wu 吴晟
> >> Twitter, wusheng1108
> >>
> >>
> >> han liu <[email protected]> wrote on Thu, Dec 12, 2019 at 2:26 PM:
> >>
> >>> Because my previous mailbox had formatting issues, I have switched to a
> >>> new one.
> >>>
> >>> I have written up some of the features in the Google doc; your comments
> >>> and improvements are welcome. I will continue to fill in the remaining
> >>> functions in the document.
> >>> The document is the same one you sent me previously. To avoid
> >>> confusion, I'll post the link again here.
> >>>
> >>>
> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> >>>
> >>> Sheng Wu <[email protected]> wrote on Tue, Dec 10, 2019 at 10:46 AM:
> >>>
> >>> > 741550557 <[email protected]> wrote on Mon, Dec 9, 2019 at 9:42 PM:
> >>> >
> >>> > > Thanks for your reply; the issues you mentioned are very critical and
> >>> > > meaningful.
> >>> > > Here I will answer what you mentioned. Sorry, I'm not good at comment
> >>> > > mode, so I use different colors and a “ “ prefix for the Q&A.
> >>> > >
> >>> > >
> >>> > >  As we have already designed a limit mechanism at the backend and agent
> >>> > >  side (according to your design), and the number would not be big (10 most
> >>> > >  likely), we just need a list to store the trace-id(s)
> >>> > >
> >>> > >
> >>> > > If we just need a list to store the trace-id(s), how can I map them to
> >>> > > threads? I hope to use a map to quickly find thread info from a trace-id.
> >>> > > How can I get thread-stack information your way? Could you please
> >>> > > elaborate?
> >>> > >
> >>> >
> >>> > Why do you need to do that? You just keep a list of the thread ids that
> >>> > should be dumped, and remove a thread id from it when its trace id is
> >>> > finished.
> >>> > This is easy to do with a loop search in the list, right?
> >>> > The thread stacks are in the list too: they are stored in each element, and
> >>> > inside the element they are also kept in a list.
> >>> >
> >>> > I think you were planning to keep all the stacks in a single map? That will
> >>> > cause a very dangerous memory and GC risk.
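
One way to read the list-based approach described above, sketched in Java. All names here (`ProfiledTrace`, `ProfiledTraceList`) are hypothetical and not part of SkyWalking; the sketch assumes the list stays small (around 10 entries, as mentioned earlier in the thread), so a linear scan is cheap.

```java
import java.util.ArrayList;
import java.util.List;

/** One profiled trace: the traced thread plus the stack snapshots taken so far. */
final class ProfiledTrace {
    final String traceId;
    final Thread thread;
    final List<StackTraceElement[]> snapshots = new ArrayList<>();

    ProfiledTrace(String traceId, Thread thread) {
        this.traceId = traceId;
        this.thread = thread;
    }
}

/** Small bounded list that is searched linearly; no map is involved. */
final class ProfiledTraceList {
    private final int capacity;
    private final List<ProfiledTrace> traces = new ArrayList<>();

    ProfiledTraceList(int capacity) {
        this.capacity = capacity;
    }

    /** Registers a trace for dumping; returns false when the list is full. */
    synchronized boolean add(String traceId, Thread thread) {
        if (traces.size() >= capacity) {
            return false;
        }
        traces.add(new ProfiledTrace(traceId, thread));
        return true;
    }

    /** Loop search by trace id; returns null when the trace is not profiled. */
    synchronized ProfiledTrace find(String traceId) {
        for (ProfiledTrace trace : traces) {
            if (trace.traceId.equals(traceId)) {
                return trace;
            }
        }
        return null;
    }

    /** Removes the entry when the trace has finished. */
    synchronized boolean remove(String traceId) {
        for (int i = 0; i < traces.size(); i++) {
            if (traces.get(i).traceId.equals(traceId)) {
                traces.remove(i);
                return true;
            }
        }
        return false;
    }

    /** Takes one stack snapshot of every registered thread. */
    synchronized void dumpAll() {
        for (ProfiledTrace trace : traces) {
            trace.snapshots.add(trace.thread.getStackTrace());
        }
    }
}
```

Each element owns its own snapshot list, so finishing a trace drops that trace's stacks in one step instead of leaving them in a shared structure.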
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  Could you explain (2)? What do you mean by `stop`? I think your
> >>> > >  sampling mechanism should include the sampling duration.
> >>> > >
> >>> > >
> >>> > > As far as the communication between the sniffer and the OAP server is
> >>> > > concerned, I hope the sniffer only needs to fetch the thread-monitor tasks
> >>> > > that currently need to be monitored. Termination can be triggered by
> >>> > > either the sniffer or the OAP server.
> >>> > > If it relies only on an OAP server notification, it may be more
> >>> > > complicated, because the OAP server needs to record that the sniffer has
> >>> > > received the current command, and the sniffer is not stable. For example,
> >>> > > if the sniffer has shut down when the command arrives, no thread
> >>> > > information will have been collected.
> >>> > >
> >>> > >
> >>> > > I think that letting the OAP server actively calculate the termination
> >>> > > can make the monitoring more controllable; of course, the client can also
> >>> > > actively report the end.
> >>> > > I think it's necessary to provide a protection mechanism for the sniffer
> >>> > > side, so that monitoring can be released quickly during a business peak
> >>> > > period or when the probe suddenly occupies a lot of CPU / memory
> >>> > > resources. Therefore, a stop-monitoring function can be provided in the
> >>> > > UI so that the sniffer can recover.
> >>> > > A sampling duration is required, but only as the default termination
> >>> > > condition for thread-monitor.
> >>> > >
> >>> >
> >>> > But you should know that, in the real case, the thread dump monitor is a
> >>> > sampling mechanism, so it is hard to even know where the dumps are happening.
> >>> > Then you have to send the stop notification to every instance.
> >>> > Even if you could send the notification, could you explain how you would
> >>> > know when to stop?
> >>> > The scenario is: you are facing an issue that traces and metrics can't
> >>> > explain, so you activate thread dump, right? And at the same time, you
> >>> > want to stop?
> >>> >
> >>> > CPU and memory resources should be guaranteed at the design level, such as
> >>> > 1. Limited thread dump tasks for one service.
> >>> > 2. Limited thread dump traces in a certain time window.
> >>> > For example, the OAP backend/UI would say that you could only
> >>> > 1. Set 3 thread dump commands in the same time window.
> >>> > 2. Every command requires the number of traces sampled for thread dump to
> >>> > be less than 5. At the same time, in order to make this sampling work,
> >>> > only activate thread dump sampling after the trace has executed for more
> >>> > than 200ms (the value is an example only).
> >>> > 3. Thread dumps could be sent to the backend during sampling to reduce the
> >>> > memory cache.
> >>> > 4. The thread dump period should not be less than 5ms; 20ms is recommended.
> >>> > 5. How deep the thread dump should go.
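
The example limits listed above could be checked with something like the following Java sketch. `ThreadDumpLimits` is a hypothetical class, not a SkyWalking API, and the constants are only the example values from the message; no concrete stack depth is given there, so the sketch merely requires a positive depth.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical guard that validates a thread dump command against fixed limits. */
final class ThreadDumpLimits {
    static final int MAX_COMMANDS_PER_WINDOW = 3;    // example value from the thread
    static final int TRACE_COUNT_UPPER_BOUND = 5;    // trace count must be less than this
    static final int MIN_DUMP_PERIOD_MS = 5;         // 20ms is the recommended period

    private ThreadDumpLimits() {
    }

    /** Returns the list of violated limits; an empty list means the command is allowed. */
    static List<String> violations(int activeCommands, int traceCount,
                                   int dumpPeriodMs, int maxDepth) {
        List<String> problems = new ArrayList<>();
        if (activeCommands >= MAX_COMMANDS_PER_WINDOW) {
            problems.add("too many thread dump commands in this time window");
        }
        if (traceCount < 1 || traceCount >= TRACE_COUNT_UPPER_BOUND) {
            problems.add("trace count must be between 1 and 4");
        }
        if (dumpPeriodMs < MIN_DUMP_PERIOD_MS) {
            problems.add("dump period must be at least 5ms");
        }
        if (maxDepth < 1) {
            problems.add("stack depth must be positive");
        }
        return problems;
    }

    /** Dumping starts only after the trace has run longer than the configured minimum. */
    static boolean shouldStartDump(long traceElapsedMs, long minDurationMs) {
        return traceElapsedMs > minDurationMs;
    }
}
```

Rejecting a command when it is created, whether on the backend or in the agent, keeps the safety of the agent independent of anyone pressing a stop button later.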
> >>> >
> >>> > We need a very detailed design. The above are just my thoughts to share
> >>> > the idea: the safety of the agent should not depend on a UI button.
> >>> > Otherwise, your online system will be in great danger, which is not the
> >>> > design goal of SkyWalking.
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  The sampling period depends on how you are going to visualize it.
> >>> > >
> >>> > >
> >>> > > Yes, I agree. I hope we can provide a select/input so that the trace count
> >>> > > and time window are configurable in the UI. Of course, this is my current
> >>> > > idea, and if there are other plans, I will adopt them.
> >>> > >
> >>> > >
> >>> > >  I highly doubt this. It may reduce the memory, but only if the code is
> >>> > >  running a loop or facing a lock issue. If it is neither of these two, the
> >>> > >  stacks are different.
> >>> > >  Also, please consider the CPU cost of comparing the stacks. You need a
> >>> > >  performance benchmark to verify this if you want it.
> >>> > >
> >>> > >
> >>> > > I didn't understand the first sentence. In my personal experience, most
> >>> > > cases are either blocked on a lock (socket/local) or running a loop. I
> >>> > > cannot think of any other cases.
> >>> > > For the second sentence, I think I can add a thread-stack-element field
> >>> > > to store the top-level element of the last stack information. The next
> >>> > > time I get stack information, I can check whether the current top-level
> >>> > > element is the same as that field.
> >>> > > I do this mainly to keep duplicate thread-stack information from taking
> >>> > > up too much memory; it is a way of optimizing memory usage. I can
> >>> > > consider removing it, or do you have a better memory-saving solution?
> >>> > > After all, memory and CPU resources are very valuable in the sniffer.
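
The top-element comparison proposed above might look roughly like this in Java. `StackSample` and `DedupingStackRecorder` are hypothetical names; the sketch compares only the top frame, which is cheap but will also merge two different stacks that happen to share the same top frame, so it is an illustration of the idea rather than a verified design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** A stored stack plus the number of consecutive samples that matched it. */
final class StackSample {
    final StackTraceElement[] stack;
    int repeatCount = 1;

    StackSample(StackTraceElement[] stack) {
        this.stack = stack;
    }
}

/** Stores a new sample only when the top frame differs from the previous one. */
final class DedupingStackRecorder {
    private final List<StackSample> samples = new ArrayList<>();
    private StackTraceElement lastTop;

    void record(StackTraceElement[] stack) {
        StackTraceElement top = stack.length > 0 ? stack[0] : null;
        if (!samples.isEmpty() && Objects.equals(top, lastTop)) {
            // Same top frame as last time: count it instead of storing the stack again.
            samples.get(samples.size() - 1).repeatCount++;
            return;
        }
        samples.add(new StackSample(stack));
        lastTop = top;
    }

    List<StackSample> samples() {
        return samples;
    }
}
```

Comparing a single frame avoids the cost of a full stack comparison, at the price of precision; whether that tradeoff is acceptable would still need the benchmark asked for in the thread.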
> >>> > >
> >>> >
> >>> > I know you mean to reduce the memory, but have you considered how much CPU
> >>> > a full thread dump comparison will cost? A thread dump could
> >>> > easily be hundreds of lines in Java.
> >>> > I mean this is a tradeoff between CPU and memory. If you only have limited
> >>> > memory, you could send the snapshots to the backend while collecting new
> >>> > ones, or even save them to disk (if really necessary).
> >>> > In my experience, compression is always very high risk in the agent. If you
> >>> > want to do that, you need a benchmark test to prove that the CPU cost
> >>> > is small enough.
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  The trace number and time window should be configurable; that is what I
> >>> > >  mean by more complex. In the current SamplingService, it is only n traces
> >>> > >  per 3 seconds. But here, it is a dynamic rule.
> >>> > >
> >>> > >
> >>> > > I expect that a specific trace count and time window can be configured at
> >>> > > the UI level, as I said above.
> >>> > > For SamplingService, my previous tech design was not rigorous enough, and
> >>> > > there were indeed problems.
> >>> > > Maybe we need to add a new SamplingService that builds a mapping based on
> >>> > > endpoint-id and AtomicInteger.
> >>> > > For `first 5 traces of this endpoint in the next 5 mins`, we just need to
> >>> > > increment it.
> >>> > > For sampling, maybe use another scheduled task to reset the AtomicInteger
> >>> > > value.
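
A rule such as `first 5 traces of this endpoint in the next 5 mins` can be sketched in Java with a single AtomicInteger per rule. `EndpointSamplingRule` is a hypothetical class, not the existing SamplingService; the sketch assumes one rule object per endpoint and leaves out the endpoint lookup and any scheduled reset.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical rule: sample at most maxTraces traces before the window ends. */
final class EndpointSamplingRule {
    private final int maxTraces;
    private final long windowEndMillis;
    private final AtomicInteger sampled = new AtomicInteger();

    EndpointSamplingRule(int maxTraces, long startMillis, long windowMillis) {
        this.maxTraces = maxTraces;
        this.windowEndMillis = startMillis + windowMillis;
    }

    /** Returns true if the trace starting at nowMillis should be thread-dumped. */
    boolean trySample(long nowMillis) {
        if (nowMillis >= windowEndMillis) {
            return false; // the time window is over
        }
        int current;
        do {
            current = sampled.get();
            if (current >= maxTraces) {
                return false; // the trace quota is used up
            }
        } while (!sampled.compareAndSet(current, current + 1));
        return true;
    }
}
```

For example, `new EndpointSamplingRule(5, System.currentTimeMillis(), 5 * 60 * 1000L)` would accept the first five calls to `trySample` made within the next five minutes and reject the rest.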
> >>> > >
> >>> >
> >>> > You could avoid the map by using an ArrayList with RangeAtomicInteger
> >>> > (SkyWalking provides that) to let the trace context get its slot.
> >>> > Also, since you are considering `active sampling after trace execution time
> >>> > more than xxx ms`, you should add a removal mechanism at runtime.
> >>> > Anyway, try your best to avoid using a Map, especially since this map could
> >>> > be changed at runtime.
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  I think there should at least be a new level-one page, called the
> >>> > >  configuration or command page, which could set up multiple sampling rules
> >>> > >  and visualize the existing tasks and related sampling data.
> >>> > >
> >>> > >
> >>> > > I think it's necessary to add a new page for configuring thread-monitor
> >>> > > tasks, and I think the specific UI display should be designed in detail.
> >>> > > For example, what I expect is similar to the trace page. The left side
> >>> > > displays the configuration, and the right side quickly displays the
> >>> > > related trace list. When an item is clicked, it links to the trace page
> >>> > > and displays the sidebox.
> >>> > > I'm not good at this. Do you have any good plans?
> >>> > >
> >>> >
> >>> > UI is hard to discuss in text, so I am pretty sure we need a demo (it does
> >>> > not have to be code; I mean something drawn with a tool).
> >>> > It is OK to show a trace with thread dumps on another page, even better if
> >>> > it links to your task ID.
> >>> > But this kind of abstract description is hard to continue with; I mean it
> >>> > has no details.
> >>> >
> >>> >
> >>> >
> >>> > > And I feel that the two of us have a different understanding of the
> >>> > > configuration object. I think it is more of a task than a command. I
> >>> > > don't know which way is better.
> >>> > > I just thought of a problem. Some real problems are often triggered at a
> >>> > > specific period, such as a fixed business peak period, and we cannot
> >>> > > guarantee that the user will operate the UI at that time.
> >>> > > So should a task mechanism be adopted to ensure that monitoring can run
> >>> > > at a specific period?
> >>> > >
> >>> >
> >>> > This makes sense to me, and it is just an enhancement. It is just a
> >>> > sampling rule with a start time.
> >>> >
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  We don't have a separate thread monitor view table. How about we add an
> >>> > >  icon in the segment list, and add an icon on the first span of this
> >>> > >  segment in the trace detail view?
> >>> > >  I think the latter one should be an entrance to the thread view.
> >>> > >
> >>> > >
> >>> > > I think it's a good idea. The link I mentioned in one of the answers
> >>> > > above is, I think, also a convenient entry point.
> >>> > > The switch button I mentioned earlier is only a data filter in the trace
> >>> > > list query and does not need a separate table UI.
> >>> > >
> >>> >
> >>> > As you intend to have a separated page for thread sampling, it is OK
> to
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >  If you have some visualization idea, draw it with any tool you like that
> >>> > >  supports comments, and we could discuss it there. In my mind, we should
> >>> > >  support visualizing the thread dump stacks across the time window, and
> >>> > >  support aggregating them by choosing consecutive stack snapshots in the
> >>> > >  time window.
> >>> > >
> >>> > >
> >>> > > I think we should find someone who is better at front-end work to join
> >>> > > the discussion, because this depends on how the front-end UI can display it.
> >>> > > BTW: I can provide code for the OAP server and sniffer, but the frontend
> >>> > > may need separate help from the community. I hope that any front-end
> >>> > > friends can participate in the discussion.
> >>> > >
> >>> >
> >>> > Once you have the demo, I could loop our UI committers in for the UI side
> >>> > development. But UI committers may not be familiar with the thread dump
> >>> > context. We need to resolve that first.
> >>> > Let's start with a demo, such as some slides on Google Docs?
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > The above is my answer to all the questions, and I look forward to your
> >>> > > reply at any time. As the discussion goes on, the details are becoming
> >>> > > more and more complete. This is good.
> >>> > > Everyone is welcome to join the discussion. If you have any questions or
> >>> > > good ideas, please let me know.
> >>> > >
> >>> >
> >>> > I think we could move the discussion to the design doc as the next step.
> >>> >
> >>> > Please use this
> >>> >
> >>> >
> >>>
> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> >>> > Write the design including
> >>> > 1. Key features
> >>> > 2. Protocol
> >>> > 3. Work mechanism
> >>> > 4. UI design, prototype
> >>> > and anything else you think is important before writing code.
> >>> >
> >>> > This is the SkyWalking CLI design doc; you could use it as a reference.
> >>> >
> >>> >
> >>>
> https://docs.google.com/document/d/1WBnRNF0ABxaSdBZo6Gv2hMzCQzj04YAePUdOyLWHWew/edit#
> >>> >
> >>> >
> >>> > >
> >>> > >
> >>> > > Original message
> >>> > > From: Sheng [email protected]
> >>> > > To: [email protected]
> >>> > > Sent: Monday, Dec 9, 2019, 10:50
> >>> > > Subject: Re: A proposal for Skywalking(thread monitor)
> >>> > >
> >>> > >
> >>> > > Hi Thanks for writing this proposal with a detailed design. My
> >>> comments
> >>> > > are inline. 741550557 [email protected] wrote on Sun, Dec 8, 2019 at 11:22 PM:
> >>> Thanks
> >>> > > for your reply, I have carefully read these issues you mentioned,
> >>> and
> >>> > > these issues mentioned are very meaningful and critical. I will
> >>> give  you
> >>> > > technical details about the issues you mentioned below.  I find
> these
> >>> > > issues are related, so I will explain them in different
> dimensions.
> >>> > use
> >>> > > a different protocol to transmission trace and thread-stack:  1.
> add
> >>> a
> >>> > > boolean field in segment data, to record has thread monitored.  and
> >>> is
> >>> > good
> >>> > > for filter monitored trace in UI.  2. add new BootService, storage
> >>> Map to
> >>> > > record relate trace-id and  trace-stack information.  As we already
> >>> have
> >>> > > designed limit mechanism at backend and agent side(according to
> your
> >>> > > design), also the number would not be big(10 most likely), we just
> >>> need a
> >>> > > list to storage the trace-id(s)  3. listen
> >>> > > TracingContextListener#afterFinished if the current segment has
> >>> thread
> >>> > > monitored, mark current trace-id don’t need to monitor anymore.
> >>> (Cause
> >>> > if
> >>> > > for-each the step 2 map, the remove operation will fail and throw
> >>> > > exception).  4. when thread-monitor main thread running, It will
> >>> for-each
> >>> > > step 2 map  and check is it don’t need monitor anymore, I will put
> >>> data
> >>> > > into new data  carrier.  5. generate new thread-monitor gRPC
> >>> protocol to
> >>> > > send data from the data  carrier. The agent side design seems
> pretty
> >>> > good.
> >>> > >   the server receives thread-stack logic:  1. storage stack-stack
> >>> > > informations and trace-id/segment-id relations on a  different
> >>> table.  2.
> >>> > > check thread-monitor is need to be stop on receiving data or
> >>> schedule.
> >>> > > Could you explain the (2), what do you mean `stop`? I think if your
> >>> > > sampling mechanism should include the sampling duration.    reduce
> >>> CPU
> >>> > and
> >>> > > memory in sniffer:  1. through the configuration of thread
> >>> monitoring in
> >>> > > the UI, you can  configure the performance loss. For example, set
> the
> >>> > > monitoring level: fast  monitoring (100ms), medium speed monitoring
> >>> > > (500ms), slow speed monitoring  (1000ms).  The sampling period
> >>> depends on
> >>> > > how you are going to visualize it.  2. add new integer field on per
> >>> > > thread-stack, if current thread-stack last  element same as last
> >>> time,
> >>> > > don’t need storage, just increment it. I think  it will save a lot
> of
> >>> > > memory space. Highly doubt about this, reduce the memory, maybe,
> only
> >>> > > reduce if the codes are running the loop or facing lock issue. But
> >>> if it
> >>> > is
> >>> > > neither of these two, they are different. Also, please consider the
> >>> CPU
> >>> > > cost of the comparison of the stack. You need a performance
> >>> benchmark to
> >>> > > verify if you want this. 3. create new VM args to setting
> >>> thread-monitor
> >>> > > pool size, It dependence on  user, maybe default 3? (this can be
> >>> > discussed
> >>> > > later)  I think UI limit is enough. 3 seems good to me.  4. limit
> >>> > > thread-stack-element size to 100, I think it can resolve most of
> the
> >>> > > scenes already. It also can create a new VM args if need.
> multiple
> >>> > > sampling methods can choose :(just my current thoughts, can add
> >>> more)
> >>> > 1.
> >>> > > base on current client SamplingServcie, extra a new factor holder
> to
> >>> > > increment, and reset on schedule.  Yours may be a little more
> complex
> >>> > than
> >>> > > the current SamplingServcie, right? Based on the next rule. 2.
> >>> `first 5
> >>> > > traces of this endpoint in the next 5 mins`, it a good idea. My
> >>> > > understanding is that within a few minutes, each instance can send
> a
> >>> > > specified number of traces.  The trace number and time window
> should
> >>> be
> >>> > > configurable, that is I mean more complex. Inthe current
> >>> SamplingServcie,
> >>> > > only n traces per 3 seconds. But here, it is a dynamic rule.    UI
> >>> > settings
> >>> > > and sniffer perception:  1. create a new button on the dashboard
> >>> page, It
> >>> > > can create or stop a  thread-monitor. It can be dynamic load
> >>> > thread-monitor
> >>> > > status when  reselecting endpoint.  I think at least should be a
> >>> level
> >>> > one
> >>> > > new page called configuration or command page, which could set up
> the
> >>> > > multiple sampling rule and visualize the existing tasks and related
> >>> > > sampling data.  2. sniffer creates a new scheduled task to check
> the
> >>> > > current service has  need monitor endpoint each 5 seconds. (I see
> >>> current
> >>> > > sniffer has command  functions, feel that principle is the same as
> >>> the
> >>> > > scheduler)  Seems reasonable.   thread-monitor on the UI:(That’s
> >>> just my
> >>> > > initial thoughts, I think there  will have a better way to show)
> 1.
> >>> When
> >>> > > switch to the trace page, I think we need to add a new switch
> >>> button to
> >>> > > filter thread-monitor trace.  2. make a new thread-monitor icon on
> >>> the
> >>> > same
> >>> > > segment. It means it has  thread-stack information.  We don't have
> >>> > > separated thread monitor view table, how about we add an icon at
> the
> >>> > > segment list, and add icon at the first span of this segment in
> trace
> >>> > > detail view? I think the latter one should be an entrance of the
> >>> thread
> >>> > > view. 3. show on the information sidebox when the user clicks the
> >>> > > thread-monitor  segment(any span). create a new tab, like the log
> >>> tab.
> >>> > If
> >>> > > you have some visualization idea, drawn by any tool you like
> >>> supporting
> >>> > > comment, we could discuss it there. In my mind, we should support
> >>> > visualize
> >>> > > the thread dump stack through the time windows, and support
> aggregate
> >>> > them
> >>> > > by choosing the continued stack snapshots on the time window.
> >>>  They're
> >>> > > just a description of my current implementation details for
> >>> > thread-monitor
> >>> > > if these seem to work. I can do some time planning for these
> tasks.
> >>> > Sorry,
> >>> > > my English is not very well, hope you can understand. Maybe  these
> >>> seem
> >>> > to
> >>> > > have some problem, any good idea or suggestion are welcome.  Very
> >>> > > appreciated you to lead this new direction. It is a long term task
> >>> but
> >>> > > should be interesting. :) Good work, carry on.      Original message  From: Sheng
> >>> > > [email protected]  To: [email protected]
> >>> > > Sent: Sunday, Dec 8, 2019, 08:31  Subject: Re: A proposal for Skywalking(thread
> >>> > > monitor)    First of all, thanks for your proposal. Thread
> >>> monitoring is
> >>> > > super  important for application performance. So basically, I agree
> >>> with
> >>> > > this  proposal. But for tech details, I think we need more
> >>> discussion in
> >>> > > the  following ways 1. Do you want to add thread status to the
> >>> trace? If
> >>> > > so, why  don't consider this as a UI level join? Because we could
> >>> know
> >>> > > thread id in  the trace when we create a span, right? Then we have
> >>> all
> >>> > the
> >>> > > thread  dump(if), we could ask UI to query specific thread context
> >>> based
> >>> > > on  timestamp and thread number(s). 2. For thread dump, I don't
> know
> >>> > > whether  you do the performance evaluation for this OP. From my
> >>> > > experiences, `get  all need thread monitor segment every 100
> >>> > milliseconds`
> >>> > > is a very high cost  in your application and agent. So, you may
> need
> >>> to
> >>> > be
> >>> > > careful about doing  this. 3. Endpoint related thread dump with
> some
> >>> > > sampling mechanisms makes  more sense to me. And this should be
> >>> activated
> >>> > > by UI. We should only  provide a conditional thread dump sampling
> >>> > > mechanism, such as `first 5  traces of this endpoint in the next 5
> >>> mins`.
> >>> > > Jian Tan I think DaoCloud also  has customized this feature in your
> >>> > > internal SkyWalking. Could you share  what you do? Sheng Wu 吴晟
> >>> > > Twitter, wusheng1108
> >>> > >
> >>> > > 741550557 [email protected] wrote on Sun, Dec 8, 2019 at 12:14 AM:
> >>> > >
> >>> > > Hello everyone, I would like to share a new feature for SkyWalking,
> >>> > > called “thread monitor”.
> >>> > >
> >>> > > Background
> >>> > > When our company started using SkyWalking for APM, we found that many
> >>> > > traces did not have enough spans to fill them up, and we suspected either
> >>> > > third-party frameworks that we didn't enhance or API usage errors by
> >>> > > programmers, such as a Java CountDown number of 3 with only 2 countdowns.
> >>> > > So we decided to write a new feature that monitors the thread stack of an
> >>> > > executing trace, so that we can get more information about the trace and
> >>> > > quickly know what's happening on it.
> >>> > >
> >>> > > Structure
> >>> > > Agent(thread monitor) — gRPC protocol — OAP Server(Storage) —
> >>> > > Skywalking-Rocketbot-UI
> >>> > >
> >>> > > More detail
> >>> > > OAP Server:
> >>> > > 1. Store which traces need to be monitored (I suggest storing this on the
> >>> > > endpoint, adding a new boolean field named needThreadMonitor).
> >>> > > 2. Provide a GraphQL API to change the endpoint monitor status.
> >>> > > 3. When parsing a monitored trace, store the thread stack if the segment
> >>> > > has any thread info.
> >>> > >
> >>> > > Skywalking-Rocketbot-UI:
> >>> > > 1. Add a new switch button on the dashboard; it can read or modify the
> >>> > > endpoint status.
> >>> > > 2. It will show every thread stack when a trace detail is clicked.
> >>> > >
> >>> > > Agent:
> >>> > > 1. Set up two new BootService instances: 1) find any endpoint in the
> >>> > > current service that needs thread monitoring, started as a new scheduled
> >>> > > task that runs every minute; 2) start a new scheduled task that collects
> >>> > > all segments needing thread monitoring every 100 milliseconds and puts a
> >>> > > new thread dump task into a global thread pool (fixed, count number
> >>> > > default 3).
> >>> > > 2. Check whether the endpoint needs thread monitoring when creating an
> >>> > > entry/local span (TracingContext#createEntry/LocalSpan). If it does, it
> >>> > > will be marked and put into the thread monitor map.
> >>> > > 3. When the TracingContext finishes, it will get the threads that have
> >>> > > been monitored and send all thread stacks to the server.
> >>> > >
> >>> > > Finally, I don't know whether it is a good idea to get more information
> >>> > > on a trace this way. If you have any good ideas or suggestions on this,
> >>> > > please let me know.
> >>> > >
> >>> > > Mrpro
> >>> >
> >>>
> >> --
> > Sheng Wu 吴晟
> >
> > Apache SkyWalking
> > Apache Incubator
> > Apache ShardingSphere, ECharts, DolphinScheduler podlings
> > Zipkin
> > Twitter, wusheng1108
> >
>
