Okay, I have addressed these two issues.

1. New field called maxThreadThreshold.
    Added a new input box to the "add-task-modal" tab in the prototype.
    Added a new condition limiting the size of this field to the doc chapter
"Conditions that can create thread monitoring tasks".
    Added a description of this field to the doc chapter "Thread monitoring
table structures".

2. Changed endpoint id to endpoint name.
    This field was updated in the doc chapter "Thread monitoring table
structures".
    Updated the input box in the "add-task-modal" tab in the prototype.

I will continue to work through the other issues in the comments.
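For context, a minimal sketch of the kind of per-trace thread cap the new maxThreadThreshold field implies, following the "max 3 threads for one trace id" suggestion later in this thread. All names here are my own illustration, not the actual SkyWalking code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadThresholdSketch {
    // Example cap, echoing "max 3 threads for one trace id, maybe?"
    static final int MAX_THREAD_THRESHOLD = 3;
    static final Map<String, AtomicInteger> THREADS_PER_TRACE = new ConcurrentHashMap<>();

    // Returns true if this thread may join sampling for the given trace id.
    static boolean tryRegisterThread(String traceId) {
        AtomicInteger count =
                THREADS_PER_TRACE.computeIfAbsent(traceId, id -> new AtomicInteger());
        if (count.incrementAndGet() > MAX_THREAD_THRESHOLD) {
            count.decrementAndGet(); // over the cap, roll back
            return false;
        }
        return true;
    }

    // Called when the trace finishes, so the entry does not leak.
    static void releaseTrace(String traceId) {
        THREADS_PER_TRACE.remove(traceId);
    }
}
```

The roll-back on overflow keeps the check race-free across threads without locking.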


Sheng Wu <[email protected]> 于2019年12月14日周六 上午9:46写道:

> han liu <[email protected]> wrote on Friday, December 13, 2019 at 10:40 PM:
>
> > I see what you mean. I think this is a good feature, and I will summarize
> > it in the design doc.
> >
> > If you have any good suggestions, please let me know.
> >
>
> A simple way should be enough: set up the sampling rule based on the first
> span's operation name, rather than the endpoint id.
> In this case, plus #4056[1], there will be no id for a local span or exit
> span, but those two are used as the first span in the async scenario.
>
> [1] https://github.com/apache/skywalking/issues/4056
>
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
>
> >
> > Sheng Wu <[email protected]> wrote on Friday, December 13, 2019 at 10:24 PM:
> >
> > > Hi Han Liu
> > >
> > > One more reminder: a trace id in one instance could in theory have
> > > multiple threads being sampled, such as in cross-thread scenarios. We
> > > should also set a threshold for this. Max 3 threads for one trace id,
> > > maybe?
> > >
> > > Sheng Wu 吴晟
> > > Twitter, wusheng1108
> > >
> > >
> > > Sheng Wu <[email protected]> wrote on Friday, December 13, 2019 at 1:03 PM:
> > >
> > > > Hi Han Liu and everyone,
> > > >
> > > > I have submitted a design draft to the doc. Please take a look; if you
> > > > have an issue, please let me know. We could set up an online meeting
> > > > too.
> > > >
> > > > Sheng Wu <[email protected]> wrote on Thursday, December 12, 2019 at 8:49 PM:
> > > >
> > > >> Hi Han Liu
> > > >>
> > > >> I have replied to the design with the most important key points I
> > > >> expect. Let's discuss those. After we are on the same page, we can
> > > >> continue with more details.
> > > >>
> > > >> Sheng Wu 吴晟
> > > >> Twitter, wusheng1108
> > > >>
> > > >>
> > > >> han liu <[email protected]> wrote on Thursday, December 12, 2019 at 2:26 PM:
> > > >>
> > > >>> Due to formatting issues with my previous mailbox, it has been
> > > >>> replaced with a new one.
> > > >>>
> > > >>> I have completed some of the features in the Google doc; please
> > > >>> provide your comments and improvements. I will continue to improve
> > > >>> the remaining functions in the documentation.
> > > >>> The documentation is the same one you previously sent me. To prevent
> > > >>> trouble, I'll post the link again here.
> > > >>>
> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> > > >>>
> > > >>> Sheng Wu <[email protected]> wrote on Tuesday, December 10, 2019 at 10:46 AM:
> > > >>>
> > > >>> > 741550557 <[email protected]> wrote on Monday, December 9, 2019 at 9:42 PM:
> > > >>> >
> > > >>> > > Thanks for your reply; the issues you mentioned are very
> > > >>> > > critical and meaningful.
> > > >>> > > I will answer them below. Sorry, I'm not good at comment mode,
> > > >>> > > so I use different colors and a quote prefix for the Q&A.
> > > >>> > >
> > > >>> > >
> > > >>> > >  As we already have a designed limit mechanism at the backend
> > > >>> > >  and agent side (according to your design), and the number would
> > > >>> > >  not be big (10 most likely), we just need a list to store the
> > > >>> > >  trace-id(s).
> > > >>> > >
> > > >>> > >
> > > >>> > > If we just need a list to store trace-id(s), how can I map to
> > > >>> > > the thread? I hoped to use the map to quickly find thread info
> > > >>> > > from a trace-id. How can I get thread-stack information your
> > > >>> > > way? Could you please elaborate?
> > > >>> > >
> > > >>> >
> > > >>> > Why do you need to do that? You just save a list of thread ids
> > > >>> > that should do a thread dump, or remove a thread id from it when
> > > >>> > the trace id is finished.
> > > >>> > This is easy to do with a loop search over the list, right?
> > > >>> > Thread-stacks are in the list; they are stored in an element, and
> > > >>> > they are in a list too.
> > > >>> >
> > > >>> > I think you were thinking of keeping all the stacks in a single
> > > >>> > map? That would cause a very dangerous memory and GC risk.
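A sketch of the list-based bookkeeping suggested above (illustrative names only): each monitored trace id owns one element holding its own list of stack snapshots, and removal on trace finish is a plain linear scan, which is cheap at the ~10-entry limit mentioned earlier.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MonitorListSketch {
    static class MonitoredTrace {
        final String traceId;
        final List<String> snapshots = new ArrayList<>(); // thread-stack dumps

        MonitoredTrace(String traceId) { this.traceId = traceId; }
    }

    // At most ~10 entries per the limit mechanism, so loop search is fine.
    final List<MonitoredTrace> monitored = new ArrayList<>();

    void addSnapshot(String traceId, String dump) {
        for (MonitoredTrace t : monitored) {
            if (t.traceId.equals(traceId)) {
                t.snapshots.add(dump);
                return;
            }
        }
    }

    // Remove the whole element when the trace id finishes.
    void finish(String traceId) {
        for (Iterator<MonitoredTrace> it = monitored.iterator(); it.hasNext(); ) {
            if (it.next().traceId.equals(traceId)) {
                it.remove();
                return;
            }
        }
    }
}
```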
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  Could you explain (2)? What do you mean by `stop`? I think
> > > >>> > >  your sampling mechanism should include the sampling duration.
> > > >>> > >
> > > >>> > >
> > > >>> > > As far as the communication between the sniffer and the OAP
> > > >>> > > server goes, I hope the sniffer only needs to obtain the
> > > >>> > > thread-monitor tasks that need to be monitored at that time.
> > > >>> > > The termination condition can be triggered by either the
> > > >>> > > sniffer or the OAP server.
> > > >>> > > If it's only an OAP server notification, it may be more
> > > >>> > > complicated, because the OAP server needs to record that the
> > > >>> > > sniffer has received the current command, and the sniffer is
> > > >>> > > not stable; for example, the sniffer may have shut down when
> > > >>> > > receiving the command, and at that point no thread information
> > > >>> > > has been collected.
> > > >>> > >
> > > >>> > > I think active termination computed by the OAP server can make
> > > >>> > > the monitoring more controllable; of course, the client can
> > > >>> > > also actively report the end.
> > > >>> > > I think it's necessary to provide a protection mechanism on the
> > > >>> > > sniffer side, so it can back off quickly during a business peak
> > > >>> > > period or when the probe suddenly occupies a lot of CPU/memory
> > > >>> > > resources. Therefore, a function to stop monitoring can be
> > > >>> > > provided in the UI, so that the sniffer can recover.
> > > >>> > > A sampling duration is required, but only as a default
> > > >>> > > thread-monitor termination condition.
> > > >>> > >
> > > >>> >
> > > >>> > But you should know that, in a real case, the thread dump monitor
> > > >>> > is a sampling mechanism; it is hard to even know where the dumps
> > > >>> > are happening. Then you would have to send the stop notification
> > > >>> > to every instance.
> > > >>> > Even if you could send the notification, could you explain how
> > > >>> > you would know when to stop?
> > > >>> > The scenario is: you are facing an issue which trace and metrics
> > > >>> > can't explain, so you activate the thread dump, right? And at the
> > > >>> > same time, you want to stop it?
> > > >>> >
> > > >>> > CPU and memory resources should be guaranteed at the design
> > > >>> > level, such as:
> > > >>> > 1. Limited thread dump tasks for one service.
> > > >>> > 2. Limited thread dump traces in a certain time window.
> > > >>> > For example, the OAP backend/UI would only let you:
> > > >>> > 1. Set 3 thread dump commands in the same time window.
> > > >>> > 2. Every command requires the sampled thread dump number to be
> > > >>> > less than 5 traces. At the same time, in order to make this
> > > >>> > sampling work, only activate the sampling thread dump after the
> > > >>> > trace has executed for more than 200ms (value is an example only).
> > > >>> > 3. Thread dumps could be sent to the backend during sampling to
> > > >>> > reduce the memory cache.
> > > >>> > 4. The thread dump period should not be less than 5ms; 20ms is
> > > >>> > recommended.
> > > >>> > 5. How deep the thread dump should go.
> > > >>> >
> > > >>> > We need a very detailed design; the above are just my thoughts,
> > > >>> > shared to convey the idea. The safety of the agent should not
> > > >>> > depend on a UI button. Otherwise, your online system will be very
> > > >>> > dangerous, which is not the design goal of SkyWalking.
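The design-level guarantees listed above could be expressed as a validated task configuration. A rough sketch follows; all names are illustrative, and the limits use the example numbers from the list (3 commands, 5 traces, 200ms trigger, 5ms period floor):

```java
public class DumpTaskLimits {
    static final int MAX_CONCURRENT_COMMANDS = 3;    // commands per time window
    static final int MAX_SAMPLED_TRACES = 5;         // traces per command
    static final long MIN_TRIGGER_DURATION_MS = 200; // only dump slow traces
    static final long MIN_DUMP_PERIOD_MS = 5;        // hard floor for dump period

    // Returns null when the task is accepted, else a rejection reason.
    static String validate(int activeCommands, int sampledTraces,
                           long triggerAfterMs, long dumpPeriodMs) {
        if (activeCommands >= MAX_CONCURRENT_COMMANDS) {
            return "too many commands in this time window";
        }
        if (sampledTraces > MAX_SAMPLED_TRACES) {
            return "sampled trace count exceeds limit";
        }
        if (triggerAfterMs < MIN_TRIGGER_DURATION_MS) {
            return "trigger duration too short for sampling to be useful";
        }
        if (dumpPeriodMs < MIN_DUMP_PERIOD_MS) {
            return "dump period below the 5ms floor";
        }
        return null; // accepted
    }
}
```

Enforcing this at the backend, rather than trusting a UI stop button, is the point being made above.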
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  The sampling period depends on how you are going to visualize
> > it.
> > > >>> > >
> > > >>> > >
> > > >>> > > Yes, I agree. I hope we can provide a select/input so the trace
> > > >>> > > count and time window are configurable in the UI. Of course,
> > > >>> > > this is just my current idea, and if there are other plans, I
> > > >>> > > will adopt them.
> > > >>> > >
> > > >>> > >
> > > >>> > >  Highly doubt about this. It may reduce memory, but only if the
> > > >>> > >  code is running a loop or facing a lock issue; if it is
> > > >>> > >  neither of these two, the stacks are different.
> > > >>> > >  Also, please consider the CPU cost of comparing the stacks.
> > > >>> > >  You need a performance benchmark to verify this if you want it.
> > > >>> > >
> > > >>> > >
> > > >>> > > I didn't understand that first sentence. In my personal
> > > >>> > > experience, most cases are blocking on a lock (socket/local) or
> > > >>> > > a running loop; I have not imagined other cases.
> > > >>> > > For the second sentence, I think I can add a
> > > >>> > > thread-stack-element field to store the top-level element of
> > > >>> > > the last stack information. When getting the stack information
> > > >>> > > next time, I can compare whether the current top-level element
> > > >>> > > is the same as that field.
> > > >>> > > I do this mainly to prevent duplicate thread-stack information
> > > >>> > > from taking up too much memory; it is a way of optimizing
> > > >>> > > memory space. We can consider removing it, or do you have a
> > > >>> > > better memory-saving solution? After all, memory and CPU
> > > >>> > > resources are very valuable in the sniffer.
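The deduplication idea above (compare only the top stack element and bump a counter instead of storing a duplicate) could look roughly like this. Illustrative only; the CPU-vs-memory tradeoff raised in the reply still applies if the comparison were extended to full dumps:

```java
import java.util.ArrayList;
import java.util.List;

public class StackDedupSketch {
    static class Snapshot {
        final String topElement; // top frame of the dump
        int repeatCount = 1;     // how many consecutive dumps looked the same

        Snapshot(String topElement) { this.topElement = topElement; }
    }

    final List<Snapshot> snapshots = new ArrayList<>();

    void record(String topElement) {
        if (!snapshots.isEmpty()) {
            Snapshot last = snapshots.get(snapshots.size() - 1);
            if (last.topElement.equals(topElement)) {
                last.repeatCount++; // cheap: compares one frame, not the full dump
                return;
            }
        }
        snapshots.add(new Snapshot(topElement));
    }
}
```

Comparing a single frame is cheap, but it can also merge dumps that differ below the top frame, which is part of why a benchmark is asked for.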
> > > >>> > >
> > > >>> >
> > > >>> > I know you mean reducing memory, but do you consider how much CPU
> > > >>> > you would spend doing a full thread dump comparison? A thread
> > > >>> > dump could easily be hundreds of lines in Java.
> > > >>> > I mean, this is a tradeoff: CPU or memory. You could just use
> > > >>> > limited memory and send the snapshots to the backend while
> > > >>> > collecting new ones, or even save them to disk (if really
> > > >>> > necessary).
> > > >>> > In my experience, compression is always very high risk in the
> > > >>> > agent; if you want to do that, you need a benchmark test to prove
> > > >>> > that the CPU cost is small enough.
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  The trace number and time window should be configurable; that
> > > >>> > >  is what I mean by more complex. In the current SamplingService,
> > > >>> > >  it is only n traces per 3 seconds. But here, it is a dynamic
> > > >>> > >  rule.
> > > >>> > >
> > > >>> > >
> > > >>> > > I expect it can be configured at the UI level with a specific
> > > >>> > > trace count and time window, as I said above.
> > > >>> > > For SamplingService, my previous tech design was not rigorous
> > > >>> > > enough, and there were indeed problems.
> > > >>> > > Maybe we need to extend a new SamplingService, building a
> > > >>> > > mapping based on endpoint-id and AtomicInteger.
> > > >>> > > For `first 5 traces of this endpoint in the next 5 mins`, we
> > > >>> > > just need to increment it.
> > > >>> > > For sampling, maybe use another scheduled task to reset the
> > > >>> > > AtomicInteger value.
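The `first N traces of this endpoint in a time window` rule described above can be sketched with one counter per endpoint plus a scheduled reset. A rough illustration, not the actual SamplingService:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WindowedSamplerSketch {
    final int tracesPerWindow;
    final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    WindowedSamplerSketch(int tracesPerWindow) {
        this.tracesPerWindow = tracesPerWindow;
    }

    // Accept only the first N traces of this endpoint in the current window.
    boolean trySample(String endpoint) {
        AtomicInteger c = counters.computeIfAbsent(endpoint, e -> new AtomicInteger());
        // getAndIncrement keeps the check race-free across threads.
        return c.getAndIncrement() < tracesPerWindow;
    }

    // Would be invoked by a scheduled task every window (e.g. 5 minutes).
    void resetWindow() {
        counters.clear();
    }
}
```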
> > > >>> > >
> > > >>> >
> > > >>> > You could avoid the map by using an ArrayList with
> > > >>> > RangeAtomicInteger (SkyWalking provides that) to let the trace
> > > >>> > context get a slot.
> > > >>> > Also, since you are considering `activate sampling after trace
> > > >>> > execution time more than xxx ms`, you should add a removal
> > > >>> > mechanism at runtime.
> > > >>> > Anyway, try your best to avoid using a Map, especially a map
> > > >>> > that could be changed at runtime.
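The no-Map suggestion above can be sketched with a fixed slot array instead of a runtime-mutable map. This uses the JDK's AtomicIntegerArray rather than SkyWalking's RangeAtomicInteger, purely for illustration:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class SlotSamplerSketch {
    // A fixed array of sampling slots, so nothing resizes at runtime.
    final AtomicIntegerArray slots;
    final int perSlotLimit;

    SlotSamplerSketch(int slotCount, int perSlotLimit) {
        this.slots = new AtomicIntegerArray(slotCount);
        this.perSlotLimit = perSlotLimit;
    }

    // A trace context hashes to a slot and competes only on that counter.
    boolean trySample(String endpoint) {
        int slot = Math.floorMod(endpoint.hashCode(), slots.length());
        return slots.getAndIncrement(slot) < perSlotLimit;
    }
}
```

Nothing is allocated per trace, which is the point of preferring fixed structures over a mutable Map inside the agent.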
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  I think there should at least be a level-one new page called
> > > >>> > >  configuration or command page, which could set up the multiple
> > > >>> > >  sampling rules and visualize the existing tasks and related
> > > >>> > >  sampling data.
> > > >>> > >
> > > >>> > >
> > > >>> > > I agree it's necessary to add a new page to configure the
> > > >>> > > thread-monitor tasks; I think the specific UI display should be
> > > >>> > > designed in detail.
> > > >>> > > For example, what I expect is similar to the trace page: the
> > > >>> > > left side displays the configuration, and the right side
> > > >>> > > displays the related trace list. When clicked, it links to the
> > > >>> > > trace page and shows the sidebox display.
> > > >>> > > I'm not good at this. Do you have any good plans?
> > > >>> > >
> > > >>> >
> > > >>> > UI is hard to discuss by text, so I am pretty sure we need some
> > > >>> > demo (not the code; I mean drawn with a tool).
> > > >>> > It is OK to show a trace with thread dumps on another page, even
> > > >>> > better if it links to your task ID.
> > > >>> > But this kind of abstract description is hard to continue with;
> > > >>> > there are no details, I mean.
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > > And I feel that the two of us have a different understanding of
> > > >>> > > the configuration object. I think of it more as a task than a
> > > >>> > > command. I don't know which way is better?
> > > >>> > > I suddenly thought of a problem: some real problems are often
> > > >>> > > triggered in a specific period, such as a fixed business peak
> > > >>> > > period, and we cannot guarantee that the user will be operating
> > > >>> > > the UI at that time.
> > > >>> > > So should the task mechanism be adopted to ensure that it can
> > > >>> > > run in a specific period?
> > > >>> > >
> > > >>> >
> > > >>> > This makes sense to me, and it is just an enhancement feature. It
> > > >>> > is just a start-time sampling rule.
> > > >>> >
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  We don't have a separated thread monitor view table. How about
> > > >>> > >  we add an icon at the segment list, and add an icon at the
> > > >>> > >  first span of this segment in the trace detail view?
> > > >>> > >  I think the latter one should be an entrance to the thread
> > > >>> > >  view.
> > > >>> > >
> > > >>> > >
> > > >>> > > I think it's a good idea. The link I mentioned in one of the
> > > >>> > > answers above is, I think, also a convenient entry point.
> > > >>> > > The switch button I mentioned earlier is only a data filtering
> > > >>> > > item in the trace list query and does not need a separate
> > > >>> > > table UI.
> > > >>> > >
> > > >>> >
> > > >>> > As you intend to have a separated page for thread sampling, it is
> > OK
> > > to
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >  If you have a visualization idea, drawn with any tool you like
> > > >>> > >  that supports comments, we could discuss it there. In my mind,
> > > >>> > >  we should support visualizing the thread dump stack through
> > > >>> > >  the time window, and support aggregating the dumps by choosing
> > > >>> > >  the continued stack snapshots in the time window.
> > > >>> > >
> > > >>> > >
> > > >>> > > I think we should find a front-end developer who is better at
> > > >>> > > this to discuss together, because it depends on what the
> > > >>> > > front-end UI can display.
> > > >>> > > BTW: I can provide code for the OAP server and sniffer; the
> > > >>> > > frontend may need separate help from the community. I hope some
> > > >>> > > front-end friends can participate in the discussion.
> > > >>> > >
> > > >>> >
> > > >>> > Once you have the demo, I could loop our UI committers in for
> > > >>> > UI-side development. But the UI committers may not be familiar
> > > >>> > with the thread dump context story; we need to resolve that first.
> > > >>> > Let's start with a demo, such as some slides or a Google doc?
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > The above is my answer to all the questions, and I look forward
> > > >>> > > to your reply at any time. As more and more discussion takes
> > > >>> > > place, the details become more and more complete. This is good.
> > > >>> > > Everyone is welcome to discuss together; if you have any
> > > >>> > > questions or good ideas, please let me know.
> > > >>> > >
> > > >>> >
> > > >>> > I think we could move the discussion to the design doc as the
> > > >>> > next step.
> > > >>> >
> > > >>> > Please use this
> > > >>> >
> > > >>> >
> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> > > >>> > Write the design, including
> > > >>> > 1. Key features
> > > >>> > 2. Protocol
> > > >>> > 3. Work mechanism
> > > >>> > 4. UI design, prototype
> > > >>> > and anything else you think is important before writing code.
> > > >>> >
> > > >>> > This is the SkyWalking CLI design doc; you could use it as a
> > > >>> > reference.
> > > >>> >
> > > >>> >
> https://docs.google.com/document/d/1WBnRNF0ABxaSdBZo6Gv2hMzCQzj04YAePUdOyLWHWew/edit#
> > > >>> >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > > Original message
> > > >>> > > From: Sheng [email protected]
> > > >>> > > To: [email protected]
> > > >>> > > Sent: Monday, December 9, 2019, 10:50
> > > >>> > > Subject: Re: A proposal for Skywalking (thread monitor)
> > > >>> > >
> > > >>> > > Hi, thanks for writing this proposal with a detailed design. My
> > > >>> > > comments are inline.
> > > >>> > >
> > > >>> > > 741550557 <[email protected]> wrote on Sunday, December 8, 2019
> > > >>> > > at 11:22 PM:
> > > >>> > >
> > > >>> > > Thanks for your reply. I have carefully read the issues you
> > > >>> > > mentioned, and they are very meaningful and critical. I will
> > > >>> > > give you technical details about them below. I find these
> > > >>> > > issues are related, so I will explain them in different
> > > >>> > > dimensions.
> > > >>> > >
> > > >>> > > Use a different protocol to transmit trace and thread-stack:
> > > >>> > > 1. Add a boolean field in the segment data, to record that the
> > > >>> > > thread has been monitored; this is also good for filtering
> > > >>> > > monitored traces in the UI.
> > > >>> > > 2. Add a new BootService, storing a Map to relate trace-id and
> > > >>> > > thread-stack information.
> > > >>> > >
> > > >>> > > [Sheng] As we already have a designed limit mechanism at the
> > > >>> > > backend and agent side (according to your design), and the
> > > >>> > > number would not be big (10 most likely), we just need a list
> > > >>> > > to store the trace-id(s).
> > > >>> > >
> > > >>> > > 3. Listen to TracingContextListener#afterFinished; if the
> > > >>> > > current segment has been thread-monitored, mark the current
> > > >>> > > trace-id as not needing monitoring anymore. (Because if we
> > > >>> > > for-each the step 2 map, the remove operation would fail and
> > > >>> > > throw an exception.)
> > > >>> > > 4. When the thread-monitor main thread runs, it will for-each
> > > >>> > > the step 2 map, check whether each entry still needs
> > > >>> > > monitoring, and put the data into a new data carrier.
> > > >>> > > 5. Generate a new thread-monitor gRPC protocol to send data
> > > >>> > > from the data carrier.
> > > >>> > >
> > > >>> > > [Sheng] The agent side design seems pretty good.
> > > >>> > >
> > > >>> > > The server-side receive-thread-stack logic:
> > > >>> > > 1. Store thread-stack information and trace-id/segment-id
> > > >>> > > relations in a different table.
> > > >>> > > 2. Check whether the thread-monitor needs to stop, on receiving
> > > >>> > > data or on a schedule.
> > > >>> > >
> > > >>> > > [Sheng] Could you explain (2)? What do you mean by `stop`? I
> > > >>> > > think your sampling mechanism should include the sampling
> > > >>> > > duration.
> > > >>> > >
> > > >>> > > Reduce CPU and memory in the sniffer:
> > > >>> > > 1. Through the configuration of thread monitoring in the UI,
> > > >>> > > you can configure the performance loss. For example, set the
> > > >>> > > monitoring level: fast monitoring (100ms), medium speed
> > > >>> > > monitoring (500ms), slow speed monitoring (1000ms).
> > > >>> > >
> > > >>> > > [Sheng] The sampling period depends on how you are going to
> > > >>> > > visualize it.
> > > >>> > >
> > > >>> > > 2. Add a new integer field on each thread-stack; if the current
> > > >>> > > thread-stack's last element is the same as last time, don't
> > > >>> > > store it, just increment the counter. I think it will save a
> > > >>> > > lot of memory space.
> > > >>> > >
> > > >>> > > [Sheng] Highly doubt about this. It may reduce memory, but only
> > > >>> > > if the code is running a loop or facing a lock issue; if it is
> > > >>> > > neither of these two, the stacks are different. Also, please
> > > >>> > > consider the CPU cost of the stack comparison. You need a
> > > >>> > > performance benchmark to verify this if you want it.
> > > >>> > >
> > > >>> > > 3. Create a new VM arg to set the thread-monitor pool size; it
> > > >>> > > depends on the user, maybe default 3? (This can be discussed
> > > >>> > > later.)
> > > >>> > >
> > > >>> > > [Sheng] I think the UI limit is enough. 3 seems good to me.
> > > >>> > >
> > > >>> > > 4. Limit the thread-stack-element size to 100; I think that can
> > > >>> > > already cover most scenarios. A new VM arg can be created if
> > > >>> > > needed.
> > > >>> > >
> > > >>> > > Multiple sampling methods to choose from (just my current
> > > >>> > > thoughts; more can be added):
> > > >>> > > 1. Based on the current client SamplingService, add a new
> > > >>> > > factor holder to increment, and reset it on a schedule.
> > > >>> > >
> > > >>> > > [Sheng] Yours may be a little more complex than the current
> > > >>> > > SamplingService, right? Based on the next rule.
> > > >>> > >
> > > >>> > > 2. `first 5 traces of this endpoint in the next 5 mins` is a
> > > >>> > > good idea. My understanding is that within a few minutes, each
> > > >>> > > instance can send a specified number of traces.
> > > >>> > >
> > > >>> > > [Sheng] The trace number and time window should be
> > > >>> > > configurable; that is what I mean by more complex. In the
> > > >>> > > current SamplingService, it is only n traces per 3 seconds.
> > > >>> > > But here, it is a dynamic rule.
> > > >>> > >
> > > >>> > > UI settings and sniffer perception:
> > > >>> > > 1. Create a new button on the dashboard page that can create or
> > > >>> > > stop a thread-monitor. It can dynamically load the
> > > >>> > > thread-monitor status when reselecting the endpoint.
> > > >>> > >
> > > >>> > > [Sheng] I think there should at least be a level-one new page
> > > >>> > > called configuration or command page, which could set up the
> > > >>> > > multiple sampling rules and visualize the existing tasks and
> > > >>> > > related sampling data.
> > > >>> > >
> > > >>> > > 2. The sniffer creates a new scheduled task to check every 5
> > > >>> > > seconds whether the current service has an endpoint that needs
> > > >>> > > monitoring. (I see the current sniffer has command functions; I
> > > >>> > > feel the principle is the same as the scheduler.)
> > > >>> > >
> > > >>> > > [Sheng] Seems reasonable.
> > > >>> > >
> > > >>> > > Thread-monitor on the UI (just my initial thoughts; I think
> > > >>> > > there will be a better way to show it):
> > > >>> > > 1. When switching to the trace page, I think we need to add a
> > > >>> > > new switch button to filter thread-monitored traces.
> > > >>> > > 2. Add a new thread-monitor icon on the matching segment,
> > > >>> > > meaning it has thread-stack information.
> > > >>> > >
> > > >>> > > [Sheng] We don't have a separated thread monitor view table.
> > > >>> > > How about we add an icon at the segment list, and add an icon
> > > >>> > > at the first span of this segment in the trace detail view? I
> > > >>> > > think the latter one should be an entrance to the thread view.
> > > >>> > >
> > > >>> > > 3. Show it in the information sidebox when the user clicks the
> > > >>> > > thread-monitored segment (any span); create a new tab, like the
> > > >>> > > log tab.
> > > >>> > >
> > > >>> > > [Sheng] If you have a visualization idea, drawn with any tool
> > > >>> > > you like that supports comments, we could discuss it there. In
> > > >>> > > my mind, we should support visualizing the thread dump stack
> > > >>> > > through the time window, and support aggregating the dumps by
> > > >>> > > choosing the continued stack snapshots in the time window.
> > > >>> > >
> > > >>> > > These are just a description of my current implementation
> > > >>> > > details for the thread-monitor, if they seem workable. I can do
> > > >>> > > some time planning for these tasks. Sorry, my English is not
> > > >>> > > very good; I hope you can understand. These may have some
> > > >>> > > problems; any good ideas or suggestions are welcome.
> > > >>> > >
> > > >>> > > [Sheng] Very appreciated that you are leading this new
> > > >>> > > direction. It is a long-term task but should be interesting. :)
> > > >>> > > Good work, carry on.
> > > >>> > >
> > > >>> > > Original message
> > > >>> > > From: Sheng [email protected]
> > > >>> > > To: [email protected]
> > > >>> > > Sent: Sunday, December 8, 2019, 08:31
> > > >>> > > Subject: Re: A proposal for Skywalking (thread monitor)
> > > >>> > >
> > > >>> > > First of all, thanks for your proposal. Thread monitoring is
> > > >>> > > super important for application performance, so basically I
> > > >>> > > agree with this proposal. But for the tech details, I think we
> > > >>> > > need more discussion in the following ways:
> > > >>> > > 1. Do you want to add thread status to the trace? If so, why
> > > >>> > > not consider this as a UI-level join? Because we could know the
> > > >>> > > thread id in the trace when we create a span, right? Then, once
> > > >>> > > we have all the thread dumps (if any), we could ask the UI to
> > > >>> > > query the specific thread context based on timestamp and thread
> > > >>> > > number(s).
> > > >>> > > 2. For the thread dump, I don't know whether you did a
> > > >>> > > performance evaluation for this operation. From my experience,
> > > >>> > > `get all need-thread-monitor segments every 100 milliseconds`
> > > >>> > > is a very high cost in your application and agent. So you may
> > > >>> > > need to be careful about doing this.
> > > >>> > > 3. An endpoint-related thread dump with some sampling mechanism
> > > >>> > > makes more sense to me, and it should be activated from the UI.
> > > >>> > > We should only provide a conditional thread dump sampling
> > > >>> > > mechanism, such as `first 5 traces of this endpoint in the next
> > > >>> > > 5 mins`.
> > > >>> > > Jian Tan, I think DaoCloud has also customized this feature in
> > > >>> > > your internal SkyWalking. Could you share what you do?
> > > >>> > >
> > > >>> > > Sheng Wu 吴晟
> > > >>> > > Twitter, wusheng1108
> > > >>> > >
> > > >>> > > 741550557 <[email protected]> wrote on Sunday, December 8, 2019
> > > >>> > > at 12:14 AM:
> > > >>> > >
> > > >>> > > Hello everyone, I would like to share a new feature for
> > > >>> > > skywalking, called "thread monitor".
> > > >>> > >
> > > >>> > > Background: When our company used skywalking for APM earlier,
> > > >>> > > we found that many traces did not have enough spans to fill
> > > >>> > > them up, and we wondered whether there were third-party
> > > >>> > > frameworks that we didn't enhance, or programmer API usage
> > > >>> > > errors, such as a Java CountDown count of 3 with only 2
> > > >>> > > countdowns. So we decided to write a new feature to monitor the
> > > >>> > > executing trace's thread stack; then we can get more
> > > >>> > > information on the trace and quickly know what's happening in
> > > >>> > > it.
> > > >>> > >
> > > >>> > > Structure: Agent (thread monitor) — gRPC protocol — OAP Server
> > > >>> > > (Storage) — Skywalking-Rocketbot-UI
> > > >>> > >
> > > >>> > > More detail:
> > > >>> > > OAP Server:
> > > >>> > > 1. Store which traces need monitoring (I suggest storing it on
> > > >>> > > the endpoint; add a new boolean field named needThreadMonitor).
> > > >>> > > 2. Provide a GraphQL API to change the endpoint monitor status.
> > > >>> > > 3. During trace parsing, store the thread stack if the segment
> > > >>> > > has any thread info.
> > > >>> > > Skywalking-Rocketbot-UI:
> > > >>> > > 1. Add a new switch button on the dashboard that can read or
> > > >>> > > modify the endpoint status.
> > > >>> > > 2. Show every thread stack on clicking the trace detail.
> > > >>> > > Agent:
> > > >>> > > 1. Set up two new BootServices: 1) find any endpoint in the
> > > >>> > > current service that needs thread monitoring, on a new
> > > >>> > > scheduled task that runs each minute; 2) start a new scheduled
> > > >>> > > task to get all segments that need thread monitoring every 100
> > > >>> > > milliseconds, and put a new thread dump task into a global
> > > >>> > > thread pool (fixed, count default 3).
> > > >>> > > 2. Check whether the endpoint needs thread monitoring on
> > > >>> > > creating an entry/local span
> > > >>> > > (TracingContext#createEntry/LocalSpan). If so, mark it and put
> > > >>> > > it into the thread monitor map.
> > > >>> > > 3. When the TracingContext finishes, get the threads that have
> > > >>> > > been monitored, and send all thread stacks to the server.
> > > >>> > >
> > > >>> > > Finally, I don't know whether it is a good idea to get more
> > > >>> > > information on the trace. If you have any good ideas or
> > > >>> > > suggestions on this, please let me know.
> > > >>> > >
> > > >>> > > Mrpro
> > > >>> >
> > > >>>
> > > >> --
> > > > Sheng Wu 吴晟
> > > >
> > > > Apache SkyWalking
> > > > Apache Incubator
> > > > Apache ShardingSphere, ECharts, DolphinScheduler podlings
> > > > Zipkin
> > > > Twitter, wusheng1108
> > > >
> > >
> >
>
