Re: A proposal for Skywalking(thread monitor)

Sheng Wu Tue, 17 Dec 2019 18:54:08 -0800

Hi

I have finished the review. Please take a look. Most are good, just some
suggestions.
I think we are close to starting the separate code-level tasks.


Sheng Wu 吴晟
Twitter, wusheng1108


han liu <[email protected]> wrote on Sun, Dec 15, 2019 at 2:20 PM:

> I have answered and resolved all your questions in the doc.
>
> For your UI design proposal, I think this is cool, and I also modified the
> prototype based on my understanding of this UI.
> To make the prototype easier to view, here are the two prototype links:
>
> 1. add task modal: https://bwd5l7.axshare.com/#id=61ddhs&p=add-task-modal
> 2. thread monitor page: https://bwd5l7.axshare.com/#g=1&id=8sgf47&p=thread-monitor-page
>
>
> Sheng Wu <[email protected]> wrote on Sat, Dec 14, 2019 at 8:28 PM:
>
> > Please update the whole document; there seem to be many mismatches between
> > different sections.
> >
> > Sheng Wu 吴晟
> > Twitter, wusheng1108
> >
> >
> > han liu <[email protected]> wrote on Sat, Dec 14, 2019 at 2:54 PM:
> >
> > > Okay, I have addressed these two issues.
> > >
> > > 1. New field called maxThreadThresold.
> > >     Added a new input box to the "add-task-modal" tab in the prototype.
> > >     Added new conditions limiting the size of the field in the doc chapter
> > >     "Conditions that can create thread monitoring tasks".
> > >     Added a description of this field in the doc chapter "Thread monitoring
> > >     table structures".
> > >
> > > 2. Changed endpoint id to endpoint name.
> > >     This field was modified in the doc chapter "Thread monitoring table
> > >     structures".
> > >     Modified the input box in the "add-task-modal" tab in the prototype.
> > >
> > > I will continue working through the other issues in the comments.
> > >
> > >
> > > Sheng Wu <[email protected]> wrote on Sat, Dec 14, 2019 at 9:46 AM:
> > >
> > > > han liu <[email protected]> wrote on Fri, Dec 13, 2019 at 10:40 PM:
> > > >
> > > > > I see what you mean. I think this is a good feature, and I will
> > > > > summarize this in the design doc.
> > > > >
> > > > > If you have any good suggestions, please let me know.
> > > > >
> > > >
> > > > A simple way should be enough: set up the sampling rule based on the first
> > > > span's operation name, rather than the endpoint id.
> > > > In this case, plus #4056[1], there will be no id for a local span or an
> > > > exit span, but those two are used as the first span in the async scenario.
> > > >
> > > > [1] https://github.com/apache/skywalking/issues/4056
> > > >
> > > >
> > > > Sheng Wu 吴晟
> > > > Twitter, wusheng1108
> > > >
> > > >
> > > > >
> > > > > Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 10:24 PM:
> > > > >
> > > > > > Hi Han Liu
> > > > > >
> > > > > > One more reminder: a trace id in one instance could, in theory, have
> > > > > > multiple threads being sampled, such as in cross-thread scenarios. We
> > > > > > should also set a threshold for this. Max 3 threads for one trace id,
> > > > > > maybe?
> > > > > >
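A minimal Java sketch of the per-trace threshold suggested above. The class name TraceSamplingBudget and its methods are hypothetical and not part of SkyWalking; the value 3 is only the number floated in this mail. One budget object is assumed to belong to one trace id, so no lookup structure is needed.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: caps how many threads may be sampled for one trace id. */
class TraceSamplingBudget {
    private final int maxThreads;
    private final AtomicInteger sampled = new AtomicInteger();

    TraceSamplingBudget(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** Returns true if one more thread of this trace may be sampled. */
    boolean tryAcquire() {
        while (true) {
            int current = sampled.get();
            if (current >= maxThreads) {
                return false; // threshold reached, skip this thread
            }
            if (sampled.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Gives a permit back when a sampled thread finishes its part of the trace. */
    void release() {
        sampled.decrementAndGet();
    }
}
```

A cross-thread trace would call tryAcquire() each time another thread joins it, and only start dumping that thread when the call returns true.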
> > > > > > Sheng Wu 吴晟
> > > > > > Twitter, wusheng1108
> > > > > >
> > > > > >
> > > > > > Sheng Wu <[email protected]> wrote on Fri, Dec 13, 2019 at 1:03 PM:
> > > > > >
> > > > > > > Hi Han Liu and everyone
> > > > > > >
> > > > > > > I have submitted a design draft to the doc. Please take a look; if
> > > > > > > you have any issues, please let me know. We could set up an online
> > > > > > > meeting too.
> > > > > > >
> > > > > > > Sheng Wu <[email protected]> wrote on Thu, Dec 12, 2019 at 8:49 PM:
> > > > > > >
> > > > > > >> Hi Han Liu
> > > > > > >>
> > > > > > > >> I have replied to the design with the most important key points I
> > > > > > > >> expect. Let's discuss those. After we are on the same page, we could
> > > > > > > >> continue with more details.
> > > > > > >>
> > > > > > >> Sheng Wu 吴晟
> > > > > > >> Twitter, wusheng1108
> > > > > > >>
> > > > > > >>
> > > > > > > >> han liu <[email protected]> wrote on Thu, Dec 12, 2019 at 2:26 PM:
> > > > > > >>
> > > > > > > >>> Due to formatting issues with my previous mailbox, I have switched
> > > > > > > >>> to a new one.
> > > > > > > >>>
> > > > > > > >>> I have completed some of the features in the Google doc, and you can
> > > > > > > >>> provide your comments and improvements. I will continue to improve
> > > > > > > >>> the following functions in the documentation.
> > > > > > > >>> The document is the same one you previously sent me. To prevent
> > > > > > > >>> trouble, I'll post the link again here.
> > > > > > >>>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> > > > > > >>>
> > > > > > > >>> Sheng Wu <[email protected]> wrote on Tue, Dec 10, 2019 at 10:46 AM:
> > > > > > >>>
> > > > > > >>> > 741550557 <[email protected]> 于2019年12月9日周一 下午9:42写道：
> > > > > > >>> >
> > > > > > > >>> > > Thanks for your reply; the issues you mentioned are very critical and
> > > > > > > >>> > > meaningful.
> > > > > > > >>> > > Here I will answer what you mentioned. Sorry, I'm not good at comment
> > > > > > > >>> > > mode, so I use different colors and a “ “ prefix for the Q&A.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  As we have already designed a limit mechanism at the backend and agent
> > > > > > > >>> > >  side (according to your design), and the number would not be big (10 most
> > > > > > > >>> > >  likely), we just need a list to store the trace-id(s).
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > If we just need a list to store the trace-id(s), how can I map them to
> > > > > > > >>> > > the thread? I hope to use a map to quickly find the thread info from the
> > > > > > > >>> > > trace-id.
> > > > > > > >>> > > How can I get the thread-stack information your way? Could you please
> > > > > > > >>> > > elaborate?
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > Why do you need to do that? You just save a list of the thread ids that
> > > > > > > >>> > should be thread-dumped, and remove a thread id from it when the trace id
> > > > > > > >>> > is finished.
> > > > > > > >>> > This is easy to do with a loop search in the list, right?
> > > > > > > >>> > The thread stacks are in the list too: they are stored in each element,
> > > > > > > >>> > and they are themselves kept in a list.
> > > > > > > >>> >
> > > > > > > >>> > I think you were thinking of keeping all the stacks in a single map? That
> > > > > > > >>> > would cause a very dangerous memory and GC risk.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  Could you explain (2)? What do you mean by `stop`? I think your
> > > > > > > >>> > >  sampling mechanism should include the sampling duration.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > As for the communication between the sniffer and the OAP server, I hope
> > > > > > > >>> > > the sniffer only needs to obtain the thread-monitor tasks that need to be
> > > > > > > >>> > > monitored at this time. Termination can be triggered by either the
> > > > > > > >>> > > sniffer or the OAP server.
> > > > > > > >>> > > If it is only an OAP server notification, it may be more complicated,
> > > > > > > >>> > > because the OAP server needs to record that the sniffer has received the
> > > > > > > >>> > > current command, and the sniffer is not stable. For example, if the
> > > > > > > >>> > > sniffer shuts down while receiving the command, no thread information
> > > > > > > >>> > > will have been collected at that time.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I think that having the OAP server actively calculate the termination
> > > > > > > >>> > > can make the monitoring more controllable; of course, the client can also
> > > > > > > >>> > > actively report the end.
> > > > > > > >>> > > I think it's necessary to provide a protection mechanism for the sniffer
> > > > > > > >>> > > side, so that monitoring can be released quickly during a business peak
> > > > > > > >>> > > period or when the probe suddenly occupies a lot of CPU / memory
> > > > > > > >>> > > resources. Therefore, a stop-monitoring function can be provided in the
> > > > > > > >>> > > UI, so that the sniffer can recover.
> > > > > > > >>> > > Sampling duration is required, but only as the default thread-monitor
> > > > > > > >>> > > termination condition.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > But you should know that, in the real case, the thread dump monitor is a
> > > > > > > >>> > sampling mechanism; it is even hard for you to know where the dumps are
> > > > > > > >>> > happening. Then you have to send the stop notification to every instance.
> > > > > > > >>> > Even if you could send the notification, could you explain how you know
> > > > > > > >>> > when to stop?
> > > > > > > >>> > The scenario is: you are facing an issue that traces and metrics can't
> > > > > > > >>> > explain, so you activate thread dump, right? And at the same time, you
> > > > > > > >>> > want to stop?
> > > > > > >>> >
> > > > > > > >>> > CPU and memory resources should be guaranteed at the design level, such as
> > > > > > > >>> > 1. Limited thread dump tasks for one service.
> > > > > > > >>> > 2. Limited thread dump traces in a certain time window.
> > > > > > > >>> > For example, the OAP backend/UI would say you could only
> > > > > > > >>> > 1. Set 3 thread dump commands in the same time window.
> > > > > > > >>> > 2. Every command requires that the number of traces sampled for thread
> > > > > > > >>> > dump be less than 5. At the same time, in order to make this sampling
> > > > > > > >>> > work, only activate thread dump sampling after the trace has executed for
> > > > > > > >>> > more than 200ms (the value is an example only).
> > > > > > > >>> > 3. Thread dumps could be sent to the backend during sampling to reduce
> > > > > > > >>> > the memory cache.
> > > > > > > >>> > 4. The thread dump period should not be less than 5ms; 20ms is recommended.
> > > > > > > >>> > 5. How deep the thread dump should go.
> > > > > > >>> >
> > > > > > > >>> > We need a very detailed design; the above are just my thoughts, shared to
> > > > > > > >>> > convey the idea. The safety of the agent should not depend on a UI button.
> > > > > > > >>> > Otherwise, your online system will be in great danger, which is not the
> > > > > > > >>> > design goal of SkyWalking.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  The sampling period depends on how you are going to visualize it.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > Yes, I agree. I hope we can provide a select/input so that the trace
> > > > > > > >>> > > count and time window are configurable in the UI. Of course, this is my
> > > > > > > >>> > > current idea, and if there are other plans, I will adopt them.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  Highly doubt this. It may reduce the memory, but only if the code is
> > > > > > > >>> > >  running a loop or facing a lock issue. If it is neither of these two,
> > > > > > > >>> > >  the stacks are different.
> > > > > > > >>> > >  Also, please consider the CPU cost of comparing the stacks. You need a
> > > > > > > >>> > >  performance benchmark to verify this if you want it.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I didn't understand the first sentence. In my personal experience, most
> > > > > > > >>> > > cases are blocked on a lock (socket/local) or running a loop. I cannot
> > > > > > > >>> > > think of any other cases.
> > > > > > > >>> > > For the second sentence, I think I can add a thread-stack-element field
> > > > > > > >>> > > to store the top-level element of the last stack information. The next
> > > > > > > >>> > > time I get the stack information, I can check whether the current
> > > > > > > >>> > > top-level element is the same as that field.
> > > > > > > >>> > > I do this mainly to keep duplicate thread-stack information from taking
> > > > > > > >>> > > up too much memory; it is a way of optimizing memory space. We can
> > > > > > > >>> > > consider removing it, or do you have a better memory-saving solution?
> > > > > > > >>> > > After all, memory and CPU resources are very valuable in the sniffer.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > I know you mean to reduce the memory, but have you considered how much
> > > > > > > >>> > CPU a full thread dump comparison will cost? A thread dump could easily
> > > > > > > >>> > be hundreds of lines in Java.
> > > > > > > >>> > I mean this is a tradeoff between CPU and memory. If you are using only
> > > > > > > >>> > limited memory, you could send the snapshots to the backend while
> > > > > > > >>> > collecting new ones, or even save them to disk (if really necessary).
> > > > > > > >>> > In my experience, compression is always very high risk in the agent. If
> > > > > > > >>> > you want to do that, you need a benchmark test to prove that this CPU
> > > > > > > >>> > cost is small enough.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  The trace number and time window should be configurable; that is what I
> > > > > > > >>> > >  mean by more complex. In the current SamplingService, it is only n
> > > > > > > >>> > >  traces per 3 seconds. But here, it is a dynamic rule.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I expect that the trace count and time window can be configured at the
> > > > > > > >>> > > UI level, as I said above.
> > > > > > > >>> > > For SamplingService, my previous tech design was not rigorous enough,
> > > > > > > >>> > > and there were indeed problems.
> > > > > > > >>> > > Maybe we need to add a new SamplingService that builds a mapping from
> > > > > > > >>> > > endpoint-id to AtomicInteger.
> > > > > > > >>> > > For `first 5 traces of this endpoint in the next 5 mins`, we just need
> > > > > > > >>> > > to increment it.
> > > > > > > >>> > > For sampling, maybe use another scheduled task to reset the
> > > > > > > >>> > > AtomicInteger value.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > You could avoid the map by using an ArrayList with RangeAtomicInteger
> > > > > > > >>> > (SkyWalking provides that) to let the trace context get a slot.
> > > > > > > >>> > Also, since you are considering `active sampling after trace execution
> > > > > > > >>> > time more than xxx ms`, you should add a removal mechanism at runtime.
> > > > > > > >>> > Anyway, try your best to avoid using a Map, especially since this map
> > > > > > > >>> > could be changed at runtime.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  I think there should at least be a new level-one page called the
> > > > > > > >>> > >  configuration or command page, which could set up multiple sampling
> > > > > > > >>> > >  rules and visualize the existing tasks and related sampling data.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I think it's necessary to add a new page for configuring thread-monitor
> > > > > > > >>> > > tasks, and the specific UI display should be designed in detail.
> > > > > > > >>> > > For example, what I expect is similar to the trace page. The left side
> > > > > > > >>> > > displays the configuration, and the right side displays the related
> > > > > > > >>> > > trace list. Clicking a trace links to the trace page and opens the
> > > > > > > >>> > > sidebox.
> > > > > > > >>> > > I'm not good at this. Do you have any good plans?
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > UI is hard to discuss in text, so I am pretty sure we need a demo (not
> > > > > > > >>> > necessarily code; I mean something drawn with a tool).
> > > > > > > >>> > It is OK to show a trace with thread dumps on another page, or even
> > > > > > > >>> > better, to link it to your task ID.
> > > > > > > >>> > But this kind of abstract description is hard to continue with; it has
> > > > > > > >>> > no details, I mean.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > > >>> > > And I feel that the two of us have a different understanding of the
> > > > > > > >>> > > configuration object. I think it is more of a task than a command. I
> > > > > > > >>> > > don't know which way is better?
> > > > > > > >>> > > I suddenly thought of a problem. Some real problems are often triggered
> > > > > > > >>> > > at a specific period, such as a fixed business peak period, and we
> > > > > > > >>> > > cannot guarantee that the user will be operating the UI then.
> > > > > > > >>> > > So should a task mechanism be adopted to ensure that it can run at a
> > > > > > > >>> > > specific period?
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > This makes sense to me, and it is just an enhancement. It is just a
> > > > > > > >>> > start-time sampling rule.
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  We don't have a separate thread monitor view table. How about we add an
> > > > > > > >>> > >  icon in the segment list, and add an icon at the first span of this
> > > > > > > >>> > >  segment in the trace detail view?
> > > > > > > >>> > >  I think the latter one should be an entrance to the thread view.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I think it's a good idea. The link I mentioned in one of the answers
> > > > > > > >>> > > above is also a convenient entry point, I think.
> > > > > > > >>> > > The switch button I mentioned earlier is only a data filter in the
> > > > > > > >>> > > trace list query and does not need a separate table UI.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>> > As you intend to have a separated page for thread sampling,
> > it
> > > is
> > > > > OK
> > > > > > to
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > >  If you have some visualization ideas, drawn with any tool you like that
> > > > > > > >>> > >  supports comments, we could discuss them there. In my mind, we should
> > > > > > > >>> > >  support visualizing the thread dump stacks through the time window, and
> > > > > > > >>> > >  support aggregating them by choosing continuous stack snapshots in the
> > > > > > > >>> > >  time window.
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > I think we should find a front-end developer who is better at this to
> > > > > > > >>> > > join the discussion, because this depends on how the front-end UI can
> > > > > > > >>> > > be displayed.
> > > > > > > >>> > > BTW: I can provide code for the OAP server and sniffer, but the front
> > > > > > > >>> > > end may need help from the community. I hope that front-end friends can
> > > > > > > >>> > > participate in the discussion.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > Once you have the demo, I could loop our UI committers in for the UI side
> > > > > > > >>> > development. But UI committers may not be familiar with the thread dump
> > > > > > > >>> > context. We need to resolve that first.
> > > > > > > >>> > Let's start with a demo, such as some slides on Google Docs?
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > The above are my answers to all the questions, and I look forward to
> > > > > > > >>> > > your reply at any time. As more and more discussion takes place, the
> > > > > > > >>> > > details become more and more complete. This is good.
> > > > > > > >>> > > Everyone is welcome to join the discussion; if you have any questions
> > > > > > > >>> > > or good ideas, please let me know.
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > > >>> > I think we could move the discussion to the design doc as the next step.
> > > > > > >>> >
> > > > > > > >>> > Please use this
> > > > > > > >>> >
> > > > > > > >>> > https://docs.google.com/document/d/1rxMf1WN3PaFaZp7r8JmtwfdkmjLTcFW_ETAZv5FIU-s/edit#
> > > > > > > >>> > Write the design, including
> > > > > > > >>> > 1. Key features
> > > > > > > >>> > 2. Protocol
> > > > > > > >>> > 3. Work mechanism
> > > > > > > >>> > 4. UI design, prototype
> > > > > > > >>> > and anything you think is important before writing code.
> > > > > > >>> >
> > > > > > > >>> > This is the SkyWalking CLI design doc; you could use it as a reference.
> > > > > > > >>> >
> > > > > > > >>> > https://docs.google.com/document/d/1WBnRNF0ABxaSdBZo6Gv2hMzCQzj04YAePUdOyLWHWew/edit#
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > > >>> > > Original message
> > > > > > > >>> > > From: Sheng [email protected]
> > > > > > > >>> > > To: [email protected]
> > > > > > > >>> > > Sent: Monday, December 9, 2019, 10:50
> > > > > > > >>> > > Subject: Re: A proposal for Skywalking(thread monitor)
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > > Hi Thanks for writing this proposal with a detailed
> design.
> > > My
> > > > > > >>> comments
> > > > > > >>> > > are inline. 741550557 [email protected] 于2019年12月8日周日
> > > > 下午11:22写道：
> > > > > > >>> Thanks
> > > > > > >>> > > for your reply, I have carefully read these issues you
> > > > mentioned,
> > > > > > >>> and
> > > > > > >>> > > these issues mentioned are very meaningful and critical.
> I
> > > will
> > > > > > >>> give  you
> > > > > > >>> > > technical details about the issues you mentioned below.
> I
> > > find
> > > > > > these
> > > > > > >>> > > issues are related, so I will explain them in different
> > > > > > dimensions.
> > > > > > >>> > use
> > > > > > >>> > > a different protocol to transmission trace and
> > thread-stack:
> > > > 1.
> > > > > > add
> > > > > > >>> a
> > > > > > >>> > > boolean field in segment data, to record has thread
> > > monitored.
> > > > > and
> > > > > > >>> is
> > > > > > >>> > good
> > > > > > >>> > > for filter monitored trace in UI.  2. add new
> BootService,
> > > > > storage
> > > > > > >>> Map to
> > > > > > >>> > > record relate trace-id and  trace-stack information.  As
> we
> > > > > already
> > > > > > >>> have
> > > > > > >>> > > designed limit mechanism at backend and agent
> > side(according
> > > to
> > > > > > your
> > > > > > >>> > > design), also the number would not be big(10 most
> likely),
> > we
> > > > > just
> > > > > > >>> need a
> > > > > > >>> > > list to storage the trace-id(s)  3. listen
> > > > > > >>> > > TracingContextListener#afterFinished if the current
> segment
> > > has
> > > > > > >>> thread
> > > > > > >>> > > monitored, mark current trace-id don’t need to monitor
> > > anymore.
> > > > > > >>> (Cause
> > > > > > >>> > if
> > > > > > >>> > > for-each the step 2 map, the remove operation will fail
> and
> > > > throw
> > > > > > >>> > > exception).  4. when thread-monitor main thread running,
> It
> > > > will
> > > > > > >>> for-each
> > > > > > >>> > > step 2 map  and check is it don’t need monitor anymore, I
> > > will
> > > > > put
> > > > > > >>> data
> > > > > > >>> > > into new data  carrier.  5. generate new thread-monitor
> > gRPC
> > > > > > >>> protocol to
> > > > > > >>> > > send data from the data  carrier. The agent side design
> > seems
> > > > > > pretty
> > > > > > >>> > good.
> > > > > > >>> > >   the server receives thread-stack logic:  1. storage
> > > > stack-stack
> > > > > > >>> > > informations and trace-id/segment-id relations on a
> > > different
> > > > > > >>> table.  2.
> > > > > > >>> > > check thread-monitor is need to be stop on receiving data
> > or
> > > > > > >>> schedule.
> > > > > > >>> > > Could you explain the (2), what do you mean `stop`? I
> think
> > > if
> > > > > your
> > > > > > >>> > > sampling mechanism should include the sampling duration.
> > > > > reduce
> > > > > > >>> CPU
> > > > > > >>> > and
> > > > > > >>> > > memory in sniffer:  1. through the configuration of
> thread
> > > > > > >>> monitoring in
> > > > > > >>> > > the UI, you can  configure the performance loss. For
> > example,
> > > > set
> > > > > > the
> > > > > > >>> > > monitoring level: fast  monitoring (100ms), medium speed
> > > > > monitoring
> > > > > > >>> > > (500ms), slow speed monitoring  (1000ms).  The sampling
> > > period
> > > > > > >>> depends on
> > > > > > >>> > > how you are going to visualize it.  2. add new integer
> > field
> > > on
> > > > > per
> > > > > > >>> > > thread-stack, if current thread-stack last  element same
> as
> > > > last
> > > > > > >>> time,
> > > > > > >>> > > don’t need storage, just increment it. I think  it will
> > save
> > > a
> > > > > lot
> > > > > > of
> > > > > > >>> > > memory space. Highly doubt about this, reduce the memory,
> > > > maybe,
> > > > > > only
> > > > > > >>> > > reduce if the codes are running the loop or facing lock
> > > issue.
> > > > > But
> > > > > > >>> if it
> > > > > > >>> > is
> > > > > > >>> > > neither of these two, they are different. Also, please
> > > consider
> > > > > the
> > > > > > >>> CPU
> > > > > > >>> > > cost of the comparison of the stack. You need a
> performance
> > > > > > >>> benchmark to
> > > > > > >>> > > verify if you want this. 3. create new VM args to setting
> > > > > > >>> thread-monitor
> > > > > > >>> > > pool size, It dependence on  user, maybe default 3? (this
> > can
> > > > be
> > > > > > >>> > discussed
> > > > > > >>> > > later)  I think UI limit is enough. 3 seems good to me.
> 4.
> > > > limit
> > > > > > >>> > > thread-stack-element size to 100, I think it can resolve
> > most
> > > > of
> > > > > > the
> > > > > > >>> > > scenes already. It also can create a new VM args if need.
> > > > > > multiple
> > > > > > >>> > > sampling methods can choose :(just my current thoughts,
> can
> > > add
> > > > > > >>> more)
> > > > > > >>> > 1.
> > > > > > >>> > > base on current client SamplingServcie, extra a new
> factor
> > > > holder
> > > > > > to
> > > > > > >>> > > increment, and reset on schedule.  Yours may be a little
> > more
> > > > > > complex
> > > > > > >>> > than
> > > > > > >>> > > the current SamplingServcie, right? Based on the next
> rule.
> > > 2.
> > > > > > >>> `first 5
> > > > > > >>> > > traces of this endpoint in the next 5 mins`, it a good
> > idea.
> > > My
> > > > > > >>> > > understanding is that within a few minutes, each instance
> > can
> > > > > send
> > > > > > a
> > > > > > >>> > > specified number of traces.  The trace number and time
> > window
> > > > > > should
> > > > > > >>> be
> > > > > > >>> > > configurable, that is I mean more complex. Inthe current
> > > > > > >>> SamplingServcie,
> > > > > > >>> > > only n traces per 3 seconds. But here, it is a dynamic
> > rule.
> > > > > UI
> > > > > > >>> > settings
> > > > > > >>> > > and sniffer perception:  1. create a new button on the
> > > > dashboard
> > > > > > >>> page, It
> > > > > > >>> > > can create or stop a  thread-monitor. It can be dynamic
> > load
> > > > > > >>> > thread-monitor
> > > > > > >>> > > status when  reselecting endpoint.  I think at least
> should
> > > be
> > > > a
> > > > > > >>> level
> > > > > > >>> > one
> > > > > > >>> > > new page called configuration or command page, which
> could
> > > set
> > > > up
> > > > > > the
> > > > > > >>> > > multiple sampling rule and visualize the existing tasks
> and
> > > > > related
> > > > > > >>> > > sampling data.  2. sniffer creates a new scheduled task
> to
> > > > check
> > > > > > the
> > > > > > > >>> > > current service has endpoints that need monitoring, every 5
> > > > > > > >>> > > seconds. (I see the current sniffer has command functions; I
> > > > > > > >>> > > feel the principle is the same as the scheduler.)
> > > > > > > >>> > >
> > > > > > > >>> > > Seems reasonable.
> > > > > > > >>> > >
> > > > > > > >>> > > thread-monitor on the UI: (That’s just my initial thoughts; I
> > > > > > > >>> > > think there will be a better way to show it.)
> > > > > > > >>> > >
> > > > > > > >>> > > 1. When switching to the trace page, I think we need to add a
> > > > > > > >>> > >    new switch button to filter thread-monitor traces.
> > > > > > > >>> > > 2. Make a new thread-monitor icon on the same segment. It
> > > > > > > >>> > >    means it has thread-stack information.
> > > > > > > >>> > >
> > > > > > > >>> > > We don't have a separate thread monitor view table. How about
> > > > > > > >>> > > we add an icon in the segment list, and add an icon at the
> > > > > > > >>> > > first span of this segment in the trace detail view? I think
> > > > > > > >>> > > the latter one should be the entrance to the thread view.
> > > > > > > >>> > >
> > > > > > > >>> > > 3. Show it in the information sidebox when the user clicks the
> > > > > > > >>> > >    thread-monitor segment (any span). Create a new tab, like
> > > > > > > >>> > >    the log tab.
> > > > > > > >>> > >
> > > > > > > >>> > > If you have some visualization idea, drawn by any tool you
> > > > > > > >>> > > like supporting comment, we could discuss it there. In my
> > > > > > > >>> > > mind, we should support visualizing the thread dump stack
> > > > > > > >>> > > through the time windows, and support aggregating them by
> > > > > > > >>> > > choosing the continuous stack snapshots on the time window.
> > > > > > > >>> > >
> > > > > > > >>> > > These are just a description of my current implementation
> > > > > > > >>> > > details for thread-monitor. If these seem to work, I can do
> > > > > > > >>> > > some time planning for these tasks. Sorry, my English is not
> > > > > > > >>> > > very good; I hope you can understand. Maybe these have some
> > > > > > > >>> > > problems; any good ideas or suggestions are welcome.
> > > > > > > >>> > >
> > > > > > > >>> > > I really appreciate you leading this new direction. It is a
> > > > > > > >>> > > long-term task but should be interesting. :) Good work, carry
> > > > > > > >>> > > on.
> > > > > > > >>> > >
> > > > > > > >>> > > Original message
> > > > > > > >>> > > From: Sheng [email protected]
> > > > > > > >>> > > To: [email protected]
> > > > > > > >>> > > Sent: Sunday, December 8, 2019, 08:31
> > > > > > > >>> > > Subject: Re: A proposal for Skywalking(thread monitor)
> > > > > > > >>> > >
> > > > > > > >>> > > First of all, thanks for your proposal.
> > > > > > > >>> > > Thread monitoring is super important for application
> > > > > > > >>> > > performance. So basically, I agree with this proposal. But
> > > > > > > >>> > > for the tech details, I think we need more discussion in the
> > > > > > > >>> > > following ways:
> > > > > > > >>> > >
> > > > > > > >>> > > 1. Do you want to add thread status to the trace? If so, why
> > > > > > > >>> > >    not consider this as a UI-level join? Because we could
> > > > > > > >>> > >    know the thread id in the trace when we create a span,
> > > > > > > >>> > >    right? Then, if we have all the thread dumps, we could ask
> > > > > > > >>> > >    the UI to query a specific thread context based on
> > > > > > > >>> > >    timestamp and thread number(s).
> > > > > > > >>> > > 2. For thread dump, I don't know whether you have done a
> > > > > > > >>> > >    performance evaluation of this operation. From my
> > > > > > > >>> > >    experience, `get all need thread monitor segment every 100
> > > > > > > >>> > >    milliseconds` is a very high cost for your application and
> > > > > > > >>> > >    agent. So, you may need to be careful about doing this.
> > > > > > > >>> > > 3. Endpoint-related thread dump with some sampling mechanism
> > > > > > > >>> > >    makes more sense to me. And this should be activated by
> > > > > > > >>> > >    the UI. We should only provide a conditional thread dump
> > > > > > > >>> > >    sampling mechanism, such as `first 5 traces of this
> > > > > > > >>> > >    endpoint in the next 5 mins`.
> > > > > > > >>> > >
> > > > > > > >>> > > Jian Tan, I think DaoCloud has also customized this feature
> > > > > > > >>> > > in your internal SkyWalking. Could you share what you do?
> > > > > > > >>> > >
> > > > > > > >>> > > Sheng Wu 吴晟
> > > > > > > >>> > > Twitter, wusheng1108
> > > > > > > >>> > >
> > > > > > > >>> > > 741550557 [email protected] wrote on Sunday, December 8, 2019
> > > > > > > >>> > > at 12:14 AM:
> > > > > > > >>> > >
> > > > > > > >>> > > Hello everyone,
> > > > > > > >>> > >
> > > > > > > >>> > > I would like to share a new feature with SkyWalking, called
> > > > > > > >>> > > “thread monitor”.
> > > > > > > >>> > >
> > > > > > > >>> > > Background
> > > > > > > >>> > >
> > > > > > > >>> > > When our company used SkyWalking for APM earlier, we found
> > > > > > > >>> > > that many traces did not have enough spans to fill them up.
> > > > > > > >>> > > We suspected either third-party frameworks that we didn't
> > > > > > > >>> > > enhance, or API usage errors by programmers, such as a Java
> > > > > > > >>> > > CountDown number of 3 with only 2 countdowns. So we decided
> > > > > > > >>> > > to write a new feature to monitor the thread stack of an
> > > > > > > >>> > > executing trace, so that we can get more information on the
> > > > > > > >>> > > trace and quickly know what’s happening on that trace.
> > > > > > > >>> > >
> > > > > > > >>> > > Structure
> > > > > > > >>> > >
> > > > > > > >>> > > Agent(thread monitor) -- gRPC protocol -- OAP Server(Storage)
> > > > > > > >>> > > -- Skywalking-Rocketbot-UI
> > > > > > > >>> > >
> > > > > > > >>> > > More detail
> > > > > > > >>> > >
> > > > > > > >>> > > OAP Server:
> > > > > > > >>> > > 1. Store which traces need to be monitored (I suggest storing
> > > > > > > >>> > >    this on the endpoint, adding a new boolean field named
> > > > > > > >>> > >    needThreadMonitor).
> > > > > > > >>> > > 2. Provide a GraphQL API to change the endpoint monitor
> > > > > > > >>> > >    status.
> > > > > > > >>> > > 3. Monitor trace parsing; store the thread stack if the
> > > > > > > >>> > >    segment has any thread info.
> > > > > > > >>> > >
> > > > > > > >>> > > Skywalking-Rocketbot-UI:
> > > > > > > >>> > > 1. Add a new switch button on the dashboard. It can read or
> > > > > > > >>> > >    modify the endpoint status.
> > > > > > > >>> > > 2. It will show every thread stack when clicking the trace
> > > > > > > >>> > >    detail.
> > > > > > > >>> > >
> > > > > > > >>> > > Agent:
> > > > > > > >>> > > 1. Set up two new BootServices:
> > > > > > > >>> > >    1) Find any endpoints in the current service that need
> > > > > > > >>> > >       thread monitoring; start a new scheduled task that
> > > > > > > >>> > >       runs every minute.
> > > > > > > >>> > >    2) Start a new scheduled task to get all segments that
> > > > > > > >>> > >       need thread monitoring every 100 milliseconds, and put
> > > > > > > >>> > >       a new thread dump task into a global thread pool
> > > > > > > >>> > >       (fixed, default count 3).
> > > > > > > >>> > > 2. Check whether the endpoint needs thread monitoring when
> > > > > > > >>> > >    creating an entry/local span
> > > > > > > >>> > >    (TracingContext#createEntry/LocalSpan). If needed, it will
> > > > > > > >>> > >    be marked and put into the thread monitor map.
> > > > > > > >>> > > 3. When the TracingContext finishes, it will get the threads
> > > > > > > >>> > >    that have been monitored, and send all thread stacks to
> > > > > > > >>> > >    the server.
> > > > > > > >>> > >
> > > > > > > >>> > > Finally, I don’t know whether it is a good idea to get more
> > > > > > > >>> > > information on the trace. If you have any good ideas or
> > > > > > > >>> > > suggestions on this, please let me know.
> > > > > > > >>> > >
> > > > > > > >>> > > Mrpro
> > > > > > >>> >
> > > > > > >>>
> > > > > > >> --
> > > > > > > Sheng Wu 吴晟
> > > > > > >
> > > > > > > Apache SkyWalking
> > > > > > > Apache Incubator
> > > > > > > Apache ShardingSphere, ECharts, DolphinScheduler podlings
> > > > > > > Zipkin
> > > > > > > Twitter, wusheng1108
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
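
For readers following the quoted discussion, here is a minimal sketch of the conditional sampling idea quoted above (`first 5 traces of this endpoint in the next 5 mins`). It is only an illustration under stated assumptions, not SkyWalking code: the class name EndpointDumpSampler, its constructor parameters, and the methods trySample and dump are all hypothetical. The only real API used is the JDK's Thread#getStackTrace.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sampler (not part of SkyWalking): allows thread dumps only for
 * the first N traces of one endpoint inside a fixed time window.
 */
class EndpointDumpSampler {
    private final String endpointName;
    private final int maxTraces;
    private final long deadlineMillis;
    private final AtomicInteger sampled = new AtomicInteger();

    EndpointDumpSampler(String endpointName, int maxTraces,
                        long startMillis, long windowMillis) {
        this.endpointName = endpointName;
        this.maxTraces = maxTraces;
        this.deadlineMillis = startMillis + windowMillis;
    }

    /** Returns true if a trace entering the endpoint at nowMillis may be dumped. */
    boolean trySample(String endpoint, long nowMillis) {
        if (!endpointName.equals(endpoint) || nowMillis >= deadlineMillis) {
            return false;
        }
        // Claim one of the remaining slots atomically.
        int current;
        do {
            current = sampled.get();
            if (current >= maxTraces) {
                return false;
            }
        } while (!sampled.compareAndSet(current, current + 1));
        return true;
    }

    /** Takes one stack snapshot of the given thread as plain text. */
    static String dump(Thread thread) {
        StringBuilder sb = new StringBuilder(thread.getName()).append('\n');
        for (StackTraceElement frame : thread.getStackTrace()) {
            sb.append("  at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Assumed task: first 5 traces of "/order" in the next 5 minutes.
        EndpointDumpSampler sampler =
            new EndpointDumpSampler("/order", 5, now, 5 * 60 * 1000L);
        int accepted = 0;
        for (int i = 0; i < 7; i++) {
            if (sampler.trySample("/order", now + i)) {
                accepted++;
            }
        }
        System.out.println("accepted traces: " + accepted);
        System.out.print(dump(Thread.currentThread()));
    }
}
```

Bounding both the trace count and the time window keeps the dump cost limited, which is the concern raised about dumping every 100 milliseconds.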
