Pyroscope[1] and Parca[2] are other options for less-intrusive profiling (&
great fits for k8s) that move the burden from Flink & its UI to tools that
are purpose-built for this use case. Perhaps we could investigate what it
would take (if anything) to make Flink compatible with those?

Best,
Austin

[1]: https://pyroscope.io/
[2]: https://www.parca.dev/


On Fri, Feb 11, 2022 at 8:33 AM Alexander Fedulov <alexan...@ververica.com>
wrote:

> Are you sure the UI is the bottleneck? The UI gets back a JSON
> representation of this data structure:
>
> https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java
>
>
> All samples from the individual subtasks get merged in the backend, the UI
> just renders this one data structure. The complexity is *O(s)*, where *s*
> is the number of elements on the stack, not *O(s*n)* where *n* is the
> number of subtasks. Since all subtasks execute the same code, *s* is
> expected to be stable regardless of the parallelism.
>
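To illustrate the merging argument above, here is a minimal sketch. It is not Flink's actual implementation: the class FlameGraphMergeSketch, its Node type, and the frame names are hypothetical and chosen only for illustration. It folds every sampled stack into one tree, so the structure handed to the UI grows with the number of distinct frames rather than with the number of subtasks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlameGraphMergeSketch {

    /** One frame in the merged tree (hypothetical type, not Flink's own node class). */
    static final class Node {
        final String frame;
        int hits;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String frame) {
            this.frame = frame;
        }

        /** Number of nodes in the subtree rooted here, including this node. */
        int size() {
            int n = 1;
            for (Node child : children.values()) {
                n += child.size();
            }
            return n;
        }
    }

    /** Folds all sampled stacks (outermost frame first) into a single tree. */
    static Node merge(List<List<String>> samples) {
        Node root = new Node("root");
        for (List<String> stack : samples) {
            root.hits++;
            Node current = root;
            for (String frame : stack) {
                // Identical frames at the same depth share one node.
                current = current.children.computeIfAbsent(frame, Node::new);
                current.hits++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Made-up frame names; every "subtask" reports the same stack.
        List<String> stack = List.of("Task.run", "StreamTask.invoke", "MyMapper.map");
        List<List<String>> samples = new ArrayList<>();
        for (int subtask = 0; subtask < 1000; subtask++) {
            samples.add(stack);
        }
        Node root = merge(samples);
        System.out.println(root.size() + " nodes, " + root.hits + " samples");
    }
}
```

With 1000 identical stacks the merged tree still has only 4 nodes (the root plus three frames); only the hit counters grow with the parallelism.
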
> Best,
> Alexander Fedulov
>
> On Fri, Feb 11, 2022 at 11:01 AM David Morávek <d...@apache.org> wrote:
>
> > There are already tools [1] that simplify this for the user.
> >
> > I honestly don't know; it feels like it could bring more problems than
> > actual benefits, as this relies heavily on the environment. It can easily
> > break for some users, e.g. because of kernel settings, or because their
> > architecture is not supported. We'd also need to go the extra mile
> > regarding security.
> >
> > Considering there are already other tools that are specifically designed
> > for this (such as [1]), I personally don't feel that this should be part
> > of Flink.
> >
> > [1] https://github.com/yahoo/kubectl-flame
> >
> >
> > On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid>
> > wrote:
> >
> > > > Our Flink application runs on k8s. Yes, users can use the async-profiler
> > > > directly, but that is not convenient: they have to download the jars
> > > > and know how to use them, and some users don't know the tool. If we
> > > > integrate it, users will benefit a lot.
> > >
> > > On 2022/01/26 18:56:17 David Morávek wrote:
> > > > > I'd second Alex's concerns. Is there a reason why you can't use the
> > > > > async-profiler directly? In what kind of environment are your Flink
> > > > > clusters running (YARN / k8s / ...)?
> > > >
> > > > Best,
> > > > D.
> > > >
> > > > > On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov <al...@ververica.com>
> > > > > wrote:
> > > >
> > > >> Hi Jacky,
> > > >>
> > > > >> Could you please clarify what kind of *problems* you experience with
> > > > >> large parallelism? You referred to D3; is it something related to
> > > > >> rendering on the browser side, or is it about the sample collection
> > > > >> process? Were you able to identify the bottleneck?
> > > >>
> > > > >> Fundamentally I have some concerns regarding the proposed approach:
> > > > >> 1. Calling shell scripts triggered via the web UI is a security concern
> > > > >> and it needs to be evaluated carefully if it could introduce any
> > > > >> unexpected attack vectors (depending on the implementation, passed
> > > > >> parameters etc.)
> > > > >> 2. My understanding is that the async-profiler implementation is
> > > > >> system-dependent. How do you propose to handle multiple architectures?
> > > > >> Would you like to ship each available implementation within Flink? [1]
> > > > >> 3. Do you plan to make use of full async-profiler features including
> > > > >> native calls sampling with perf_events? If so, the issue I see is that
> > > > >> some environments restrict ptrace calls by default [2]
> > > >>
> > > >> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> > > >> [2]
> > > >>
> > > >>
> > >
> >
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> > > >>
> > > >>
> > > >> Best,
> > > >> Alexander Fedulov
> > > >>
> > > > >> On Wed, Jan 26, 2022 at 1:59 PM 李森 <li...@icloud.com.invalid> wrote:
> > > >>
> > > > >>> This is an expected feature, as we also experienced browser crashes on
> > > > >>> existing operator-level flame graphs.
> > > >>>
> > > >>> Best,
> > > >>> Echo Lee
> > > >>>
> > > > >>>> On Jan 24, 2022, at 6:16 PM, David Morávek <da...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Jacky,
> > > >>>>
> > > >>>> The link seems to be broken, here is the correct one [1].
> > > >>>>
> > > >>>> [1]
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > > >>>>
> > > >>>> Best,
> > > >>>> D.
> > > >>>>
> > > > >>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
> > > >>>>>
> > > > >>>>> Hi All,
> > > > >>>>>
> > > > >>>>> I would like to start the discussion on FLIP-213
> > > > >>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> > > > >>>>> which aims to provide TaskManager-level (process-level) flame graphs
> > > > >>>>> via async-profiler, which is the most popular tool for Java
> > > > >>>>> performance analysis; Arthas and IntelliJ both use it. We also
> > > > >>>>> support it at our company, Ant Group.
> > > > >>>>>
> > > > >>>>> Flink already supports FLIP-165: Operator's Flame Graphs. It draws
> > > > >>>>> the flame graph with the front-end library d3-flame-graph, which has
> > > > >>>>> some problems with jobs of large parallelism.
> > > > >>>>>
> > > > >>>>> Please be aware that the FLIP wiki page is not fully done, since I
> > > > >>>>> don't know whether it will be accepted by the Flink community.
> > > > >>>>>
> > > > >>>>> Feel free to add your thoughts to make this feature better! I am
> > > > >>>>> looking forward to all your responses. Thanks very much!
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> Best Jacky Lau
> > > >>>
> > > >>
> > >
> >
>
