Pyroscope[1] and Parca[2] are other options for less-intrusive profiling (& great fits for k8s) that move the burden from Flink & its UI to tools that are purpose-built for this use case. Perhaps we could investigate what it would take (if anything) to make Flink compatible with those?

Best,
Austin

[1]: https://pyroscope.io/
[2]: https://www.parca.dev/

On Fri, Feb 11, 2022 at 8:33 AM Alexander Fedulov <alexan...@ververica.com> wrote:

> Are you sure the UI is the bottleneck? The UI gets back a JSON
> representation of this data structure:
>
> https://github.com/apache/flink/blob/2e21321f9c9d9aada7e4ad8ca90d915c34f58015/flink-runtime/src/main/java/org/apache/flink/runtime/webmonitor/threadinfo/JobVertexFlameGraph.java
>
> All samples from the individual subtasks get merged in the backend; the UI
> just renders this one data structure. The complexity is *O(s)*, where *s*
> is the number of elements on the stack, not *O(s*n)*, where *n* is the
> number of subtasks. Since all subtasks execute the same code, *s* is
> expected to be stable regardless of the parallelism.
>
> Best,
> Alexander Fedulov
>
> On Fri, Feb 11, 2022 at 11:01 AM David Morávek <d...@apache.org> wrote:
>
> > There are already tools [1] that simplify this for the user.
> >
> > I honestly don't know; it feels like it can bring more problems than
> > actual benefits, as this heavily relies on the environment. It can
> > easily break for some users, e.g. because of the kernel settings, or
> > because their architecture is not supported. We'd also need to go the
> > extra mile regarding security.
> >
> > Considering there are already other tools that are specifically designed
> > for this (such as [1]), I personally don't feel that this should be part
> > of Flink.
> >
> > [1] https://github.com/yahoo/kubectl-flame
> >
> > On Fri, Feb 11, 2022 at 9:28 AM Jacky Lau <281293...@qq.com.invalid> wrote:
> >
> > > Our Flink application runs on k8s. Yes, users can use the
> > > async-profiler directly, but it is not convenient: they have to
> > > download the jars and need to know how to use it, and some users
> > > don't know the tool. If we integrate it, users will benefit a lot.
> > >
> > > On 2022/01/26 18:56:17 David Morávek wrote:
> > > > I'd second Alex's concerns.
> > > > Is there a reason why you can't use the async-profiler directly?
> > > > In what kind of environment are your Flink clusters running
> > > > (YARN / k8s / ...)?
> > > >
> > > > Best,
> > > > D.
> > > >
> > > > On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov <al...@ververica.com> wrote:
> > > >
> > > >> Hi Jacky,
> > > >>
> > > >> Could you please clarify what kind of *problems* you experience
> > > >> with large parallelism? You referred to D3; is it something
> > > >> related to rendering on the browser side, or is it about the
> > > >> sample collection process? Were you able to identify the
> > > >> bottleneck?
> > > >>
> > > >> Fundamentally, I have some concerns regarding the proposed approach:
> > > >> 1. Calling shell scripts triggered via the web UI is a security
> > > >> concern, and it needs to be evaluated carefully whether it could
> > > >> introduce any unexpected attack vectors (depending on the
> > > >> implementation, passed parameters, etc.).
> > > >> 2. My understanding is that the async-profiler implementation is
> > > >> system-dependent. How do you propose to handle multiple
> > > >> architectures? Would you like to ship each available
> > > >> implementation within Flink? [1]
> > > >> 3. Do you plan to make use of the full async-profiler features,
> > > >> including native call sampling with perf_events?
> > > >> If so, the issue I see is that some environments restrict ptrace
> > > >> calls by default [2].
> > > >>
> > > >> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> > > >> [2] https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
> > > >>
> > > >> Best,
> > > >> Alexander Fedulov
> > > >>
> > > >> On Wed, Jan 26, 2022 at 1:59 PM 李森 <li...@icloud.com.invalid> wrote:
> > > >>
> > > >>> This is an expected feature, as we also experienced browser
> > > >>> crashes on existing operator-level flame graphs.
> > > >>>
> > > >>> Best,
> > > >>> Echo Lee
> > > >>>
> > > >>>> On Jan 24, 2022, at 6:16 PM, David Morávek <da...@gmail.com> wrote:
> > > >>>>
> > > >>>> Hi Jacky,
> > > >>>>
> > > >>>> The link seems to be broken; here is the correct one [1].
> > > >>>>
> > > >>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > > >>>>
> > > >>>> Best,
> > > >>>> D.
> > > >>>>
> > > >>>>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <28...@qq.com.invalid> wrote:
> > > >>>>>
> > > >>>>> Hi All,
> > > >>>>>
> > > >>>>> I would like to start the discussion on FLIP-213
> > > >>>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> > > >>>>> which aims to provide TaskManager-level (process-level) flame
> > > >>>>> graphs using async-profiler, which is the most popular tool for
> > > >>>>> Java performance analysis; Arthas and IntelliJ both use it.
> > > >>>>> We also support it at our company, Ant Group.
> > > >>>>>
> > > >>>>> Flink already supports FLIP-165: Operator's Flame Graphs. It
> > > >>>>> draws the flame graph with the front-end library
> > > >>>>> d3-flame-graph, which has some problems in jobs with large
> > > >>>>> parallelism.
> > > >>>>>
> > > >>>>> Please be aware that the FLIP wiki page is not fully done,
> > > >>>>> since I don't know whether it will be accepted by the Flink
> > > >>>>> community. Feel free to add your thoughts to make this feature
> > > >>>>> better! I am looking forward to all your responses. Thanks very
> > > >>>>> much!
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Jacky Lau
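
To make Alexander's point about backend merging concrete, here is a minimal Java sketch of how stack samples from many subtasks can be folded into a single tree. It is not Flink's actual JobVertexFlameGraph code: the class name `FlameGraphMergeSketch`, the `Node` type, the `merge` method, and the frame names are all hypothetical, and it assumes each sample lists frames from the outermost call to the innermost.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical merge of stack samples, not Flink's implementation.
public class FlameGraphMergeSketch {

    /** One frame in the merged tree. */
    static final class Node {
        final String frame;
        int hits; // number of samples that passed through this frame
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String frame) {
            this.frame = frame;
        }

        /** Number of nodes in the subtree rooted here (what a UI would render). */
        int size() {
            int n = 1;
            for (Node child : children.values()) {
                n += child.size();
            }
            return n;
        }
    }

    /** Folds all samples into one tree; identical stacks share the same nodes. */
    static Node merge(List<List<String>> samples) {
        Node root = new Node("root");
        for (List<String> stack : samples) {
            root.hits++;
            Node current = root;
            for (String frame : stack) {
                // Reuse the child for this frame if an earlier sample created it.
                current = current.children.computeIfAbsent(frame, Node::new);
                current.hits++;
            }
        }
        return root;
    }

    public static void main(String[] args) {
        // Hypothetical frames; every subtask runs the same code path.
        List<String> stack = List.of("Task.run", "Operator.processElement", "UserFunction.map");

        Node fewSubtasks = merge(Collections.nCopies(4, stack));
        Node manySubtasks = merge(Collections.nCopies(400, stack));

        System.out.println("4 samples:   nodes=" + fewSubtasks.size() + ", hits=" + fewSubtasks.hits);
        System.out.println("400 samples: nodes=" + manySubtasks.size() + ", hits=" + manySubtasks.hits);
    }
}
```

Under these assumptions the merged tree has the same number of nodes for 4 and for 400 samples; only the hit counters grow. That is the sense in which the structure handed to the UI scales with the stack contents rather than with the parallelism.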