Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler on taskmanagers

Yun Tang Tue, 10 Oct 2023 00:15:40 -0700

Hi Jing,

First, developers will generally accept a small overhead while debugging
performance issues. Second, according to the async-profiler report [1], the
extra cost should be less than 3%. I also verified this by profiling a
CPU-intensive Flink ETL job; it did not show any obvious performance
regression during profiling.
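
For anyone who wants to repeat such a check on their own job, the arithmetic
can be sketched roughly as below. This is only an illustration: the class and
method names are hypothetical, throughput in records per second is an assumed
metric, and the numbers in main are made up, not measurements from this
thread.

```java
/**
 * Hypothetical helper (not part of Flink or async-profiler) that compares
 * job throughput with and without profiling.
 */
public final class ProfilingOverhead {
    private ProfilingOverhead() {}

    /** Relative throughput loss of the profiled run versus the baseline, in percent. */
    public static double overheadPercent(double baselineRecordsPerSec, double profiledRecordsPerSec) {
        if (baselineRecordsPerSec <= 0) {
            throw new IllegalArgumentException("baseline throughput must be positive");
        }
        return (baselineRecordsPerSec - profiledRecordsPerSec) * 100.0 / baselineRecordsPerSec;
    }

    /** True if the measured overhead stays strictly below the given budget (e.g. 3.0 for 3%). */
    public static boolean withinBudget(double baselineRecordsPerSec, double profiledRecordsPerSec,
                                       double budgetPercent) {
        return overheadPercent(baselineRecordsPerSec, profiledRecordsPerSec) < budgetPercent;
    }

    public static void main(String[] args) {
        // Illustrative values only; replace with throughput observed on a real job.
        double baseline = 100_000.0;
        double profiled = 98_000.0;
        System.out.printf("overhead: %.2f%%%n", overheadPercent(baseline, profiled));
        System.out.println("within 3% budget: " + withinBudget(baseline, profiled, 3.0));
    }
}
```

Comparing throughput is just one possible choice; CPU time or latency could be
compared the same way.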




[1] https://github.com/async-profiler/async-profiler/issues/14

Best
Yun Tang

________________________________
From: Jing Ge <j...@ververica.com.INVALID>
Sent: Tuesday, October 10, 2023 12:05
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java profiler 
on taskmanagers

Thanks Yun for your clarification, and special thanks to Rui for the
informative elaboration. Since we will have two flame graphs, I would
suggest updating the Flink documentation to help users understand them and
know when to use which one. The content provided by Rui is already a really
good starting point. Like it!

Yun

In theory, yes, I am with you. In practice, it would help the feature get more
traction and attention if you could provide real (test) metrics showing what
"extremely light" means, i.e. an engineering mindset. Otherwise, every serious
user will have to evaluate the performance on their own before using it
in production. WDYT?

Best regards,
Jing

On Tue, Oct 10, 2023 at 5:29 AM Yun Tang <myas...@live.com> wrote:

> Hi Jing,
>
> I will answer the current questions.
>
> > 1. will it replace the current flame graph, i.e. the current flame graph
> > will be deprecated and removed?
>
> Although I think the new Java profiler introduced in FLIP-375 is more
> powerful, just as Rui has replied, I don't think it could completely
> replace the current flame graph.
>
>
> > 2. does it make sense to provide the performance difference between
> > enabling and disabling it?
>
> The new Java profiler does not introduce any performance impact merely by
> being enabled; it only starts working when we trigger profiling. And from
> our experience, the overhead of profiling is extremely light.
>
>
>
> For Rui's question:
>
> > Are all process-level flame graphs stored in the BlobStore? Are they
> > maintained by the JobManager after sampling? Is there a cleanup strategy,
> > or a max-save-count strategy?
>
> Yes, we use the blob store to store the process-level flame graph files,
> and they are maintained on the TaskManager side. The flame graph files are
> cleaned up automatically once the limit set by
> rest.profiling.history-size is reached.
>
> Best
> Yun Tang
>
>
>
> ________________________________
> From: Rui Fan <1996fan...@gmail.com>
> Sent: Tuesday, October 10, 2023 10:10
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-375: Built-in cross-platform powerful java
> profiler on taskmanagers
>
> Hi Jing,
>
> > 1. will it replace the current flame graph, i.e. the current flame graph
> > will be deprecated and removed?
>
> I think the current flame graph cannot be removed.
>
> As a core contributor to the current flame graph who uses it almost
> every week, I would like to clarify the difference between the current
> flame graph and the flame graph proposed by FLIP-375.
>
> @The current flame graph
>
> The current flame graph works at the operator or task level. When one
> operator is the bottleneck of the current job, we can look at the current
> flame graph to check what the operator is doing.
>
> It includes three types: On-CPU, Off-CPU and Mixed-Type. The Mixed-Type
> is very useful: it can reveal why an operator is slow even if the operator
> doesn't use CPU, for example when the operator is blocked on querying HBase.
>
> It only supports the task thread, which means it cannot detect the CPU
> usage of other threads, such as RocksDB flush or compaction. This is the
> limitation of the current flame graph.
>
> @The flame graph proposed by FLIP-375.
>
> The flame graph proposed by FLIP-375 works at the process level, such as
> JobManager or TaskManager, so it can monitor all threads, for example the
> RocksDB background threads.
>
> When the CPU usage of one TM is high but none of its tasks are busy,
> the new flame graph will be useful.
>
> Back to the question: since it includes the task or operator threads,
> why is the current flame graph still needed?
>
> 1. The process-level flame graph cannot easily distinguish tasks,
> especially if there are multiple slots in a TM and different subtasks of
> the same task are running in multiple slots, because their stacks are
> very similar.
>
> 2. The Mixed-Type of the current flame graph may not be replaceable by the
> process-level flame graph.
>
> Please correct me if anything is wrong, thanks~
>
> Hi Yu,
>
> > Jobmanager allows the user to download the results of the corresponding
> > files on taskmanager with the blob service.
>
> Are all process-level flame graphs stored in the BlobStore?
> Are they maintained by the JobManager after sampling?
> Is there a cleanup strategy, or a max-save-count strategy?
>
> Best,
> Rui
>
>
> On Tue, Oct 10, 2023 at 1:24 AM Jing Ge <j...@ververica.com.invalid>
> wrote:
>
> > Hi Yu, Hi Yun,
> >
> > Brilliant idea! People are keen to use it. Thanks for your proposal! I
> > was wondering:
> >
> > 1. will it replace the current flame graph, i.e. the current flame graph
> > will be deprecated and removed?
> > 2. does it make sense to provide the performance difference between
> > enabling and disabling it?
> >
> > Best regards,
> > Jing
> >
> > On Mon, Oct 9, 2023 at 1:50 PM Yu Chen <yuchen.e...@gmail.com> wrote:
> >
> > > Hi Zhanghao,
> > >
> > > Yes, I agree with you. We'll take the JobManager into consideration and
> > > update the FLIP later!
> > >
> > > Best,
> > > Yu Chen
> > >
> > > On Mon, Oct 9, 2023 at 19:22, Zhanghao Chen <zhanghao.c...@outlook.com> wrote:
> > >
> > > > Hi Yun and Yu,
> > > >
> > > > Thanks for driving this. This would definitely help users identify
> > > > performance bottlenecks, especially in cases where the bottleneck
> > > > lies in the system stack (e.g. GC), and a big +1 for the downloadable
> > > > flame graph to ease sharing. I'm wondering if we could add this for
> > > > the job manager as well. In the OLAP scenario, and sometimes in the
> > > > streaming scenario (when there are heavy operations during execution
> > > > plan generation or in operator coordinators), the JM can be a
> > > > bottleneck as well.
> > > >
> > > > Best,
> > > > Zhanghao Chen
> > > > ________________________________
> > > > From: Yu Chen <yuchen.e...@gmail.com>
> > > > Sent: Monday, October 9, 2023 17:24
> > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > Subject: [DISCUSS] FLIP-375: Built-in cross-platform powerful java
> > > > profiler on taskmanagers
> > > >
> > > > Hi all,
> > > >
> > > > Yun Tang and I are opening this thread to discuss our proposal to
> > > > integrate async-profiler's capabilities for profiling TaskManagers
> > > > (e.g., generating flame graphs) into the Flink Web UI [1].
> > > >
> > > >
> > > > Currently, Flink provides ThreadDump and Operator-Level Flame Graphs
> > > > by sampling task threads. The results generated this way are missing
> > > > the relevant stacks of Java threads and system calls. The
> > > > async-profiler [2] is a low-overhead sampling profiler for Java, but
> > > > the steps to use it in a production environment are cumbersome and
> > > > suffer from permission and security risks.
> > > >
> > > > Therefore, we propose adding REST APIs that provide the capability
> > > > to invoke async-profiler on multiple platforms through JNI and that
> > > > can be easily operated from the Web UI. This enhancement will improve
> > > > the efficiency and experience of Flink users in identifying
> > > > performance bottlenecks.
> > > >
> > > >
> > > >
> > > > Please refer to the FLIP document for more details about the proposed
> > > > design and implementation. We welcome any feedback and opinions on
> > > > this proposal.
> > > >
> > > >
> > > >
> > > > [1] FLIP-375: Built-in cross-platform powerful java profiler on
> > > > taskmanagers - Apache Flink - Apache Software Foundation
> > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-375%3A+Built-in+cross-platform+powerful+java+profiler+on+taskmanagers>
> > > >
> > > > [2] GitHub - async-profiler/async-profiler: Sampling CPU and HEAP
> > > > profiler for Java featuring AsyncGetCallTrace + perf_events
> > > > <https://github.com/async-profiler/async-profiler>
> > > >
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Yun Tang and Yu Chen
> > > >
> > >
> >
>
