Hello Gyula,
Thanks for the suggestion! Enabling flamegraphs by default could indeed
improve visibility, and in many stable environments, the passive overhead
is minimal. However, based on our use cases, there are a few practical
reasons we’ve opted to keep them disabled by default:

   1. Memory & GC behavior during sampling: When the flamegraph tab is
   opened in the UI, VertexThreadInfoTracker starts sampling stack traces
   every statsRefreshInterval (default 60s). Each sample consists of
   ThreadInfoSample objects that hold full StackTraceElement[] arrays, and
   the samples from all task managers are stored on the JobManager heap
   inside VertexThreadInfoTracker. We observed that this structure grows
   rapidly with high parallelism (>1000) and deep stack sampling, and the
   entries are retained until cleanUpInterval (default 10 minutes) expires,
   which caused memory issues in the JM. The major issue is that a JM OOM
   affects the availability of the entire cluster :(
   2. Flink doesn't persist flamegraph data: Flamegraph samples are held
   entirely in memory. For future iterations, we’re considering storing them
   temporarily on local disk or in external storage, which would decouple
   the UI tab from JM memory pressure, but that requires significant changes.
   3. Config values still come from RestOptions: Even if we enable
   flamegraphs by default, the sampling parameters (e.g., numSamples,
   stackDepth, delayBetweenSamples) are still initialized via RestOptions on
   JobManager startup. Without dynamic REST reconfiguration (as proposed),
   users would still need to restart the cluster to change them.
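
To give a rough sense of scale for point 1, here is a back-of-envelope
sketch (not taken from the Flink code base). The class name, the method, and
the 64 bytes per retained StackTraceElement are all assumptions made for
illustration; real retained sizes depend on the JVM, string sharing, and how
many frames each thread actually has.

```java
public class FlamegraphHeapEstimate {

    // Assumed retained bytes per StackTraceElement; illustrative, not measured.
    static final long ASSUMED_BYTES_PER_FRAME = 64L;

    // Rough upper bound on the JM heap held by one sampled vertex, assuming
    // every subtask contributes samplesPerSubtask stack traces and every
    // trace is stackDepth frames deep.
    static long estimateBytes(int subtasks, int samplesPerSubtask, int stackDepth) {
        return (long) subtasks * samplesPerSubtask * stackDepth * ASSUMED_BYTES_PER_FRAME;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: parallelism 1000, 100 samples, stack depth 100.
        long bytes = estimateBytes(1000, 100, 100);
        System.out.printf("~%d MiB retained per sampled vertex%n", bytes / (1024 * 1024));
    }
}
```

Even if the per-frame figure is off by a large factor, the estimate scales
linearly with parallelism, number of samples, and stack depth, which is why
the high-parallelism case is the one we worry about.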

Let me know if that makes sense.

On Tue, Aug 19, 2025 at 4:59 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey!
> Instead of adding new logic for this, can we make the flamegraphs enabled
> by default?
>
> Based on my experience, almost everyone wants it enabled, and it doesn't
> seem to add any overhead when they are not actually checked on the UI.
>
> Cheers,
> Gyula
>
> On Tue, Aug 19, 2025 at 1:27 PM Danny Cranmer <dannycran...@apache.org>
> wrote:
>
> > Hello Poorvank,
> >
> > Thanks for driving this, I can understand how dynamically enabling
> > FlameGraphs can be powerful, so +1 on the general idea.
> >
> > 1. Instead of adding a FlameGraph-specific REST API, did you consider
> > adding a more general config API? Similar to the dynamic job
> > configuration [1] endpoint, but for cluster configs instead of job
> > configs? We could add an allow list of supported config options and
> > start with Flamegraph. This would allow other configs to use the API in
> > the future without adding more APIs.
> > 2. nit: As for the UI, I would prefer for the settings to take up less
> > space. The new options are at the top of the view, even when not
> > expanded.
> >
> > Thanks,
> > Danny
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
> >
> > On Tue, Aug 12, 2025 at 4:31 AM Poorvank Bhatia <puravbhat...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to open a discussion proposing the ability to enable
> > > flamegraphs at runtime and make their configuration, i.e. number of
> > > samples, delay between samples, and stack depth, *dynamically
> > > adjustable via the Web UI*, without requiring any job or cluster
> > > restarts.
> > >
> > > As of now, enabling flamegraphs requires setting
> > > *rest.flamegraph.enabled=true* and restarting the Job. This is not
> > > ideal for debugging live issues, especially in production environments.
> > >
> > > I discussed this idea offline with Roman Khachatryan (author of
> > > FLIP-530
> > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration>),
> > > Rui Fan, and Arvid Heise. While Rui noted that this could potentially
> > > align with FLIP-530’s direction, Roman confirmed that it’s better
> > > handled as a separate effort, since FLIP-530 is scoped to job-level
> > > config, whereas this proposal addresses cluster-level observability
> > > via RestOptions.
> > >
> > > For design details, please refer to: Dynamic Flamegraph via UI
> > > <https://docs.google.com/document/d/1A9fLFgXMGxQQn6X8WCv7mLL21AnLqrDFvLSHnUg8rLA/edit?tab=t.0#heading=h.s351fc464ma6>
> > >
> > > I’ve attached a short demo to help visualize the proposed feature and
> > > gather feedback. Demo
> > > <https://drive.google.com/file/d/1iik6aOc2uc9sFlHFlT8YDX5TKFdoD15u/view?usp=sharing>
> > >
> > > Looking forward to your thoughts.
> > >
> > > Regards,
> > >
> > > Poorvank Bhatia
> > >
> >
>
