Thanks Rui,

Appreciate your detailed response.

On enabling by default: I agree that optimizing memory usage first is the
better approach. I'll pivot to implementing the disk-based storage solution
to address the root cause rather than working around it with configuration
changes. Let me make those changes to the doc.
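
As a rough starting point, the shape I have in mind is sketched below (the
class and method names are made up purely for illustration, and the
serialization format and cleanup policy are still open questions): only file
paths would stay on the JobManager heap, while the serialized samples
themselves live on local disk.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.flink.runtime.jobgraph.JobVertexID;

/** Pure sketch; class and method names are invented, not existing Flink code. */
public class DiskBackedFlameGraphStore {

    private final Path directory;

    public DiskBackedFlameGraphStore() throws IOException {
        // Only the directory path is kept in JobManager memory; the serialized
        // samples are spilled to local disk.
        this.directory = Files.createTempDirectory("flamegraph-stats");
    }

    public void store(JobVertexID vertexId, byte[] serializedStats) throws IOException {
        Files.write(directory.resolve(vertexId.toString()), serializedStats);
    }

    public byte[] load(JobVertexID vertexId) throws IOException {
        return Files.readAllBytes(directory.resolve(vertexId.toString()));
    }
}

Cleanup could then reuse the existing cleanUpInterval by deleting files instead
of evicting heap entries.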

On configuration refactoring: to clarify, the goal is to support runtime
updates (e.g., triggering sampling via the REST API/UI without restarts). I
want to make sure I understand your suggestion correctly: when you mention
"refactor the internal implementation to use job-level configs to override
cluster-level ones" for the flamegraph parameters, could you elaborate on what
that would look like?

Specifically, I’m trying to understand the preferred path if we still want to
support dynamic flamegraph toggling. Currently, the flamegraph configs are
cluster-level (RestOptions.FLAMEGRAPH_*) and are read during cluster startup.
Are you suggesting one of the following?
1. Move to job-level configs: introduce new options (e.g., an
execution.flamegraph.stack-depth option) that jobs can set to override the
cluster defaults? (A rough sketch of what I have in mind follows this list.)
2. Change the config reading timing: following the FLINK-37985 pattern, read
the flamegraph configs during sampling operations (from the job's
Configuration) instead of caching them at cluster startup?
3. Use FLIP-530 for updates: once the options are job-level, use the standard
job configuration API instead of custom REST endpoints?
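
For option 1, here is roughly what I am picturing (a sketch only: the option
name and the helper below are purely illustrative and don't exist in Flink
today, and I'm assuming the cluster-level defaults stay on RestOptions):

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;

public class FlameGraphOptionsSketch {

    // Hypothetical job-level option; the key name is illustrative only.
    public static final ConfigOption<Integer> JOB_FLAMEGRAPH_STACK_DEPTH =
            ConfigOptions.key("execution.flamegraph.stack-trace-max-depth")
                    .intType()
                    .noDefaultValue()
                    .withDescription("Job-level override for the flamegraph stack depth.");

    // At sampling time, prefer the job-level value and fall back to the
    // cluster-level default.
    public static int resolveStackDepth(Configuration jobConfig, Configuration clusterConfig) {
        return jobConfig
                .getOptional(JOB_FLAMEGRAPH_STACK_DEPTH)
                .orElseGet(() -> clusterConfig.get(RestOptions.FLAMEGRAPH_STACK_TRACE_MAX_DEPTH));
    }
}

If that matches your intent, then option 2 above would mostly be a question of
where such a lookup happens (at sampling time rather than cached at cluster
startup).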

Wouldn't changing these job-level configs still require job restarts per
FLIP-530's design?

Thank you :)

On Fri, Aug 22, 2025 at 1:09 AM Rui Fan <1996fan...@gmail.com> wrote:

> Hey Poorvank,
>
> Thanks for driving this discussion.
>
> As a core developer of the Flink flamegraph feature, I would be -1 on this
> proposal. Two concerns were mentioned in our first discussion on Slack, and
> the same two concerns have been raised again by Danny and Gyula.
>
> 1. Why not enable it by default?
>
> As far as I know, many Flink users and companies have enabled flamegraphs
> in production. While I acknowledge the concern about JobManager
> memory pressure at high parallelism, I believe the better approach is
> to address the root cause by optimizing memory usage directly.
> For instance, we could store the flamegraph data on local disk.
>
> By optimizing it first and then enabling it by default, we can provide
> a true out-of-the-box experience for Flink users, rather than requiring
> them to manually tweak configuration options.
>
> 2. On dynamic configuration
>
> The goal of FLIP-530 is powerful, but it's not a silver bullet for all
> configuration options. If dynamic updates for flamegraph parameters
> are necessary, the feasible solution is to refactor the internal
> implementation to use job-level configs to override cluster-level ones.
>
> I suspect the flamegraph parameters are not the only ones that can't be
> updated dynamically via this mechanism; this likely affects all cluster-level
> options and any job options read before the JobGraph is created.
> For these cases, a consistent refactoring approach is far better than
> introducing separate, ad-hoc REST APIs for each one. FLINK-37985 [1], for
> example, refactors some checkpoint config option reads from before JobGraph
> generation to after JobGraph generation.
>
> [1] https://issues.apache.org/jira/browse/FLINK-37985
>
> Best,
> Rui
>
>
> On Tue, Aug 19, 2025 at 3:33 PM Poorvank Bhatia <puravbhat...@gmail.com>
> wrote:
>
> > Hey Danny,
> > Thanks for the feedback.
> >
> > 1. Instead of adding a FlameGraph specific REST API did you consider
> > adding a more general config API? Similar to that of the dynamical job
> > configuration [1] endpoint but for cluster configs instead of job? We could
> > add an allow list of supported config options and start with Flamegraph.
> > This would allow other configs to use the API in the future without adding
> > more APIs.
> >
> > I think FLIP-530 works for job configs because jobs can safely restart from
> > checkpoints with an updated config, scoped to a single job. Cluster configs,
> > however, lack a restart boundary: changes must be applied to shared
> > infrastructure without downtime, affecting all jobs. I analyzed the cluster
> > config categories; only a handful (like rest.flamegraph.*,
> > web.refresh-interval, cluster.thread-dump.stacktrace-max-depth) are safe
> > to update dynamically. Most (e.g., rpc.port,
> > taskmanager.memory.process.size, security.ssl.*) are startup-only or
> > deeply cached and would require service restarts. And since the flamegraph
> > configs are isolated and already dynamic in design, I couldn't find a
> > common pattern across configs.
> >
> > 2. nit: As for the UI, I would prefer for the settings to take up less
> > space. The new options are at the top of the view, even when not expanded.
> >
> > Makes sense. Will update it.
> >
> >
> > On Tue, Aug 19, 2025 at 6:12 PM Poorvank Bhatia <puravbhat...@gmail.com>
> > wrote:
> >
> > > Hello Gyula,
> > > Thanks for the suggestion! Enabling flamegraphs by default could indeed
> > > improve visibility, and in many stable environments, the passive overhead
> > > is minimal. However, based on our use cases, there are a few practical
> > > reasons we’ve opted to keep them disabled by default:
> > >
> > >    1. Memory & GC behavior during sampling: When the flamegraph tab is
> > >    opened in the UI, VertexThreadInfoTracker begins continuous stack trace
> > >    sampling every statsRefreshInterval (default 60s), with each sample
> > >    holding ThreadInfoSample objects that contain full StackTraceElement[]
> > >    arrays. This sampling introduces non-trivial memory pressure, especially
> > >    in high-parallelism scenarios. The stack trace data for all task
> > >    managers is stored on the JobManager heap within VertexThreadInfoTracker
> > >    (each entry containing ThreadInfoSample instances) and is retained until
> > >    cleanUpInterval (default 10 minutes). We observed that this structure
> > >    accumulates rapidly with high parallelism (>1000) and deep stack
> > >    sampling, causing memory issues in the JM. The major issue is that a JM
> > >    OOM affects the availability of the entire cluster :(
> > >    2. Flink doesn't persist flamegraph data: Flamegraph samples are held
> > >    entirely in memory. For future iterations, we’re considering temporarily
> > >    storing them on local disk or in external storage (but that requires
> > >    significant changes), which would decouple the UI tab from memory
> > >    pressure.
> > >    3. Config values still come from RestOptions: Even if we enable
> > >    flamegraphs by default, the sampling parameters (e.g., numSamples,
> > >    stackDepth, delayBetweenSamples) are still initialized from RestOptions
> > >    at JobManager startup. Without dynamic REST reconfiguration (as
> > >    proposed), users would still need to restart the cluster to change them.
> > >
> > > Let me know if that makes sense.
> > >
> > > On Tue, Aug 19, 2025 at 4:59 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> > >
> > >> Hey!
> > >> Instead of adding new logic for this, can we make the flamegraphs
> > >> enabled by default?
> > >>
> > >> Based on my experience, almost everyone wants it enabled, and it doesn't
> > >> seem to add any overhead when the flamegraphs are not actually checked in
> > >> the UI.
> > >>
> > >> Cheers,
> > >> Gyula
> > >>
> > >> On Tue, Aug 19, 2025 at 1:27 PM Danny Cranmer <dannycran...@apache.org>
> > >> wrote:
> > >>
> > >> > Hello Poorvank,
> > >> >
> > >> > Thanks for driving this, I can understand how dynamically enabling
> > >> > FlameGraphs can be powerful, so +1 on the general idea.
> > >> >
> > >> > 1. Instead of adding a FlameGraph specific REST API did you consider
> > >> > adding a more general config API? Similar to that of the dynamical job
> > >> > configuration [1] endpoint but for cluster configs instead of job? We
> > >> > could add an allow list of supported config options and start with
> > >> > Flamegraph. This would allow other configs to use the API in the future
> > >> > without adding more APIs.
> > >> > 2. nit: As for the UI, I would prefer for the settings to take up less
> > >> > space. The new options are at the top of the view, even when not
> > >> > expanded.
> > >> >
> > >> > Thanks,
> > >> > Danny
> > >> >
> > >> > [1]
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration
> > >> >
> > >> > On Tue, Aug 12, 2025 at 4:31 AM Poorvank Bhatia <puravbhat...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi all,
> > >> > >
> > >> > > I would like to open a discussion proposing the ability to enable
> > >> > > flamegraphs at runtime and to make their configuration (i.e., the
> > >> > > number of samples, the delay between samples, and the stack depth)
> > >> > > dynamically adjustable via the Web UI, without requiring any job or
> > >> > > cluster restarts.
> > >> > >
> > >> > > As of now, enabling flamegraphs requires setting
> > >> > > rest.flamegraph.enabled=true and restarting the job. This is not ideal
> > >> > > for debugging live issues, especially in production environments.
> > >> > >
> > >> > > I discussed this idea offline with Roman Khachatryan (author of
> > >> > > FLIP-530
> > >> > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-530%3A+Dynamic+job+configuration>),
> > >> > > Rui Fan, and Arvid Heise. While Rui noted that this could potentially
> > >> > > align with FLIP-530’s direction, Roman confirmed that it’s better
> > >> > > handled as a separate effort, since FLIP-530 is scoped to job-level
> > >> > > config, whereas this proposal addresses cluster-level observability
> > >> > > via RestOptions.
> > >> > >
> > >> > > For design details, please refer to: Dynamic Flamegraph via UI
> > >> > > <https://docs.google.com/document/d/1A9fLFgXMGxQQn6X8WCv7mLL21AnLqrDFvLSHnUg8rLA/edit?tab=t.0#heading=h.s351fc464ma6>
> > >> > >
> > >> > > I’ve attached a short demo to help visualize the proposed feature
> > >> > > and gather feedback: Demo
> > >> > > <https://drive.google.com/file/d/1iik6aOc2uc9sFlHFlT8YDX5TKFdoD15u/view?usp=sharing>
> > >> > >
> > >> > > Looking forward to your thoughts.
> > >> > >
> > >> > > Regards,
> > >> > >
> > >> > > Poorvank Bhatia
> > >> > >
> > >> >
> > >>
> > >
> >
>
