[
https://issues.apache.org/jira/browse/FLINK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-39688.
------------------------------
Resolution: Fixed
merged to main e4c56b2be3d844d4c12176f281972a8cc3de49b2
> Share a single PluginManager across operator and webhook startup
> ----------------------------------------------------------------
>
> Key: FLINK-39688
> URL: https://issues.apache.org/jira/browse/FLINK-39688
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.15.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.16.0
>
>
> h1. Summary
> The Flink Kubernetes Operator constructs a brand-new {{PluginManager}} four
> separate times during operator JVM startup, plus two more times during
> webhook JVM startup. Each call independently rescans
> {{/opt/flink/plugins}}, builds isolated child class loaders for every
> plugin subdirectory, and emits a duplicate {{"Plugin loader with ID not
> found, creating it: <subdir>"}} batch in the logs. This refactor consolidates
> plugin discovery onto a single shared {{PluginManager}} instance per JVM.
> h1. Problem
> Five different consumers of the Flink plugin SPI each call
> {{PluginUtils.createPluginManagerFromRootFolder(...)}} on their own:
> * {{OperatorMetricUtils.initOperatorMetrics}} (metric reporters)
> * {{ValidatorUtils.discoverValidators}} (validators)
> * {{ListenerUtils.discoverListeners}} (resource listeners)
> * {{FlinkOperator}} constructor (file system factories, via
> {{FileSystem.initialize}})
> * {{MutatorUtils.discoverMutators}} (mutating webhook plugins)
> In the operator process this produces four full plugin scans during startup.
> In the webhook process it produces two more. Each scan walks the plugin
> folder, opens every JAR manifest, and instantiates a fresh isolated
> {{URLClassLoader}} per plugin subdirectory. Those class loaders are retained
> for the lifetime of the JVM because the loaded SPI factories pin them.
> h1. Symptoms
> h2. Log noise on every startup
> A representative operator startup log shows the same eight
> {{flink-metrics-*}} subdirectories being announced four times in a row,
> drowning out meaningful startup messages:
> {noformat}
> o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found,
> creating it: flink-metrics-slf4j
> o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found,
> creating it: flink-metrics-statsd
> ... (repeated 4x)
> {noformat}
> h2. Wasted memory and I/O
> Four {{PluginManager}} instances, three of them redundant, mean four times
> the class loaders for the same JAR set, all retained for the JVM lifetime.
> The repeated folder walks and JAR opens are also pure overhead at startup.
> h2. Latent class loader identity hazard
> Today each call site loads a different SPI ({{MetricReporterFactory}},
> {{FileSystemFactory}}, {{FlinkResourceValidator}},
> {{FlinkResourceListener}}, {{FlinkResourceMutator}}), so no plugin
> class is loaded twice through different class loaders. If a plugin ever
> participated in two SPIs, however, the duplicated managers would silently
> produce two distinct {{Class}} instances for the same plugin class, causing
> cast failures, double initialization, or singleton drift. Sharing a single
> manager removes this whole class of bug.
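
The hazard described above is general JVM behavior and can be reproduced without Flink. The following is a minimal sketch, not operator code: {{SamplePlugin}} and {{IsolatingLoader}} are hypothetical names invented for the illustration, and the loader only imitates an isolated plugin class loader by defining one named class itself instead of delegating to its parent. It assumes the compiled class files are reachable as resources through the application class loader.

```java
import java.io.IOException;
import java.io.InputStream;

class ClassIdentitySketch {
    /** Hypothetical plugin class, used only for this demonstration. */
    public static class SamplePlugin {}

    /** Loader that defines one named class itself instead of delegating to its parent. */
    static class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // everything else: normal delegation
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> loaded = findLoadedClass(name);
                if (loaded == null) {
                    // Read the class bytes via the parent, but define the class in this loader.
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        loaded = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) {
                    resolveClass(loaded);
                }
                return loaded;
            }
        }
    }

    /** Loads SamplePlugin through a fresh isolating loader, like a fresh plugin scan would. */
    static Class<?> loadIsolated() {
        String name = SamplePlugin.class.getName();
        ClassLoader parent = ClassIdentitySketch.class.getClassLoader();
        try {
            return new IsolatingLoader(parent, name).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> viaFirstManager = loadIsolated();
        Class<?> viaSecondManager = loadIsolated();
        System.out.println("same name:  "
                + viaFirstManager.getName().equals(viaSecondManager.getName()));
        System.out.println("same class: " + (viaFirstManager == viaSecondManager));
        System.out.println("assignable: " + viaSecondManager.isAssignableFrom(viaFirstManager));
    }
}
```

Both loads return a class with the same fully qualified name, yet the two {{Class}} objects are distinct and not assignable to each other, which is why an instance created through one manager could not be cast to the type seen through another.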
> h2. Inconsistent class loader policy across consumers
> Only the listener call site enriches the configuration with parent-first
> patterns for {{io.fabric8}} and {{com.fasterxml}} so listener plugins share
> the operator's fabric8 client and Jackson. The other four consumers do not,
> which means a metric reporter, file system factory, validator, or mutator
> that bundles its own version of those libraries can end up with a duplicate
> fabric8 client (its own informers, its own connection state) or
> version-mismatched Jackson. Centralizing the policy makes it consistent
> across every plugin type.
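
To make the policy concrete: Flink documents parent-first patterns as simple prefixes checked against the fully qualified class name. The sketch below only illustrates that prefix check under this assumption. It is not the operator's or Flink's class loader code, and {{com.example.metrics.MyReporter}} is a made-up plugin class name.

```java
import java.util.List;

class ParentFirstSketch {
    /** The two prefixes named in the issue; treated here as plain string prefixes. */
    static final List<String> OPERATOR_PARENT_FIRST = List.of("io.fabric8", "com.fasterxml");

    /** Returns true if the class should be resolved through the parent loader first. */
    static boolean resolvesParentFirst(String className, List<String> patterns) {
        for (String prefix : patterns) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "io.fabric8.kubernetes.client.KubernetesClient",
            "com.fasterxml.jackson.databind.ObjectMapper",
            "com.example.metrics.MyReporter" // hypothetical plugin class
        };
        for (String name : candidates) {
            String policy = resolvesParentFirst(name, OPERATOR_PARENT_FIRST)
                    ? "parent-first (shared with the operator)"
                    : "plugin-first (isolated)";
            System.out.println(name + " -> " + policy);
        }
    }
}
```

With the patterns applied to every plugin type, a plugin that bundles its own fabric8 or Jackson classes would still resolve them from the operator's class path, while its own classes stay isolated.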
> h1. Goal
> Build the {{PluginManager}} exactly once per JVM (operator and webhook) and
> pass it explicitly into every consumer that performs plugin discovery.
> Promote the listener-only parent-first patterns to a process-wide default so
> all plugin types share the operator's fabric8 client and Jackson.
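
The shape of the change can be sketched as follows. This is an assumption-laden illustration rather than the merged code: {{PluginSource}} is a hypothetical stand-in for the plugin manager, {{scanPluginFolder}} stands in for the expensive folder walk and class loader creation, and the consumer methods are simplified placeholders for the four operator call sites listed in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SharedPluginManagerSketch {
    /** Hypothetical stand-in for a plugin manager; not the real Flink interface. */
    interface PluginSource {
        <T> List<T> load(Class<T> spi);
    }

    /** Counts how many times the (simulated) plugin folder scan runs. */
    static final AtomicInteger SCANS = new AtomicInteger();

    /** Stand-in for the expensive step: folder walk plus class loader creation. */
    static PluginSource scanPluginFolder() {
        SCANS.incrementAndGet();
        return new PluginSource() {
            @Override
            public <T> List<T> load(Class<T> spi) {
                return new ArrayList<>(); // no real plugins in this sketch
            }
        };
    }

    // Placeholder consumers: each receives the manager instead of building its own.
    static void initMetrics(PluginSource plugins) { plugins.load(Object.class); }
    static void discoverValidators(PluginSource plugins) { plugins.load(Object.class); }
    static void discoverListeners(PluginSource plugins) { plugins.load(Object.class); }
    static void initFileSystems(PluginSource plugins) { plugins.load(Object.class); }

    /** Before: every consumer triggers its own scan. Returns the number of scans. */
    static int startOperatorPerConsumer() {
        SCANS.set(0);
        initMetrics(scanPluginFolder());
        discoverValidators(scanPluginFolder());
        discoverListeners(scanPluginFolder());
        initFileSystems(scanPluginFolder());
        return SCANS.get();
    }

    /** After: one scan per JVM, passed explicitly to every consumer. */
    static int startOperatorShared() {
        SCANS.set(0);
        PluginSource shared = scanPluginFolder();
        initMetrics(shared);
        discoverValidators(shared);
        discoverListeners(shared);
        initFileSystems(shared);
        return SCANS.get();
    }

    public static void main(String[] args) {
        System.out.println("scans, per consumer: " + startOperatorPerConsumer());
        System.out.println("scans, shared:       " + startOperatorShared());
    }
}
```

Passing the instance explicitly, rather than hiding it behind a static singleton, keeps the dependency visible at each call site and lets tests supply their own manager.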
> h1. Scope
> * Operator JVM: collapse four {{PluginManager}} constructions into one.
> * Webhook JVM: collapse two into one.
> * Class loader policy: apply the {{io.fabric8}} and {{com.fasterxml}}
> parent-first patterns to all plugin types, not just listeners.
> This is startup-only. The reconcile loop and runtime behavior are unaffected.
> No CRD, public API, or dependency changes.