Dennis-Mircea Ciupitu created FLINK-39688:
---------------------------------------------
Summary: Share a single PluginManager across operator and webhook
startup
Key: FLINK-39688
URL: https://issues.apache.org/jira/browse/FLINK-39688
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.15.0
Reporter: Dennis-Mircea Ciupitu
Fix For: kubernetes-operator-1.16.0
h1. Summary
The Flink Kubernetes Operator constructs a brand-new {{PluginManager}} four
separate times during operator JVM startup, plus two more times during webhook
JVM startup. Each call independently rescans {{/opt/flink/plugins}}, builds
isolated child class loaders for every plugin subdirectory, and emits a
duplicate {{"Plugin loader with ID not found, creating it: <subdir>"}} batch in
the logs. This refactor consolidates plugin discovery onto a single shared
{{PluginManager}} instance per JVM.
h1. Problem
Five different consumers of the Flink plugin SPI each call
{{PluginUtils.createPluginManagerFromRootFolder(...)}} on their own:
* {{OperatorMetricUtils.initOperatorMetrics}} (metric reporters)
* {{ValidatorUtils.discoverValidators}} (validators)
* {{ListenerUtils.discoverListeners}} (resource listeners)
* {{FlinkOperator}} constructor (file system factories, via
{{FileSystem.initialize}})
* {{MutatorUtils.discoverMutators}} (mutating webhook plugins)
In the operator process this produces four full plugin scans during startup. In
the webhook process it produces two more. Each scan walks the plugin folder,
opens every JAR manifest, and instantiates a fresh isolated {{URLClassLoader}}
per plugin subdirectory. Those class loaders are retained for the lifetime of
the JVM because the loaded SPI factories pin them.
h1. Symptoms
h2. Log noise on every startup
A representative operator startup log shows the same eight {{flink-metrics-*}}
subdirectories being announced four times in a row, drowning out meaningful
startup messages:
{noformat}
o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found,
creating it: flink-metrics-slf4j
o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found,
creating it: flink-metrics-statsd
... (repeated 4x)
{noformat}
h2. Wasted memory and I/O
Four separate {{PluginManager}} instances mean four times the class loaders
for the same JAR set, all retained for the JVM lifetime. The repeated folder
walks and JAR opens are also pure overhead at startup.
h2. Latent class loader identity hazard
Today each call site loads a different SPI ({{MetricReporterFactory}},
{{FileSystemFactory}}, {{FlinkResourceValidator}},
{{FlinkResourceListener}}, {{FlinkResourceMutator}}), so no plugin
class is loaded twice through different class loaders. If a plugin ever
participated in two SPIs, however, the duplicated managers would silently
produce two distinct {{Class}} instances for the same plugin class, causing
cast failures, double initialization, or singleton drift. Sharing a single
manager removes this whole class of bug.
h2. Inconsistent class loader policy across consumers
Only the listener call site enriches the configuration with parent-first
patterns for {{io.fabric8}} and {{com.fasterxml}} so listener plugins share the
operator's fabric8 client and Jackson. The other four consumers do not, which
means a metric reporter, file system factory, validator, or mutator that
bundles its own version of those libraries can end up with a duplicate fabric8
client (its own informers, its own connection state) or version-mismatched
Jackson. Centralizing the policy makes it consistent across every plugin type.
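To illustrate what a parent-first pattern decides, here is a simplified model, not Flink's actual class loader code. It assumes patterns are matched as class-name prefixes; {{ParentFirstPolicySketch}}, {{isParentFirst}}, and {{com.example.MyReporter}} are hypothetical names introduced only for this sketch.

```java
import java.util.List;

// Simplified model of parent-first pattern matching; not Flink's actual class loader code.
class ParentFirstPolicySketch {

    /** Patterns named in this issue; treated here as class-name prefixes (an assumption). */
    static final List<String> PARENT_FIRST = List.of("io.fabric8", "com.fasterxml");

    /** Returns true when a class would be resolved by the parent (operator) loader first. */
    static boolean isParentFirst(String className) {
        for (String prefix : PARENT_FIRST) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A fabric8 class matches a pattern, so the plugin would share the operator's copy.
        System.out.println(isParentFirst("io.fabric8.kubernetes.client.KubernetesClient"));
        // A hypothetical plugin's own class matches nothing and stays in the plugin loader.
        System.out.println(isParentFirst("com.example.MyReporter"));
    }
}
```

Under this model, a plugin that bundles its own fabric8 or Jackson JARs would still resolve those packages from the operator's class loader, which is the behavior the listener call site already relies on.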
h1. Goal
Build the {{PluginManager}} exactly once per JVM (operator and webhook) and
pass it explicitly into every consumer that performs plugin discovery. Promote
the listener-only parent-first patterns to a process-wide default so all plugin
types share the operator's fabric8 client and Jackson.
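A minimal sketch of the intended shape, not the operator's actual code. {{StubPluginManager}}, {{discoverValidators}}, {{discoverListeners}}, and {{startOperator}} are hypothetical stand-ins for the real Flink and operator types; they only show one manager being constructed and then handed to each consumer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins only; none of these types are the real Flink or operator classes.
class SharedPluginManagerSketch {

    /** Counts simulated plugin-folder scans so the effect of sharing is observable. */
    static final AtomicInteger SCANS = new AtomicInteger();

    /** Stand-in for a plugin manager: constructing one simulates a full folder scan. */
    static final class StubPluginManager {
        StubPluginManager() {
            SCANS.incrementAndGet();
        }

        /** Stand-in for SPI lookup; a real manager would return the discovered plugins. */
        <T> List<T> load(Class<T> spi) {
            return List.of();
        }
    }

    interface Validator {}

    interface Listener {}

    // Each consumer receives the manager as a parameter instead of building its own.
    static List<Validator> discoverValidators(StubPluginManager pluginManager) {
        return pluginManager.load(Validator.class);
    }

    static List<Listener> discoverListeners(StubPluginManager pluginManager) {
        return pluginManager.load(Listener.class);
    }

    /** Simulated startup: one construction, several consumers. Returns scans performed. */
    static int startOperator() {
        int before = SCANS.get();
        StubPluginManager shared = new StubPluginManager();
        discoverValidators(shared);
        discoverListeners(shared);
        return SCANS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("scans during startup: " + startOperator());
    }
}
```

Passing the manager as an explicit parameter, rather than hiding it behind a static holder, keeps the single-construction guarantee visible at each call site and makes the consumers easy to test with a stub.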
h1. Scope
* Operator JVM: collapse four {{PluginManager}} constructions into one.
* Webhook JVM: collapse two into one.
* Class loader policy: apply the {{io.fabric8}} and {{com.fasterxml}}
parent-first patterns to all plugin types, not just listeners.
This is startup-only. The reconcile loop and runtime behavior are unaffected.
No CRD, public API, or dependency changes.