Dennis-Mircea Ciupitu created FLINK-39688:
---------------------------------------------

             Summary: Share a single PluginManager across operator and webhook 
startup
                 Key: FLINK-39688
                 URL: https://issues.apache.org/jira/browse/FLINK-39688
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.15.0
            Reporter: Dennis-Mircea Ciupitu
             Fix For: kubernetes-operator-1.16.0


h1. Summary

The Flink Kubernetes Operator constructs a brand-new {{PluginManager}} four 
separate times during operator JVM startup, plus two more times during webhook 
JVM startup. Each construction independently rescans {{/opt/flink/plugins}}, 
builds isolated child class loaders for every plugin subdirectory, and emits a 
duplicate batch of {{"Plugin loader with ID not found, creating it: <subdir>"}} 
lines in the logs. This refactor consolidates plugin discovery onto a single 
shared {{PluginManager}} instance per JVM.
h1. Problem

Five different consumers of the Flink plugin SPI each call 
{{PluginUtils.createPluginManagerFromRootFolder(...)}} on their own:
 * {{OperatorMetricUtils.initOperatorMetrics}} (metric reporters)
 * {{ValidatorUtils.discoverValidators}} (validators)
 * {{ListenerUtils.discoverListeners}} (resource listeners)
 * {{FlinkOperator}} constructor (file system factories, via 
{{FileSystem.initialize}})
 * {{MutatorUtils.discoverMutators}} (mutating webhook plugins)

In the operator process this produces four full plugin scans during startup. In 
the webhook process it produces two more. Each scan walks the plugin folder, 
opens every JAR manifest, and instantiates a fresh isolated {{URLClassLoader}} 
per plugin subdirectory. Those class loaders are retained for the lifetime of 
the JVM because the loaded SPI factories pin them.
h1. Symptoms
h2. Log noise on every startup

A representative operator startup log shows the same eight {{flink-metrics-*}} 
subdirectories being announced four times in a row, drowning out meaningful 
startup messages:
{noformat}
o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found, 
creating it: flink-metrics-slf4j
o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found, 
creating it: flink-metrics-statsd
... (repeated 4x)
{noformat}
h2. Wasted memory and I/O

Four separate {{PluginManager}} instances mean four times the class loaders 
for the same JAR set, all retained for the JVM lifetime. The repeated folder 
walks and JAR opens are also pure overhead at startup.
h2. Latent class loader identity hazard

Today each call site loads a different SPI ({{MetricReporterFactory}}, 
{{FileSystemFactory}}, {{FlinkResourceValidator}}, {{FlinkResourceListener}}, 
{{FlinkResourceMutator}}), so no plugin class is loaded twice through different 
class loaders. If a plugin ever participated in two SPIs, however, the 
duplicated managers would silently produce two distinct {{Class}} instances for 
the same plugin class, causing cast failures, double initialization, or 
singleton drift. Sharing a single manager removes this whole class of bug.
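
The hazard can be reproduced in plain Java without any Flink code. The sketch below is illustrative only: {{SamplePlugin}}, {{IsolatingLoader}} and {{LoaderIdentityDemo}} are hypothetical names, and the loader is a simplified stand-in for a per-manager plugin class loader, not the operator's actual implementation. It assumes the compiled class files are readable as resources on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a class shipped inside a plugin JAR. */
class SamplePlugin {}

/** Simplified isolated loader: defines its own copy of SamplePlugin instead of delegating. */
class IsolatingLoader extends ClassLoader {
    private static final String TARGET = SamplePlugin.class.getName();

    IsolatingLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        if (!TARGET.equals(name)) {
            // Everything except the plugin class uses normal parent delegation.
            return super.loadClass(name, resolve);
        }
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                String resource = name.replace('.', '/') + ".class";
                try (InputStream in = getParent().getResourceAsStream(resource)) {
                    if (in == null) {
                        throw new ClassNotFoundException(name);
                    }
                    byte[] bytes = in.readAllBytes();
                    // Defining the bytes here makes this loader the defining loader of the class.
                    loaded = defineClass(name, bytes, 0, bytes.length);
                } catch (IOException e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}

class LoaderIdentityDemo {
    /** Loads SamplePlugin through a fresh isolated loader, as a separate manager would. */
    static Class<?> loadIsolated() {
        try {
            ClassLoader parent = LoaderIdentityDemo.class.getClassLoader();
            return new IsolatingLoader(parent).loadClass(SamplePlugin.class.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Class<?> viaFirstManager = loadIsolated();
        Class<?> viaSecondManager = loadIsolated();
        System.out.println("same name:  " + viaFirstManager.getName().equals(viaSecondManager.getName()));
        System.out.println("same class: " + (viaFirstManager == viaSecondManager));
    }
}
```

The two {{Class}} objects share a name but not an identity, so an instance created from one cannot be cast to the other. A single shared manager avoids this because each plugin directory gets exactly one class loader.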
h2. Inconsistent class loader policy across consumers

Only the listener call site enriches the configuration with parent-first 
patterns for {{io.fabric8}} and {{com.fasterxml}} so listener plugins share the 
operator's fabric8 client and Jackson. The other four consumers do not, which 
means a metric reporter, file system factory, validator, or mutator that 
bundles its own version of those libraries can end up with a duplicate fabric8 
client (its own informers, its own connection state) or version-mismatched 
Jackson. Centralizing the policy makes it consistent across every plugin type.
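
As a rough illustration of what a centralized policy decides, the sketch below checks a class name against the two patterns named above. It assumes the patterns are matched as plain name prefixes. {{ParentFirstPolicySketch}} and {{isParentFirst}} are hypothetical names, not part of the operator or of Flink.

```java
import java.util.List;

class ParentFirstPolicySketch {
    /** The two patterns from this issue, applied to every plugin type. */
    static final List<String> PARENT_FIRST_PATTERNS = List.of("io.fabric8", "com.fasterxml");

    /**
     * Returns true when a class should be resolved through the operator's (parent)
     * class loader before the plugin's own loader. Assumption: prefix matching.
     */
    static boolean isParentFirst(String className) {
        for (String pattern : PARENT_FIRST_PATTERNS) {
            if (className.startsWith(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shared with the operator: resolved parent-first.
        System.out.println(isParentFirst("io.fabric8.kubernetes.client.KubernetesClient"));
        System.out.println(isParentFirst("com.fasterxml.jackson.databind.ObjectMapper"));
        // Plugin-private class (hypothetical name): resolved by the plugin's own loader.
        System.out.println(isParentFirst("com.example.metrics.MyReporterFactory"));
    }
}
```

With one policy for every consumer, a plugin that bundles its own fabric8 or Jackson classes would still resolve those packages from the operator, whichever SPI it implements.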
h1. Goal

Build the {{PluginManager}} exactly once per JVM (operator and webhook) and 
pass it explicitly into every consumer that performs plugin discovery. Promote 
the listener-only parent-first patterns to a process-wide default so all plugin 
types share the operator's fabric8 client and Jackson.
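
The intended shape can be sketched as follows. This is a minimal, self-contained illustration rather than the actual patch: {{StubPluginManager}} is a hypothetical stand-in for Flink's {{PluginManager}} that only counts constructions, and the consumer methods borrow names from the call sites listed above but take the manager as a parameter. {{initFileSystems}} is an invented name standing in for the {{FlinkOperator}} constructor's file system initialization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for Flink's PluginManager; it only counts folder scans. */
final class StubPluginManager {
    static final AtomicInteger SCANS = new AtomicInteger();
    private final List<String> pluginDirs;

    StubPluginManager(List<String> pluginDirs) {
        SCANS.incrementAndGet(); // each construction stands for one full plugin-folder scan
        this.pluginDirs = List.copyOf(pluginDirs);
    }

    /** Pretends to load the implementations of one SPI from every plugin directory. */
    List<String> load(String spi) {
        List<String> found = new ArrayList<>();
        for (String dir : pluginDirs) {
            found.add(spi + " from " + dir);
        }
        return found;
    }
}

class SharedPluginManagerSketch {
    // Each consumer receives the shared manager instead of constructing its own.
    static List<String> initOperatorMetrics(StubPluginManager pm) {
        return pm.load("MetricReporterFactory");
    }

    static List<String> discoverValidators(StubPluginManager pm) {
        return pm.load("FlinkResourceValidator");
    }

    static List<String> discoverListeners(StubPluginManager pm) {
        return pm.load("FlinkResourceListener");
    }

    static List<String> initFileSystems(StubPluginManager pm) {
        return pm.load("FileSystemFactory");
    }

    /** Operator startup: one construction, four consumers. Returns the number of loaded entries. */
    static int startOperator(List<String> pluginDirs) {
        StubPluginManager pm = new StubPluginManager(pluginDirs);
        return initOperatorMetrics(pm).size()
                + discoverValidators(pm).size()
                + discoverListeners(pm).size()
                + initFileSystems(pm).size();
    }

    public static void main(String[] args) {
        int loaded = startOperator(List.of("flink-metrics-slf4j", "flink-metrics-statsd"));
        System.out.println("entries=" + loaded + ", scans=" + StubPluginManager.SCANS.get());
    }
}
```

Passing the manager explicitly, rather than hiding it behind a static singleton, keeps the dependency visible at each call site and lets tests supply their own instance.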
h1. Scope
 * Operator JVM: collapse four {{PluginManager}} constructions into one.
 * Webhook JVM: collapse two into one.
 * Class loader policy: apply the {{io.fabric8}} and {{com.fasterxml}} 
parent-first patterns to all plugin types, not just listeners.

This is startup-only. The reconcile loop and runtime behavior are unaffected. 
No CRD, public API, or dependency changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
