[ 
https://issues.apache.org/jira/browse/FLINK-39688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-39688.
------------------------------
    Resolution: Fixed

merged to main e4c56b2be3d844d4c12176f281972a8cc3de49b2

 

> Share a single PluginManager across operator and webhook startup
> ----------------------------------------------------------------
>
>                 Key: FLINK-39688
>                 URL: https://issues.apache.org/jira/browse/FLINK-39688
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.15.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.16.0
>
>
> h1. Summary
> The Flink Kubernetes Operator constructs a brand-new {{PluginManager}} four 
> separate times during operator JVM startup, plus two more times during 
> webhook JVM startup. Each construction independently rescans 
> {{/opt/flink/plugins}}, builds an isolated child class loader for every 
> plugin subdirectory, and logs another batch of {{"Plugin loader with ID not 
> found, creating it: <subdir>"}} messages. This refactor consolidates 
> plugin discovery onto a single shared {{PluginManager}} instance per JVM.
> h1. Problem
> Five different consumers of the Flink plugin SPI each call 
> {{PluginUtils.createPluginManagerFromRootFolder(...)}} on their own:
>  * {{OperatorMetricUtils.initOperatorMetrics}} (metric reporters)
>  * {{ValidatorUtils.discoverValidators}} (validators)
>  * {{ListenerUtils.discoverListeners}} (resource listeners)
>  * {{FlinkOperator}} constructor (file system factories, via 
> {{FileSystem.initialize}})
>  * {{MutatorUtils.discoverMutators}} (mutating webhook plugins)
> In the operator process this produces four full plugin scans during startup. 
> In the webhook process it produces two more. Each scan walks the plugin 
> folder, opens every JAR manifest, and instantiates a fresh isolated 
> {{URLClassLoader}} per plugin subdirectory. Those class loaders are retained 
> for the lifetime of the JVM because the loaded SPI factories pin them.
> h1. Symptoms
> h2. Log noise on every startup
> A representative operator startup log shows the same eight 
> {{flink-metrics-*}} subdirectories being announced four times in a row, 
> drowning out meaningful startup messages:
> {noformat}
> o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found, 
> creating it: flink-metrics-slf4j
> o.a.f.c.p.DefaultPluginManager [INFO ] Plugin loader with ID not found, 
> creating it: flink-metrics-statsd
> ... (repeated 4x)
> {noformat}
> h2. Wasted memory and I/O
> Four {{PluginManager}} instances where one would suffice mean four times the 
> class loaders for the same JAR set, all retained for the JVM lifetime. The 
> repeated folder walks and JAR opens are also pure overhead at startup.
> h2. Latent class loader identity hazard
> Today each call site loads a different SPI ({{MetricReporterFactory}}, 
> {{FileSystemFactory}}, {{FlinkResourceValidator}}, 
> {{FlinkResourceListener}}, {{FlinkResourceMutator}}), so no plugin 
> class is loaded twice through different class loaders. If a plugin ever 
> participated in two SPIs, however, the duplicated managers would silently 
> produce two distinct {{Class}} instances for the same plugin class, causing 
> cast failures, double initialization, or singleton drift. Sharing a single 
> manager removes this whole class of bugs.
> h2. Inconsistent class loader policy across consumers
> Only the listener call site enriches the configuration with parent-first 
> patterns for {{io.fabric8}} and {{com.fasterxml}} so listener plugins share 
> the operator's fabric8 client and Jackson. The other four consumers do not, 
> which means a metric reporter, file system factory, validator, or mutator 
> that bundles its own version of those libraries can end up with a duplicate 
> fabric8 client (its own informers, its own connection state) or 
> version-mismatched Jackson. Centralizing the policy makes it consistent 
> across every plugin type.
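
To illustrate the policy described above, here is a minimal sketch of parent-first delegation. It assumes patterns are matched as plain class-name prefixes; `ParentFirstSketch`, `isParentFirst`, and the `com.example` class name are hypothetical and do not come from the operator code base.

```java
import java.util.List;

/** Sketch of prefix-based parent-first delegation; not the operator's actual code. */
final class ParentFirstSketch {
    // The two prefixes named in the issue.
    static final List<String> PARENT_FIRST_PREFIXES = List.of("io.fabric8", "com.fasterxml");

    /** True if the class should be resolved by the parent (operator) class loader first. */
    static boolean isParentFirst(String className, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "io.fabric8.kubernetes.client.KubernetesClient",
            "com.fasterxml.jackson.databind.ObjectMapper",
            "com.example.plugin.MyReporter" // hypothetical plugin class
        };
        for (String name : samples) {
            String source = isParentFirst(name, PARENT_FIRST_PREFIXES) ? "parent" : "plugin";
            System.out.println(name + " -> " + source + " class loader");
        }
    }
}
```

Under this assumption, a plugin that bundles its own fabric8 or Jackson classes would still resolve them from the operator's class loader, while its own classes stay isolated in the plugin class loader.
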
> h1. Goal
> Build the {{PluginManager}} exactly once per JVM (operator and webhook) and 
> pass it explicitly into every consumer that performs plugin discovery. 
> Promote the listener-only parent-first patterns to a process-wide default so 
> all plugin types share the operator's fabric8 client and Jackson.
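
The goal above can be sketched with a stdlib-only model. `SketchPluginManager` is a hypothetical stand-in for Flink's `PluginManager` that only counts folder scans, and `SharedManagerSketch` with its nested interfaces is likewise invented for illustration; the real change wires the actual manager into the consumers listed in the Problem section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for Flink's PluginManager; it only counts folder scans. */
final class SketchPluginManager {
    static final AtomicInteger SCANS = new AtomicInteger();
    private final List<Object> plugins;

    private SketchPluginManager(List<Object> plugins) {
        this.plugins = plugins;
    }

    /** Models one full plugin-folder scan (the expensive step in the issue). */
    static SketchPluginManager scan(List<Object> pluginsOnDisk) {
        SCANS.incrementAndGet();
        return new SketchPluginManager(List.copyOf(pluginsOnDisk));
    }

    /** Returns the discovered plugins that implement the requested SPI type. */
    <P> List<P> load(Class<P> spi) {
        List<P> matches = new ArrayList<>();
        for (Object plugin : plugins) {
            if (spi.isInstance(plugin)) {
                matches.add(spi.cast(plugin));
            }
        }
        return matches;
    }
}

/** Hypothetical startup wiring: the manager is built once and passed in explicitly. */
final class SharedManagerSketch {
    interface Validator {}
    interface Listener {}

    static List<Validator> discoverValidators(SketchPluginManager pm) {
        return pm.load(Validator.class);
    }

    static List<Listener> discoverListeners(SketchPluginManager pm) {
        return pm.load(Listener.class);
    }

    /** Runs startup discovery and returns how many scans it triggered. */
    static int startupScans(List<Object> pluginsOnDisk) {
        int before = SketchPluginManager.SCANS.get();
        SketchPluginManager shared = SketchPluginManager.scan(pluginsOnDisk);
        discoverValidators(shared);
        discoverListeners(shared);
        return SketchPluginManager.SCANS.get() - before;
    }

    public static void main(String[] args) {
        Object validator = new Validator() {};
        Object listener = new Listener() {};
        System.out.println("scans: " + startupScans(List.of(validator, listener)));
    }
}
```

Passing the manager as a parameter, rather than hiding it behind a static singleton, keeps each discovery helper easy to test with a fake manager and makes the single-scan property visible at the call site.
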
> h1. Scope
>  * Operator JVM: collapse four {{PluginManager}} constructions into one.
>  * Webhook JVM: collapse two into one.
>  * Class loader policy: apply the {{io.fabric8}} and {{com.fasterxml}} 
> parent-first patterns to all plugin types, not just listeners.
> The change is startup-only: the reconcile loop and runtime behavior are 
> unaffected, and there are no CRD, public API, or dependency changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
