qqeasonchen opened a new issue, #5214:
URL: https://github.com/apache/eventmesh/issues/5214

   ### Search before asking
   
   - [x] I have searched the [issues](https://github.com/apache/eventmesh/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Enhancement Request
   
   Description
     The current implementation of eventmesh-operator has several critical
     issues regarding Kubernetes resource management and internal concurrency
     safety, which may lead to deployment failures or unstable behavior in
     production environments.
   
     Issues Identified
   
      1. Missing Headless Service for StatefulSets
          * Problem: The operator creates StatefulSet resources for both
            Runtime and Connectors but does not create the corresponding
            Headless Service, and it does not set the serviceName field in
            the StatefulSet spec.
          * Impact: Pods managed by the StatefulSet will not have stable
            network identities (DNS entries like
            pod-0.service-name.namespace.svc.cluster.local), which are a core
            feature of StatefulSets and essential for cluster communication.
   
      2. Unsafe Global State Usage
          * Problem: A global variable IsEventMeshRuntimeInitialized in
            share/share.go is used to track runtime readiness.
          * Impact: This design is not thread-safe and breaks in multi-tenant
            or multi-cluster scenarios (e.g., managing multiple EventMesh
            clusters in different namespaces). It causes race conditions and
            incorrect dependency checks.
   
      3. Hardcoded Replica Logic
          * Problem: The RuntimeReconciler hardcodes Replicas to 1 in some
            paths, potentially ignoring the replicaPerGroup configuration
            defined in the CRD.
   
      4. Blocking Operations
          * Problem: The controller uses time.Sleep() for retries or waiting.
          * Impact: This blocks the reconciliation thread, reducing the
            operator's throughput and responsiveness. It should use
            reconcile.Result{RequeueAfter: ...} instead.
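
The difference described in item 4 can be sketched as follows. This is a minimal illustration, not the operator's code: Result is a local stand-in for controller-runtime's reconcile.Result so the example builds with the standard library only, and reconcileConnectors, runtimeReady and the 5-second interval are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a local stand-in for controller-runtime's reconcile.Result.
// Only the RequeueAfter field is modelled here.
type Result struct {
	RequeueAfter time.Duration
}

// retryInterval is an assumed value, not taken from the operator.
const retryInterval = 5 * time.Second

// reconcileConnectors (hypothetical) returns immediately in both cases.
// When the runtime is not ready it asks to be called again later instead
// of sleeping, so the worker can process other objects in the meantime.
func reconcileConnectors(runtimeReady bool) (Result, error) {
	if !runtimeReady {
		return Result{RequeueAfter: retryInterval}, nil
	}
	// ... create or update the connector resources here ...
	return Result{}, nil
}

func main() {
	res, err := reconcileConnectors(false)
	if err != nil {
		panic(err)
	}
	fmt.Println("requeue after:", res.RequeueAfter)
}
```

With a time.Sleep() in the same place, the call would hold its worker for the whole wait. Returning a requeue request hands the delay to the controller's work queue.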
   
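
To make the stable network identity from item 1 concrete, the sketch below composes the DNS name a StatefulSet pod receives once serviceName refers to an existing Headless Service. The helper podDNSName and the resource names are hypothetical, and the default cluster domain cluster.local is assumed.

```go
package main

import "fmt"

// podDNSName returns the per-pod DNS name for a StatefulSet pod governed by
// a headless Service. The pod name is "<statefulset>-<ordinal>" and the
// cluster domain is assumed to be the default "cluster.local".
func podDNSName(statefulSet string, ordinal int, service, namespace string) string {
	return fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local", statefulSet, ordinal, service, namespace)
}

func main() {
	// Hypothetical names, used only for illustration.
	fmt.Println(podDNSName("eventmesh-runtime", 0, "eventmesh-runtime-headless", "default"))
}
```

Without the Headless Service and the serviceName field, no such per-pod record exists, so peers cannot address an individual replica by a stable name.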
     
   
   ### Describe the solution you'd like
   
   Proposed Fixes
   
      1. Refactor Controllers:
          * Implement logic to automatically create a Headless Service
            (ClusterIP: None) for each StatefulSet.
          * Ensure the StatefulSet.Spec.ServiceName matches the created Service.
   
      2. Remove Global State:
          * Delete IsEventMeshRuntimeInitialized.
          * Update ConnectorsReconciler to dynamically query the Kubernetes API
            for Runtime resource status to determine readiness.
   
      3. Enhance Robustness:
          * Use the replica values configured in the CR spec instead of hardcoded ones.
          * Replace blocking sleeps with non-blocking requeue mechanisms.
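
As a rough sketch of fixes 1 and 3, the two objects could look like the manifests built below. Plain maps and encoding/json are used so the example has no external dependencies; the operator itself would presumably use the typed Kubernetes API structs. The helper names, resource names, and labels are hypothetical, and how replicaPerGroup maps to the StatefulSet replica count is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// headlessService builds a minimal Service manifest. Setting spec.clusterIP
// to "None" is what makes the Service headless.
func headlessService(name, namespace string, selector map[string]string) map[string]interface{} {
	return map[string]interface{}{
		"apiVersion": "v1",
		"kind":       "Service",
		"metadata":   map[string]interface{}{"name": name, "namespace": namespace},
		"spec": map[string]interface{}{
			"clusterIP": "None",
			"selector":  selector,
		},
	}
}

// statefulSetFragment returns only the StatefulSet spec fields discussed in
// this issue: serviceName must equal the headless Service's name, and
// replicas is passed in by the caller (for example from the CR spec)
// rather than hardcoded to 1.
func statefulSetFragment(serviceName string, replicas int32) map[string]interface{} {
	return map[string]interface{}{
		"spec": map[string]interface{}{
			"serviceName": serviceName,
			"replicas":    replicas,
		},
	}
}

func main() {
	const svcName = "eventmesh-runtime-headless"              // hypothetical name
	selector := map[string]string{"app": "eventmesh-runtime"} // hypothetical label

	for _, obj := range []map[string]interface{}{
		headlessService(svcName, "default", selector),
		statefulSetFragment(svcName, 3),
	} {
		out, err := json.MarshalIndent(obj, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

Passing the same name to both helpers keeps StatefulSet.Spec.ServiceName and the created Service in agreement, which is the invariant fix 1 asks for.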
   
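
Fix 2 could take roughly the following shape: readiness is looked up per Runtime resource, keyed by namespace and name, instead of being read from one process-wide flag. Everything here is hypothetical. RuntimeStatusReader stands in for a query against the Kubernetes API, fakeReader is an in-memory substitute for illustration, and treating "all desired replicas are ready" as readiness is an assumption.

```go
package main

import "fmt"

// ClusterKey identifies one Runtime resource.
type ClusterKey struct {
	Namespace string
	Name      string
}

// RuntimeStatusReader is a hypothetical abstraction over "ask the Kubernetes
// API for the status of this Runtime resource".
type RuntimeStatusReader interface {
	Replicas(key ClusterKey) (ready, desired int32, err error)
}

// runtimeReady decides readiness for a single cluster. The criterion
// (every desired replica is ready) is an assumption.
func runtimeReady(r RuntimeStatusReader, key ClusterKey) (bool, error) {
	ready, desired, err := r.Replicas(key)
	if err != nil {
		return false, err
	}
	return desired > 0 && ready >= desired, nil
}

// fakeReader is an in-memory stand-in used only for this sketch.
// Each value holds {ready, desired}.
type fakeReader map[ClusterKey][2]int32

func (f fakeReader) Replicas(key ClusterKey) (int32, int32, error) {
	v, ok := f[key]
	if !ok {
		return 0, 0, fmt.Errorf("runtime %s/%s not found", key.Namespace, key.Name)
	}
	return v[0], v[1], nil
}

func main() {
	reader := fakeReader{
		{Namespace: "team-a", Name: "eventmesh"}: {2, 2},
		{Namespace: "team-b", Name: "eventmesh"}: {0, 2},
	}
	for _, ns := range []string{"team-a", "team-b"} {
		ok, err := runtimeReady(reader, ClusterKey{Namespace: ns, Name: "eventmesh"})
		fmt.Println(ns, ok, err)
	}
}
```

Because the answer depends on the key, two EventMesh clusters in different namespaces no longer share one readiness flag, and no mutable package-level state is left to race on.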
     Environment
      * EventMesh Version: (Current Master)
      * Kubernetes Version: Any
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

