qqeasonchen opened a new issue, #5214: URL: https://github.com/apache/eventmesh/issues/5214
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/eventmesh/issues?q=is%3Aissue) and found no similar issues.

### Enhancement Request

Description

The current implementation of eventmesh-operator has several critical issues regarding Kubernetes resource management and internal concurrency safety, which may lead to deployment failures or unstable behavior in production environments.

Issues Identified

1. Missing Headless Service for StatefulSets
   * Problem: The operator creates StatefulSet resources for both Runtime and Connectors but fails to create the corresponding Headless Service. It also does not set the `serviceName` field in the StatefulSet spec.
   * Impact: Pods managed by the StatefulSet will not have stable network identities (DNS entries like `pod-0.service-name.namespace.svc.cluster.local`), which is a core feature of StatefulSets and essential for cluster communication.
2. Unsafe Global State Usage
   * Problem: A global variable `IsEventMeshRuntimeInitialized` in `share/share.go` is used to track runtime readiness.
   * Impact: This design is not thread-safe and breaks in multi-tenant or multi-cluster scenarios (e.g., managing multiple EventMesh clusters in different namespaces). It causes race conditions and incorrect dependency checks.
3. Hardcoded Replica Logic
   * Problem: The `RuntimeReconciler` hardcodes `Replicas` to 1 in some paths, potentially ignoring the `replicaPerGroup` configuration defined in the CRD.
4. Blocking Operations
   * Problem: The controller uses `time.Sleep()` for retries or waiting.
   * Impact: This blocks the reconciliation thread, reducing the operator's throughput and responsiveness. It should use `reconcile.Result{RequeueAfter: ...}` instead.

### Describe the solution you'd like

Proposed Fixes

1. Refactor Controllers:
   * Implement logic to automatically create a Headless Service (`ClusterIP: None`) for each StatefulSet.
   * Ensure the `StatefulSet.Spec.ServiceName` matches the created Service.
2. Remove Global State:
   * Delete `IsEventMeshRuntimeInitialized`.
   * Update `ConnectorsReconciler` to dynamically query the Kubernetes API for Runtime resource status to determine readiness.
3. Enhance Robustness:
   * Use correct replica values from the CRSpec.
   * Replace blocking sleeps with non-blocking requeue mechanisms.

Environment

* EventMesh Version: (Current Master)
* Kubernetes Version: Any

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
