platinumhamburg opened a new issue, #2935: URL: https://github.com/apache/fluss/issues/2935
### Search before asking - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar. ### Fluss version 0.9.0 (latest release) ### Please describe the bug 🐞 ## Description `TableChangeWatcherTest.testSchemaChanges` intermittently fails on CI with a missing `CreateTableEvent` for one of the 10 created tables. ## Observed Failure ``` Expecting actual: [SchemaChangeEvent{table_0}, CreateTableEvent{table_0}, ..., SchemaChangeEvent{table_4}, SchemaChangeEvent{table_5}, CreateTableEvent{table_5}, ...] but could not find: [CreateTableEvent{table_4}] ``` Key observations from CI log: - 20 `SchemaChangeEvent`s received (expected 10) — table_4 has a **duplicate** SchemaChangeEvent - 19 `CreateTableEvent`s received (expected 10) — table_4's `CreateTableEvent` is **missing** - The missing event is replaced by an extra `SchemaChangeEvent`, suggesting stale data interference ## Root Cause Hypothesis `testTableChanges` and `testSchemaChanges` both create tables named `table_0` through `table_9`. In `beforeEach`, the database is dropped and recreated, and a new `TableChangeWatcher` (backed by `CuratorCache`) is started. However, `CuratorCache.start()` performs an asynchronous initial tree scan. If the initial sync has not completed before new tables are created, the cache may hold stale `oldData` from the previous test's ZK nodes. When the new table_4 node is written, `CuratorCache` fires `NODE_CHANGED` with non-empty `oldData` (from the stale cache), causing the event to be routed to `processTableRegistrationChange()` instead of `processCreateTable()`: ```java // TableChangeWatcher.java:134-140 if (oldData != null && oldData.getData() != null && oldData.getData().length > 0) { processTableRegistrationChange(tablePath, newData); // ← wrong branch } else { processCreateTable(tablePath, newData); // ← correct branch } ``` This explains why: - Only 1 table is affected (depends on timing of initial sync vs table creation) - `SchemaChangeEvent` is duplicated (initial sync picks up stale schema node) - `CreateTableEvent` is lost (routed to wrong handler) ## Reproducibility This race cannot be reliably reproduced locally (50/50 passes) because local ZooKeeper is fast enough that `CuratorCache` initial sync completes before any table creation begins. The race only manifests on slow CI hosts with shared resources. ### Solution - Add `awaitInitialized()` to `TableChangeWatcher` using `CuratorCacheListener.initialized()` callback - Call `awaitInitialized()` in test `beforeEach` after `start()` to ensure cache is fully synced - Clear stale events after initial sync completes ### Are you willing to submit a PR? - [x] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
