JackieTien97 opened a new pull request, #17831: URL: https://github.com/apache/iotdb/pull/17831
## Problem Many cluster metadata operations broadcast a cache-invalidation to **every** registered DataNode and treat *any* unreachable DataNode as a hard failure. As a result, with multiple replicas configured, a **single down DataNode** still makes these operations fail — contradicting the HA goal. The original report was table-model DDL (`CREATE TABLE`, `ALTER TABLE ...`), but the same pattern affects ~all Tier-A metadata ops (tree-model schema, templates, views, TTL, `DROP DATABASE`, permissions). The reason the broadcast cannot simply "skip" an unreachable DataNode is **correctness**: a DataNode caches ConfigNode-pushed metadata (table/tree schema, permissions, TTL, ...). If it misses an invalidation during a network partition and keeps serving the stale cache, it can produce dirty data / stale reads / stale authorization. ## Approach — metadata lease + self-fencing + a broadcast verdict 1. **DataNode self-fencing (DN side, fail-closed).** The ConfigNode already heartbeats every DataNode. A DataNode records the last ConfigNode-heartbeat time on its own monotonic clock; if none arrives within `metadata_lease_fence_ms` (`T_fence`, default 20s, aligned with the failure detector), it considers itself **fenced** and stops trusting ConfigNode-pushed caches. On the next heartbeat (recovery) it resyncs. 2. **ConfigNode broadcast verdict (CN side, liveness).** Instead of "any unreachable DataNode fails the op", the ConfigNode proceeds once every un-acked DataNode is **provably self-fenced** — out of contact for at least `T_proceed = T_fence + margin`, *and* known to support fencing. Such a DataNode is fail-closed and will resync, so it cannot serve dirty data. A DataNode that is reachable-but-unacked (still possibly serving) blocks the op (wait, then fail). This gives the desired CP behavior: a healthy majority makes progress; a partitioned minority fails closed. ## What this PR contains **DataNode side (fail-closed when fenced):** - `MetadataLeaseManager` — lease tracking, lazy fence check, recovery listeners (monotonic clock, injectable for tests) - Table schema cache (`DataNodeTableCache`) — throws retryable; tree schema cache (`TreeDeviceSchemaCacheManager`) — forces re-fetch from the quorum-backed SchemaRegion; permission cache (`ClusterAuthorityFetcher`) — drops cache → re-fetch (deny while partitioned); TTL — compaction uses an infinite TTL when fenced (never deletes by a stale TTL) - Heartbeat records contact + reports a `supportsMetadataLeaseFencing` capability bit; a `metadata_lease_heartbeat_age_ms` metric **ConfigNode side (proceed past provably-fenced DataNodes):** - `MetadataBroadcastVerdict` (pure decision logic) + `DataNodeContactTracker` (sound per-DataNode last-successful-response signal, capability bit) + `ClusterCachePropagator` (broadcast → verdict → PROCEED/WAIT/FAIL with retry) - Leadership/removal lifecycle hooks for the contact tracker - All Tier-A procedures wired through the propagator: table DDL (`CreateTable` + all 8 alter/drop procedures, forward + rollback), tree-model schema (`DeleteTimeSeries`, `AlterTimeSeriesDataType`, `AlterEncodingCompressor`, logical views, `DeactivateTemplate`), templates (`Set`/`Unset`), `SetTTL`, and `DeleteDatabase` (sync path). Region-task / physical-deletion broadcasts are deliberately left on region consensus. - `AuthOperationProcedure`: fixed a silent permission-staleness hole (it used to drop an unacked DataNode after a timeout, leaving a live DataNode serving a just-revoked permission); now proceeds only past provably-fenced DataNodes. ## Testing - Unit tests for every new component and each fail-closed injection (`MetadataLeaseManager`, `DataNodeTableCache`, tree schema, permission, TTL-compaction, `MetadataBroadcastVerdict`, `DataNodeContactTracker`, `ClusterCachePropagator`), plus no-regression on the affected procedures. - **1C3D integration test** (`IoTDBTableDDLHAIT`): with one DataNode stopped, `CREATE TABLE` **succeeds** after ~`T_proceed` (instead of failing immediately). A new `setMetadataLeaseFenceMs` IT-config setter keeps the wait short. ## Configuration & compatibility - New knob `metadata_lease_fence_ms` (default 20000). `T_proceed = T_fence + internal margin (~5s)`. - Takes effect on upgrade, no switch. Rolling-upgrade safe via the capability bit: an old DataNode (no fencing) is never treated as fenced, so the ConfigNode falls back to the strict (current) semantics for it; correctness is preserved DataNode-by-DataNode. ## Not in this PR (follow-up) Tier-B resource operations (DROP/CREATE `FUNCTION`/`TRIGGER`/`PIPE PLUGIN`, `SET SYSTEM STATUS=ReadOnly`) and the quota re-pull. These need DataNode-side fail-closed on resource **execution** (not a read-through cache) plus a resource recovery-resync mechanism, which is a separate, self-contained effort. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
