[PR] Make Tier-A metadata operations HA when a DataNode is down (metadata lease + self-fencing) [iotdb]

via GitHub Wed, 03 Jun 2026 00:22:26 -0700


JackieTien97 opened a new pull request, #17831:
URL: https://github.com/apache/iotdb/pull/17831


   ## Problem
   
   Many cluster metadata operations broadcast a cache-invalidation to **every** 
registered DataNode and treat *any* unreachable DataNode as a hard failure. As 
a result, with multiple replicas configured, a **single down DataNode** still 
makes these operations fail — contradicting the HA goal. The original report 
was table-model DDL (`CREATE TABLE`, `ALTER TABLE ...`), but the same pattern 
affects ~all Tier-A metadata ops (tree-model schema, templates, views, TTL, 
`DROP DATABASE`, permissions).
   
   The reason the broadcast cannot simply "skip" an unreachable DataNode is 
**correctness**: a DataNode caches ConfigNode-pushed metadata (table/tree 
schema, permissions, TTL, ...). If it misses an invalidation during a network 
partition and keeps serving the stale cache, it can produce dirty data / stale 
reads / stale authorization.
   
   ## Approach — metadata lease + self-fencing + a broadcast verdict
   
   1. **DataNode self-fencing (DN side, fail-closed).** The ConfigNode already 
heartbeats every DataNode. A DataNode records the last ConfigNode-heartbeat 
time on its own monotonic clock; if none arrives within 
`metadata_lease_fence_ms` (`T_fence`, default 20s, aligned with the failure 
detector), it considers itself **fenced** and stops trusting ConfigNode-pushed 
caches. On the next heartbeat (recovery) it resyncs.
   2. **ConfigNode broadcast verdict (CN side, liveness).** Instead of "any 
unreachable DataNode fails the op", the ConfigNode proceeds once every un-acked 
DataNode is **provably self-fenced** — out of contact for at least `T_proceed = 
T_fence + margin`, *and* known to support fencing. Such a DataNode is 
fail-closed and will resync, so it cannot serve dirty data. A DataNode that is 
reachable-but-unacked (still possibly serving) blocks the op (wait, then fail).
   
   This gives the desired CP behavior: a healthy majority makes progress; a 
partitioned minority fails closed.
   
   ## What this PR contains
   
   **DataNode side (fail-closed when fenced):**
   - `MetadataLeaseManager` — lease tracking, lazy fence check, recovery 
listeners (monotonic clock, injectable for tests)
   - Table schema cache (`DataNodeTableCache`) — throws retryable; tree schema 
cache (`TreeDeviceSchemaCacheManager`) — forces re-fetch from the quorum-backed 
SchemaRegion; permission cache (`ClusterAuthorityFetcher`) — drops cache → 
re-fetch (deny while partitioned); TTL — compaction uses an infinite TTL when 
fenced (never deletes by a stale TTL)
   - Heartbeat records contact + reports a `supportsMetadataLeaseFencing` 
capability bit; a `metadata_lease_heartbeat_age_ms` metric
   
   **ConfigNode side (proceed past provably-fenced DataNodes):**
   - `MetadataBroadcastVerdict` (pure decision logic) + 
`DataNodeContactTracker` (sound per-DataNode last-successful-response signal, 
capability bit) + `ClusterCachePropagator` (broadcast → verdict → 
PROCEED/WAIT/FAIL with retry)
   - Leadership/removal lifecycle hooks for the contact tracker
   - All Tier-A procedures wired through the propagator: table DDL 
(`CreateTable` + all 8 alter/drop procedures, forward + rollback), tree-model 
schema (`DeleteTimeSeries`, `AlterTimeSeriesDataType`, 
`AlterEncodingCompressor`, logical views, `DeactivateTemplate`), templates 
(`Set`/`Unset`), `SetTTL`, and `DeleteDatabase` (sync path). Region-task / 
physical-deletion broadcasts are deliberately left on region consensus.
   - `AuthOperationProcedure`: fixed a silent permission-staleness hole (it 
used to drop an unacked DataNode after a timeout, leaving a live DataNode 
serving a just-revoked permission); now proceeds only past provably-fenced 
DataNodes.
   
   ## Testing
   
   - Unit tests for every new component and each fail-closed injection 
(`MetadataLeaseManager`, `DataNodeTableCache`, tree schema, permission, 
TTL-compaction, `MetadataBroadcastVerdict`, `DataNodeContactTracker`, 
`ClusterCachePropagator`), plus no-regression on the affected procedures.
   - **1C3D integration test** (`IoTDBTableDDLHAIT`): with one DataNode 
stopped, `CREATE TABLE` **succeeds** after ~`T_proceed` (instead of failing 
immediately). A new `setMetadataLeaseFenceMs` IT-config setter keeps the wait 
short.
   
   ## Configuration & compatibility
   
   - New knob `metadata_lease_fence_ms` (default 20000). `T_proceed = T_fence + 
internal margin (~5s)`.
   - Takes effect on upgrade, no switch. Rolling-upgrade safe via the 
capability bit: an old DataNode (no fencing) is never treated as fenced, so the 
ConfigNode falls back to the strict (current) semantics for it; correctness is 
preserved DataNode-by-DataNode.
   
   ## Not in this PR (follow-up)
   
   Tier-B resource operations (DROP/CREATE `FUNCTION`/`TRIGGER`/`PIPE PLUGIN`, 
`SET SYSTEM STATUS=ReadOnly`) and the quota re-pull. These need DataNode-side 
fail-closed on resource **execution** (not a read-through cache) plus a 
resource recovery-resync mechanism, which is a separate, self-contained effort.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Make Tier-A metadata operations HA when a DataNode is down (metadata lease + self-fencing) [iotdb]

Reply via email to