Hi Chris,

The existing heartbeat mechanism on the ConfigNode side already covers part
of this. It can mark a disconnected DataNode as *Unknown*.

The remaining issue is that the ConfigNode cannot tell whether that
DataNode is really down, or whether it is only isolated by a network
partition. In the network partition case, the DataNode may still be
running, and its local caches may not have been cleared. If a client
connects directly to that DataNode, the request may still be handled based
on potentially stale or incorrect cache contents.

So I think the main purpose of the lease mechanism is to give the DataNode
a clear point at which it actively clears its own caches and stops serving
external requests. It also gives the ConfigNode a deterministic boundary:
after the lease expires, regardless of whether the DataNode is down or
partitioned, it can no longer provide service externally. Then the
ConfigNode can safely allow the metadata procedure to continue.


Best regards,

---------------------------

Yuan Tian

On Thu, 2 Jul 2026 21:59:25 +0000, Christofer Dutz [email protected]
wrote:

Would a periodic background ping operation, marking nodes as unreliable not
be an option?

Chris

Gesendet von Outlook für Androidhttps://aka.ms/AAb9ysg

From: alpass [email protected]
Sent: Thursday, 02 July 2026 12:42:46
To: [email protected] [email protected]
Subject: [DISCUSS] introduce the lease and self-fence mechanism to support
high availability of table metadata operation procedure

Hi everyone,

I would like to initiate a discussion on introducing a lease-based
self-fencing framework to guarantee High Availability (HA) for table
metadata operations in IoTDB.

   1. Background & Problem Statement

Currently, in a cluster deployment (e.g., 1 ConfigNode, 3 DataNodes),
metadata operations lack true high availability. If a single DataNode (DN)
crashes or experiences a network partition, execution of DDL procedures
(such as Create/Alter/Drop Table, View management, TTL adjustments, and
Drop Database) will directly fail and trigger a rollback.

   1. Proposed Technical Solution

To solve this, the Lease-based Self-Fencing Framework is proposed. The core
logic ensures that unresponsive nodes automatically isolate themselves,
allowing the rest of the cluster to safely proceed with schema changes.

DataNode-Side: Self-Fencing

MetadataLeaseManager: Tracks a lease updated by periodic CN heartbeats
using a monotonic clock.

Fencing Mechanism: If no heartbeat is received within
metadata_lease_fence_ms (new config, such as 20,000 ms as default), the DN
determines it is isolated and self-fences. It clears its cache and
explicitly blocks all read/write operations by throwing a exception.

ConfigNode-Side: Broadcast Coordination

DataNodeContactTracker: Precisely records the timestamp of the last
successful heartbeat response for each DN.

ClusterCachePropagator & MetadataBroadcastVerdict: When broadcasting table
metadata operation procedure, the CN includes a retry loop. If a DN is
unacknowledged, the CN waits until T_proceed (the fence duration + an
internal margin). Once the time elapses, the CN logically proves the silent
DN has self-fenced and allows the procedure to proceed safely.

Schema State Transitions

For deletion operations (Drop Table/View), a PreDeleteTsTable marker state
is introduced. This ensures safe state transitions and remind user of
deleting again when procedure fails.

   1. Expected Behavior Summary

With this framework, the expected cluster behavior under abnormal
conditions will be:

DN Down: Schema operations execute successfully. The CN waits for the
timeout, confirms the node is dead/fenced, and proceeds.

DN Network Partition: Schema operations execute successfully on the healthy
cluster majority. The partitioned DN self-fences. Any client connecting to
the partitioned DN to execute data/metadata operations will receive an
explicit exception stating the node is isolated.

I would highly appreciate your thoughts, feedback, or any suggestions you
might have on this proposed design.

   1. 背景

目前,在集群部署(例如:1 个 ConfigNode,3 个 DataNode)中,元数据操作缺乏真正的高可用性。如果单个
DataNode(DN)发生崩溃或遭遇网络分区,表 scheme 相关的DDL执行(如创建/修改/删除表、视图管理、TTL
调整以及删除数据库)将直接失败并触发回滚。

   1. 拟定技术方案

为了解决这一问题,我们提出了基于租约的自隔离框架。其核心逻辑在于确保未响应的节点能够自动进行自我隔离,从而允许集群的其余部分安全地继续进行模式(Schema)变更。

DataNode 端:自我隔离(Self-Fencing)

MetadataLeaseManager(元数据租约管理器):通过单调时钟(Monotonic Clock)跟踪由
ConfigNode(CN)定期心跳更新的租约。

隔离机制(Fencing Mechanism):如果在 metadata_lease_fence_ms(新增配置项,默认值设为 20,000
毫秒)内未收到心跳,DN 将判定自己已被隔离并执行自隔离。它会清空自身的缓存,并通过抛出异常的方式明确拦截所有的读写操作。

ConfigNode 端:广播协调(Broadcast Coordination)

DataNodeContactTracker(DataNode 状态追踪器):精准记录每个 DN 最后一次成功响应心跳的时间戳。

ClusterCachePropagator(集群缓存传播器)与
MetadataBroadcastVerdict(元数据广播裁决器):在广播缓存失效信息时,CN 会引入重试循环。如果某个 DN
未响应(Unacknowledged),CN 将等待至一段时间(即隔离超时时间 + 内部容错余量)。一旦该时间过去,CN 即可在逻辑上证实该静默的
DN 已经执行了自隔离,并允许存储过程安全地继续向下执行。

模式状态转换(Schema State Transitions)

对于删除操作(删除表/视图),引入了 PreDeleteTsTable(预删除)标记状态。这确保了状态的安全转换,并在删除完全失败提示用户手动去删除

   1. 预期行为

引入该框架后,集群在异常条件下的预期行为如下:

DN 宕机:Schema 操作可以成功执行。CN 会等待超时,确认该节点已死亡/被隔离后继续执行后续流程。

DN 网络分区:Schema 操作在健康的多数派集群上成功执行。被分区的 DN 会执行自隔离。任何连接到该分区 DN
以执行数据/元数据操作的客户端,都将收到一个明确的异常提示,说明该节点已被隔离。

Best wishes
Yaobin Chen

Reply via email to