Hi Yaobin,

I think the overall direction of this proposal looks OK to me.

One thing I hope we can make clearer is the exact scope of this change. For
example, after introducing the lease mechanism, which operations can
actually benefit from it, and which operations are still not covered by
this mechanism? It would be helpful to make this explicit in the design and
in the PR, so reviewers and users can better understand the expected
behavior and the remaining limitations.

If you open a PR for this, I would be happy to review it.



我认为这个方案整体方向是 OK 的。

不过希望可以进一步明确一下这次功能的范围。例如,引入租约机制之后,哪些操作能够从这个机制中受益,哪些操作目前依旧还不能覆盖。把这些范围和限制在设计或
PR 中说明清楚,会更方便大家 review,也能让用户更明确地理解预期行为和剩余限制。

如果后续有 PR,我也很乐意帮忙 review。

Best regards,

------------------------
Yuan Tian

On Thu, 2 Jul 2026 18:42:46 +0800 (CST), alpass [email protected] wrote:

Hi everyone,

I would like to initiate a discussion on introducing a lease-based
self-fencing framework to guarantee High Availability (HA) for table
metadata operations in IoTDB.

   1. Background & Problem Statement

Currently, in a cluster deployment (e.g., 1 ConfigNode, 3 DataNodes),
metadata operations lack true high availability. If a single DataNode (DN)
crashes or experiences a network partition, execution of DDL procedures
(such as Create/Alter/Drop Table, View management, TTL adjustments, and
Drop Database) will directly fail and trigger a rollback.

   1. Proposed Technical Solution

To solve this, the Lease-based Self-Fencing Framework is proposed. The core
logic ensures that unresponsive nodes automatically isolate themselves,
allowing the rest of the cluster to safely proceed with schema changes.

DataNode-Side: Self-Fencing

MetadataLeaseManager: Tracks a lease updated by periodic CN heartbeats
using a monotonic clock.

Fencing Mechanism: If no heartbeat is received within
metadata_lease_fence_ms (new config, such as 20,000 ms as default), the DN
determines it is isolated and self-fences. It clears its cache and
explicitly blocks all read/write operations by throwing a exception.

ConfigNode-Side: Broadcast Coordination

DataNodeContactTracker: Precisely records the timestamp of the last
successful heartbeat response for each DN.

ClusterCachePropagator & MetadataBroadcastVerdict: When broadcasting table
metadata operation procedure, the CN includes a retry loop. If a DN is
unacknowledged, the CN waits until T_proceed (the fence duration + an
internal margin). Once the time elapses, the CN logically proves the silent
DN has self-fenced and allows the procedure to proceed safely.

Schema State Transitions

For deletion operations (Drop Table/View), a PreDeleteTsTable marker state
is introduced. This ensures safe state transitions and remind user of
deleting again when procedure fails.

   1. Expected Behavior Summary

With this framework, the expected cluster behavior under abnormal
conditions will be:

DN Down: Schema operations execute successfully. The CN waits for the
timeout, confirms the node is dead/fenced, and proceeds.

DN Network Partition: Schema operations execute successfully on the healthy
cluster majority. The partitioned DN self-fences. Any client connecting to
the partitioned DN to execute data/metadata operations will receive an
explicit exception stating the node is isolated.

I would highly appreciate your thoughts, feedback, or any suggestions you
might have on this proposed design.

   1. 背景

目前,在集群部署(例如:1 个 ConfigNode,3 个 DataNode)中,元数据操作缺乏真正的高可用性。如果单个
DataNode(DN)发生崩溃或遭遇网络分区,表 scheme 相关的DDL执行(如创建/修改/删除表、视图管理、TTL
调整以及删除数据库)将直接失败并触发回滚。

   1. 拟定技术方案

为了解决这一问题,我们提出了基于租约的自隔离框架。其核心逻辑在于确保未响应的节点能够自动进行自我隔离,从而允许集群的其余部分安全地继续进行模式(Schema)变更。

DataNode 端:自我隔离(Self-Fencing)

MetadataLeaseManager(元数据租约管理器):通过单调时钟(Monotonic Clock)跟踪由
ConfigNode(CN)定期心跳更新的租约。

隔离机制(Fencing Mechanism):如果在 metadata_lease_fence_ms(新增配置项,默认值设为 20,000
毫秒)内未收到心跳,DN 将判定自己已被隔离并执行自隔离。它会清空自身的缓存,并通过抛出异常的方式明确拦截所有的读写操作。

ConfigNode 端:广播协调(Broadcast Coordination)

DataNodeContactTracker(DataNode 状态追踪器):精准记录每个 DN 最后一次成功响应心跳的时间戳。

ClusterCachePropagator(集群缓存传播器)与
MetadataBroadcastVerdict(元数据广播裁决器):在广播缓存失效信息时,CN 会引入重试循环。如果某个 DN
未响应(Unacknowledged),CN 将等待至一段时间(即隔离超时时间 + 内部容错余量)。一旦该时间过去,CN 即可在逻辑上证实该静默的
DN 已经执行了自隔离,并允许存储过程安全地继续向下执行。

模式状态转换(Schema State Transitions)

对于删除操作(删除表/视图),引入了 PreDeleteTsTable(预删除)标记状态。这确保了状态的安全转换,并在删除完全失败提示用户手动去删除

   1. 预期行为

引入该框架后,集群在异常条件下的预期行为如下:

DN 宕机:Schema 操作可以成功执行。CN 会等待超时,确认该节点已死亡/被隔离后继续执行后续流程。

DN 网络分区:Schema 操作在健康的多数派集群上成功执行。被分区的 DN 会执行自隔离。任何连接到该分区 DN
以执行数据/元数据操作的客户端,都将收到一个明确的异常提示,说明该节点已被隔离。

Best wishes
Yaobin Chen

Reply via email to