[jira] [Created] (IOTDB-4389) [MultiLeaderConsensus] Stop the datanode and lose data after restarting the cluster

Jira Mon, 12 Sep 2022 20:13:06 -0700

刘珍 created IOTDB-4389:
-------------------------

             Summary: [MultiLeaderConsensus] Stop the datanode and lose data 
after restarting the cluster
                 Key: IOTDB-4389
                 URL: https://issues.apache.org/jira/browse/IOTDB-4389
             Project: Apache IoTDB
          Issue Type: Bug
          Components: mpp-cluster
    Affects Versions: 0.14.0-SNAPSHOT
            Reporter: 刘珍
            Assignee: Jinrui Zhang
         Attachments: 0913_after_cluster_restart_query.out, 
0913_bef_stop-cluster_with_autoflush_query.out, benchmark_down_datanode.conf, 
down_datanode.sh, image-2022-09-13-11-09-33-609.png


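master_0909_bdd7ca8, 3C9D (3 ConfigNodes, 9 DataNodes)
Across the 9 DataNodes, one DataNode is taken offline every 30 minutes, and each outage lasts 10 minutes.
After each of the 9 DataNodes has been taken offline once, no further fault operations are performed.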

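After the client finished and its process exited, data correctness was verified (70 hours later) by comparing query results from before and after a cluster restart: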
 !image-2022-09-13-11-09-33-609.png! 

{color:#DE350B}*After the restart, some series are missing 10 data points.*{color}
See the attachments for the detailed query results.

Test procedure
1. ConfigNode machines
172.20.70.31 (leader): 8 cores, 32 GB
172.20.70.32/33: 4 cores, 16 GB

ConfigNode configuration parameters:
MAX_HEAP_SIZE="8G"
MAX_DIRECT_MEMORY_SIZE="4G"

schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
schema_replication_factor=3
data_replication_factor=3

See the attachments for the jstack output.

2. DataNode machines
172.20.70.2/3/4/5/13/14/16/18/19: 8 cores, 32 GB
Configuration parameters:
MAX_HEAP_SIZE="20G"
MAX_DIRECT_MEMORY_SIZE="6G"

max_connection_for_internal_service=200
wal_buffer_size_in_byte=1048576
enable_timed_flush_seq_memtable=true
seq_memtable_flush_interval_in_ms=3600000
seq_memtable_flush_check_interval_in_ms=600000

enable_timed_flush_unseq_memtable=true
unseq_memtable_flush_interval_in_ms=3600000
unseq_memtable_flush_check_interval_in_ms=600000

max_waiting_time_when_insert_blocked=3600000
query_timeout_threshold=3600000

3. Benchmark configuration: see the attachments
CLIENT_NUMBER=50
Run the benchmark.
4. Run the script that takes the DataNodes down
cat down_datanode.sh
#!/bin/bash
node1="172.20.70.4"
node2="172.20.70.5"
node3="172.20.70.3"
node4="172.20.70.2"
node5="172.20.70.13"
node6="172.20.70.14"
node7="172.20.70.16"
node8="172.20.70.18"
node9="172.20.70.19"

cluster_dir="/data/iotdb"
cur_cluster="master_0909_bdd7ca8"
u_name="cluster"

# Stop one DataNode for 10 minutes, restart it, then wait 30 minutes.
function down_datanode()
{
t=$(date '+%Y-%m-%d %H:%M:%S')
echo "${t}"
node=$1
# Record the cluster and region state before the node goes down.
${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show cluster"
${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show regions"
ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/stop-datanode.sh"
sleep 10m
ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/start-datanode.sh > /dev/null 2>&1 &"
sleep 30m

}
sleep 30m
down_datanode ${node1}
down_datanode ${node2}
down_datanode ${node3}
down_datanode ${node4}
down_datanode ${node5}
down_datanode ${node6}
down_datanode ${node7}
down_datanode ${node8}
down_datanode ${node9}
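
As a rough check on the schedule above, the sleep intervals in the script can be added up to get a lower bound on how long the fault injection takes. This is only a sketch: the variable names are illustrative, and the time spent in the CLI calls and in the stop/start scripts is not counted.

```shell
#!/bin/bash
# Illustrative names; the values are taken from the sleep calls in down_datanode.sh.
initial_wait_min=30   # "sleep 30m" before the first outage
outage_min=10         # "sleep 10m" while a DataNode is stopped
recovery_min=30       # "sleep 30m" after a DataNode is restarted
node_count=9          # node1 .. node9

# Lower bound on the total run time of the script, in minutes.
total_min=$(( initial_wait_min + node_count * (outage_min + recovery_min) ))
echo "fault schedule: ${total_min} minutes ($(( total_min / 60 ))h $(( total_min % 60 ))m)"
```

Under these assumptions the schedule takes at least 390 minutes (6 h 30 m).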

5. The benchmark ran for 16.25 hours

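6. (70 hours later) Run queries to verify correctness
Before stopping the cluster, run the query:
select count(s_5) from root.** align by device;
Clear the OS cache
Restart the cluster
select count(s_5) from root.** align by device;
Compare the two results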
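
The comparison in step 6 could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used: `compare_counts` is a hypothetical helper, and it assumes that each data row in the saved query output has the form `|<device>|<count>|`, which may not match the attached files exactly.

```shell
#!/bin/bash
# Hypothetical helper: print devices whose count differs between two saved outputs.
# Assumed row format: "|<device>|<count>|"; borders, headers and summary lines are skipped.
compare_counts() {
  local before_file=$1 after_file=$2
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
      dev = trim($2); cnt = trim($3)
      # Skip every line whose third field is not a plain number.
      if (cnt !~ /^[0-9]+$/) next
    }
    # First file: remember the count of each device before the restart.
    NR == FNR { before[dev] = cnt; next }
    # Second file: report devices that are new or whose count changed.
    {
      seen[dev] = 1
      if (!(dev in before)) print dev, "missing_before", cnt
      else if (before[dev] != cnt) print dev, before[dev], cnt, before[dev] - cnt
    }
    # Devices present before the restart but absent afterwards.
    END { for (d in before) if (!(d in seen)) print d, before[d], "missing_after" }
  ' "$before_file" "$after_file"
}
```

For example, `compare_counts 0913_bef_stop-cluster_with_autoflush_query.out 0913_after_cluster_restart_query.out` would print one line per affected device with the count before, the count after, and the difference, provided the attachments follow the assumed row format.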



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
