[
https://issues.apache.org/jira/browse/IOTDB-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603540#comment-17603540
]
Jinrui Zhang commented on IOTDB-4389:
-------------------------------------
Before reproducing this case, I have two assumptions regarding the data loss.
# If the leader is on the DataNode that is down, write operations from the
benchmark (BM) fail
# WAL is not in `Sync` mode, which can lead to data loss when the DataNode
restarts
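
The second assumption can be illustrated with a toy model. This is only a sketch of the general write-ahead-log idea, not IoTDB's WAL implementation: `ToyWal`, its `sync` flag, `buffer_limit`, and `crash_and_recover` are all hypothetical names. It shows that entries still held in an in-memory buffer when the process stops cannot be recovered after a restart.

```python
class ToyWal:
    """Toy write-ahead log: an entry is durable only once it has been flushed."""

    def __init__(self, sync, buffer_limit=4):
        self.sync = sync                  # True: flush on every append
        self.buffer_limit = buffer_limit  # hypothetical flush threshold for non-sync mode
        self.buffer = []                  # in memory only, lost when the process stops
        self.durable = []                 # stands in for entries persisted on disk

    def append(self, entry):
        self.buffer.append(entry)
        if self.sync or len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        self.durable.extend(self.buffer)
        self.buffer.clear()

    def crash_and_recover(self):
        """Drop the in-memory buffer and return what a restart could replay."""
        self.buffer.clear()
        return list(self.durable)


if __name__ == "__main__":
    for sync in (True, False):
        wal = ToyWal(sync=sync)
        for point in range(10):
            wal.append(point)
        recovered = wal.crash_and_recover()
        print(f"sync={sync}: recovered {len(recovered)} of 10 entries")
```

In this model the non-sync log loses whatever was appended after the last flush, which is the kind of loss the assumption describes.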
> [MultiLeaderConsensus] Stop 1 datanode at a time for 10 minutes, after
> cluster restarts, some timeseries have more data and some have less data
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: IOTDB-4389
> URL: https://issues.apache.org/jira/browse/IOTDB-4389
> Project: Apache IoTDB
> Issue Type: Bug
> Components: mpp-cluster
> Affects Versions: 0.14.0-SNAPSHOT
> Reporter: 刘珍
> Assignee: Jinrui Zhang
> Priority: Major
> Attachments: 0913_after_cluster_restart_query.out,
> 0913_bef_stop-cluster_with_autoflush_query.out, benchmark_down_datanode.conf,
> down_datanode.sh, image-2022-09-13-11-09-33-609.png, screenshot-1.png
>
>
> master_0909_bdd7ca8 , 3C9D
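> 9 DataNodes; at 30-minute intervals, 1 DataNode is taken offline at a time and stays offline for 10 minutes.
> After each of the 9 DataNodes has been taken offline once, no further fault operations are performed.
> After the client finished running and its process exited,
> data correctness was verified (70 hours later) before and after a cluster restart. {color:#DE350B}*After the restart, some series have lost 10 data points and some have 10 extra data points.*{color}: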
> !image-2022-09-13-11-09-33-609.png!
> {color:#DE350B}*The red box in the figure below marks 10 extra data points:*{color}
> !screenshot-1.png!
> See the attachments for the detailed query results.
> Test procedure
> 1. ConfigNode machines
> 172.20.70.31(leader) 8 cores, 32 GB
> 172.20.70.32/33 4 cores, 16 GB
> ConfigNode configuration parameters:
> MAX_HEAP_SIZE="8G"
> MAX_DIRECT_MEMORY_SIZE="4G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> schema_replication_factor=3
> data_replication_factor=3
> 2. DataNode machines
> 172.20.70.2/3/4/5/13/14/16/18/19 8 cores, 32 GB
> Configuration parameters:
> MAX_HEAP_SIZE="20G"
> MAX_DIRECT_MEMORY_SIZE="6G"
> max_connection_for_internal_service=200
> wal_buffer_size_in_byte=1048576
> enable_timed_flush_seq_memtable=true
> seq_memtable_flush_interval_in_ms=3600000
> seq_memtable_flush_check_interval_in_ms=600000
> enable_timed_flush_unseq_memtable=true
> unseq_memtable_flush_interval_in_ms=3600000
> unseq_memtable_flush_check_interval_in_ms=600000
> max_waiting_time_when_insert_blocked=3600000
> query_timeout_threshold=3600000
> 3. For the benchmark configuration, see the attachment
> CLIENT_NUMBER=50
> Run the benchmark.
> 4. Run the down datanode script
> cat down_datanode.sh
> #!/bin/bash
> node1="172.20.70.4"
> node2="172.20.70.5"
> node3="172.20.70.3"
> node4="172.20.70.2"
> node5="172.20.70.13"
> node6="172.20.70.14"
> node7="172.20.70.16"
> node8="172.20.70.18"
> node9="172.20.70.19"
> cluster_dir="/data/iotdb"
> cur_cluster="master_0909_bdd7ca8"
> u_name="cluster"
> function down_datanode()
> {
> t=`date '+%Y-%m-%d %H:%M:%S'`
> echo "${t}"
> node=$1
> ${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show cluster"
> ${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show regions"
> ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/stop-datanode.sh"
> sleep 10m
> ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/start-datanode.sh > /dev/null 2>&1 &"
> sleep 30m
> }
> sleep 30m
> down_datanode ${node1}
> down_datanode ${node2}
> down_datanode ${node3}
> down_datanode ${node4}
> down_datanode ${node5}
> down_datanode ${node6}
> down_datanode ${node7}
> down_datanode ${node8}
> down_datanode ${node9}
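> 5. The benchmark ran for 16.25 hours
> 6. (70 hours later) Run queries to verify correctness
> Run the query before stopping the cluster
> select count(s_5) from root.** align by device;
> Clear the OS cache
> Restart the cluster
> select count(s_5) from root.** align by device;
> Compare the two results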
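
The before/after comparison in step 6 could be scripted roughly as follows. This is a minimal sketch, assuming each saved query output contains CLI-style table rows of the form `|device|count|`; the actual format of the attached `.out` files is not shown in this issue, so `parse_counts` and `diff_counts` are hypothetical helpers, and the device names and counts in the demo are made-up sample data.

```python
def parse_counts(text):
    """Parse rows like '|root.sg.d1|100|' into {device: count}.

    Assumed format: any line that does not have exactly two cells, or whose
    second cell is not an integer (borders, header row), is skipped.
    """
    counts = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        device, value = cells
        if not value.isdigit():
            continue
        counts[device] = int(value)
    return counts


def diff_counts(before, after):
    """Return {device: after - before} for every device whose count changed."""
    diffs = {}
    for device in sorted(set(before) | set(after)):
        delta = after.get(device, 0) - before.get(device, 0)
        if delta != 0:
            diffs[device] = delta
    return diffs


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the comparison.
    before_text = """
+----------+----------+
|    Device|count(s_5)|
+----------+----------+
|root.sg.d1|       100|
|root.sg.d2|       100|
|root.sg.d3|       100|
+----------+----------+
"""
    after_text = before_text.replace("|root.sg.d1|       100|", "|root.sg.d1|        90|")
    for device, delta in diff_counts(parse_counts(before_text), parse_counts(after_text)).items():
        print(f"{device}: {delta:+d}")
```

A negative delta marks a series that lost points after the restart, and a positive delta marks a series that gained points.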