[jira] [Commented] (IOTDB-4389) [MultiLeaderConsensus] Stop 1 datanode at a time for 10 minutes, after cluster restarts, some timeseries have more data and some have less data

Jinrui Zhang (Jira) Tue, 13 Sep 2022 05:07:04 -0700


    [ 
https://issues.apache.org/jira/browse/IOTDB-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17603540#comment-17603540
 ]


Jinrui Zhang commented on IOTDB-4389:
-------------------------------------

Before reproducing this case, I have two hypotheses about the data loss.
 # If the leader is on the DataNode that is down, write operations from the 
benchmark (BM) fail
 # The WAL is not in `Sync` mode, which can lead to data loss when the 
DataNode restarts

> [MultiLeaderConsensus] Stop 1 datanode at a time for 10 minutes, after 
> cluster restarts, some timeseries have more data and some have less data
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-4389
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4389
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Jinrui Zhang
>            Priority: Major
>         Attachments: 0913_after_cluster_restart_query.out, 
> 0913_bef_stop-cluster_with_autoflush_query.out, benchmark_down_datanode.conf, 
> down_datanode.sh, image-2022-09-13-11-09-33-609.png, screenshot-1.png
>
>
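> master_0909_bdd7ca8, 3C9D (3 ConfigNodes, 9 DataNodes)
> With 9 DataNodes, 1 DataNode is taken offline every 30 minutes, each for 10 minutes.
> After each of the 9 DataNodes has been taken offline once, no further fault operations are performed.
> After the client finished and its process exited, data correctness before and after a cluster restart was verified (70 hours later). {color:#DE350B}*After the restart, some series had lost 10 data points and others had gained 10 data points*{color}:
>  !image-2022-09-13-11-09-33-609.png! 
> {color:#DE350B}*The red boxes in the figure below mark the series with 10 extra data points:*{color}
>  !screenshot-1.png! 
> See the attachments for the detailed query results.
> Test procedure
> 1. ConfigNode machines
> 172.20.70.31 (leader): 8 cores, 32 GB
> 172.20.70.32/33: 4 cores, 16 GB
> ConfigNode configuration parameters: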
> MAX_HEAP_SIZE="8G"
> MAX_DIRECT_MEMORY_SIZE="4G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> schema_replication_factor=3
> data_replication_factor=3
> 2. DataNode machines
> 172.20.70.2/3/4/5/13/14/16/18/19: 8 cores, 32 GB
> Configuration parameters:
> MAX_HEAP_SIZE="20G"
> MAX_DIRECT_MEMORY_SIZE="6G"
> max_connection_for_internal_service=200
> wal_buffer_size_in_byte=1048576
> enable_timed_flush_seq_memtable=true
> seq_memtable_flush_interval_in_ms=3600000
> seq_memtable_flush_check_interval_in_ms=600000
> enable_timed_flush_unseq_memtable=true
> unseq_memtable_flush_interval_in_ms=3600000
> unseq_memtable_flush_check_interval_in_ms=600000
> max_waiting_time_when_insert_blocked=3600000
> query_timeout_threshold=3600000
> 3. Benchmark configuration (see the attachment)
> CLIENT_NUMBER=50
> Run the benchmark.
> 4. Run the down-DataNode script
> cat down_datanode.sh
> #!/bin/bash
> node1="172.20.70.4"
> node2="172.20.70.5"
> node3="172.20.70.3"
> node4="172.20.70.2"
> node5="172.20.70.13"
> node6="172.20.70.14"
> node7="172.20.70.16"
> node8="172.20.70.18"
> node9="172.20.70.19"
> cluster_dir="/data/iotdb"
> cur_cluster="master_0909_bdd7ca8"
> u_name="cluster"
> function down_datanode()
> {
> t=`date '+%Y-%m-%d %H:%M:%S'`
> echo "${t}"
> node=$1
> ${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show cluster"
> ${cluster_dir}/${cur_cluster}/datanode/sbin/start-cli.sh -h ${node} -e "show regions"
> ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/stop-datanode.sh"
> sleep 10m
> ssh ${u_name}@${node} "source /etc/profile;${cluster_dir}/${cur_cluster}/datanode/sbin/start-datanode.sh > /dev/null 2>&1 &"
> sleep 30m
> }
> sleep 30m
> down_datanode ${node1}
> down_datanode ${node2}
> down_datanode ${node3}
> down_datanode ${node4}
> down_datanode ${node5}
> down_datanode ${node6}
> down_datanode ${node7}
> down_datanode ${node8}
> down_datanode ${node9}
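> 5. The benchmark ran for 16.25 hours
> 6. (70 hours later) Run queries to verify correctness
> Run the query before stopping the cluster:
> select count(s_5) from root.** align by device;
> Clear the OS cache
> Restart the cluster
> select count(s_5) from root.** align by device;
> Compare the two results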
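
The comparison in step 6 could be scripted roughly as follows. This is only a sketch under assumptions: the function name `compare_counts` is hypothetical, and each input file is assumed to have already been reduced to one `<device> <count>` pair per line, which is not the raw CLI output format of the attached `.out` files.

```shell
#!/bin/bash
# compare_counts BEFORE_FILE AFTER_FILE   (hypothetical helper)
# Assumed input format: one "<device> <count>" pair per line, whitespace-separated.
# Prints one line for every device whose count differs between the two files,
# and flags devices that appear in only one of them.
compare_counts() {
  awk '
    NR == FNR { before[$1] = $2; next }      # first file: remember each count
    {
      seen[$1] = 1
      if (!($1 in before)) {
        print $1, "missing-before", "after=" $2
      } else if (before[$1] != $2) {
        print $1, "before=" before[$1], "after=" $2, "diff=" ($2 - before[$1])
      }
    }
    END {
      for (d in before)                      # devices that vanished after restart
        if (!(d in seen)) print d, "before=" before[d], "missing-after"
    }
  ' "$1" "$2"
}
```

Called as `compare_counts before.txt after.txt`, a negative `diff` would indicate points lost after the restart and a positive one extra points.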



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
