[ https://issues.apache.org/jira/browse/IOTDB-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinrui Zhang reassigned IOTDB-4731:
-----------------------------------

    Assignee: Haiming Zhu  (was: Jinrui Zhang)

> [Remove-DataNode] Data is inconsistent (DataNode removed before
> synchronization completed)
> --------------------------------------------------------------------------------------------------
>
>                 Key: IOTDB-4731
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4731
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Haiming Zhu
>            Priority: Major
>         Attachments: aft_set_readonlyip68_regions.out, 
> bef_set_ip68readonly_regions.out, image-2022-10-24-15-13-45-086.png, 
> image-2022-10-24-15-22-17-729.png, 
> ip64-is-newpeer_leader_g_4_q_after_remove.out, ip66_g_4_q_after_remove.out, 
> more_ts.conf
>
>
> master_1023_2fea011
> 3 replicas, 3C3D; the benchmark write has finished.
> Start the fourth DataNode (ip64).
> On ip68, run: SET SYSTEM TO READONLY ON LOCAL
> Remove DataNode ip68.
> Before the removal, ip68 is the leader of DataRegion[14] and still holds
> unsynchronized data:
>  !image-2022-10-24-15-13-45-086.png! 
> While ip68 is in the Removing state, its DataNode log shows these errors:
> 2022-10-24 14:18:18,092 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.d.w.n.WALNode$PlanNodeIterator:590 - Fail to read wal from wal file /data/liuzhen_test/master_1023_2fea011/sbin/../data/datanode/wal/root.test.g_4-14/_150-200-1.wal, skip this file.
> java.nio.channels.ClosedByInterruptException: null
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:315)
>         at org.apache.iotdb.db.wal.io.WALByteBufReader.<init>(WALByteBufReader.java:47)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:552)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2022-10-24 14:18:18,093 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.d.w.n.WALNode$PlanNodeIterator:590 - Fail to read wal from wal file /data/liuzhen_test/master_1023_2fea011/sbin/../data/datanode/wal/root.test.g_4-14/_151-204-1.wal, skip this file.
> java.nio.channels.ClosedByInterruptException: null
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:315)
>         at org.apache.iotdb.db.wal.io.WALByteBufReader.<init>(WALByteBufReader.java:47)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:552)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2022-10-24 14:18:18,093 [pool-49-IoTDB-LogDispatcher-DataRegion[14]-3] ERROR o.a.i.c.m.l.LogDispatcher$LogDispatcherThread:294 - Unexpected error in logDispatcher for peer Peer{groupId=DataRegion[14], endpoint=TEndPoint(ip:192.168.10.64, port:40010), nodeId=6}
> java.lang.ArrayIndexOutOfBoundsException: 29
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:530)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.hasNext(WALNode.java:597)
>         at org.apache.iotdb.db.wal.node.WALNode$PlanNodeIterator.next(WALNode.java:683)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.constructBatchFromWAL(LogDispatcher.java:438)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.getBatch(LogDispatcher.java:348)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:274)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> After ip68 is removed, query DataRegion[14]:
> "select count(s_0) from root.test.g_4.** align by device"
> ip64 (DataRegion[14] leader): 100 points per sensor
> Stop the ip64 DataNode and run the same query again:
> "select count(s_0) from root.test.g_4.** align by device"
> ip66 (48 rows with fewer than 100 points):
>  !image-2022-10-24-15-22-17-729.png! 
> Test environment
> 1. 192.168.10.62 / 66 / 68: 72 CPU, 256 GB
> benchmark: ip64 /data/liuzhen_test/weektest/benchmark_tool
> ConfigNode
> MAX_HEAP_SIZE="8G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> schema_replication_factor=3
> data_replication_factor=3
> DataNode
> MAX_HEAP_SIZE="192G"
> MAX_DIRECT_MEMORY_SIZE="32G"
> query_timeout_threshold=36000000
> 2. Benchmark configuration:
> see the attachment.
> 3. After the benchmark write has finished:
> ip68 cli: "flush"
> ip68 cli: SET SYSTEM TO READONLY ON LOCAL
> Start the ip64 DataNode.
> remove-datanode.sh "ip68's NodeID"
> 4. View the ip68 DataNode logs.
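
A note on the ClosedByInterruptException in the traces above: the JDK raises it when a thread whose interrupt status is set performs I/O on an interruptible channel such as FileChannel, and it closes the channel as a side effect. Whether the LogDispatcher thread was actually interrupted as part of the remove-datanode procedure is an assumption here; the report does not confirm it. The following is a minimal Java sketch of the JDK behaviour only, not IoTDB code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; it only illustrates standard JDK channel semantics.
public class InterruptedChannelDemo {

    /**
     * Opens a FileChannel on the given file, sets the current thread's
     * interrupt flag, and reports whether size() then fails with
     * ClosedByInterruptException.
     */
    public static boolean sizeFailsWhenInterrupted(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Simulate an interrupt arriving at the reader thread before the I/O call.
            Thread.currentThread().interrupt();
            try {
                channel.size();
                return false;
            } catch (ClosedByInterruptException e) {
                // The channel has been closed by the JDK at this point.
                return true;
            } finally {
                // Clear the interrupt flag so later I/O on this thread is unaffected.
                Thread.interrupted();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".wal");
        try {
            System.out.println("size() failed with ClosedByInterruptException: "
                    + sizeFailsWhenInterrupted(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the dispatcher thread was interrupted in this way, every WAL file it tried to open afterwards would fail the same way, which would match the repeated "skip this file" errors logged within the same millisecond.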



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
