[ 
https://issues.apache.org/jira/browse/IOTDB-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinrui Zhang reassigned IOTDB-4652:
-----------------------------------

    Assignee: 张洪胤  (was: Jinrui Zhang)

Please take a look at this issue.

 

Because some errors occurred during this test, we need to confirm whether the 
data loss was caused by these errors.

So let's start a new test to see whether this issue can be reproduced.
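
When re-running the test, the per-replica comparison could be scripted along 
the following lines. This is only a sketch: the helper name 
find_lagging_replicas is hypothetical, the counts are the ones reported in the 
issue for root.test.g_13.d_1013, and it assumes the highest count among the 
replicas is the correct one (no replica holds extra or duplicated points).

```python
from typing import Dict, List, Tuple


def find_lagging_replicas(counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (replica, missing_points) for every replica below the highest count.

    Assumption: the replica with the most points is treated as the reference.
    """
    if not counts:
        return []
    expected = max(counts.values())
    return sorted(
        (host, expected - n) for host, n in counts.items() if n < expected
    )


# Point counts reported in the issue, one entry per replica.
reported = {"192.168.10.62": 100, "192.168.10.66": 100, "192.168.10.68": 94}
print(find_lagging_replicas(reported))
```

The counts themselves would still come from running the same count(s_0) query 
against each DataNode, as in the original report.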

 

In parallel, we can investigate the logs from the current environment to try 
to find the cause of the data loss.
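
As a starting point for the log investigation, something like the sketch below 
could count the "Can not sync logs to peer" errors per region and target peer. 
The regular expression is modelled only on the single ERROR line quoted in the 
issue, so treating it as valid for other log layouts is an assumption, and 
count_sync_failures is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the one ERROR line quoted in the issue (assumption:
# other occurrences in the DataNode log follow the same layout).
SYNC_ERROR = re.compile(
    r"Can not sync logs to\s+peer\s+Peer\{groupId=(?P<group>[^,]+),\s*"
    r"endpoint=TEndPoint\(ip:(?P<ip>[\d.]+),\s*port:(?P<port>\d+)\)\}"
)


def count_sync_failures(log_text: str) -> Counter:
    """Count sync failures keyed by (consensus group, 'ip:port' of the peer)."""
    failures = Counter()
    for match in SYNC_ERROR.finditer(log_text):
        peer = f"{match['ip']}:{match['port']}"
        failures[(match["group"], peer)] += 1
    return failures


# The ERROR line from the issue, joined back into a single log line.
sample = (
    "2022-10-14 10:55:02,593 [pool-96-IoTDB-LogDispatcher-DataRegion[66]-2] ERROR "
    "o.a.i.c.m.l.LogDispatcher$LogDispatcherThread:415 - Can not sync logs to peer "
    "Peer{groupId=DataRegion[66], endpoint=TEndPoint(ip:192.168.10.68, port:40010)} "
    "because\n"
)
print(count_sync_failures(sample))
```

Comparing the timestamps of these errors with the write times of the missing 
points may help show whether the two are related.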

> [ MultiLeaderConsensus ] The data on the replicas is inconsistent
> -----------------------------------------------------------------
>
>                 Key: IOTDB-4652
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4652
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: 张洪胤
>            Priority: Major
>         Attachments: image-2022-10-14-16-04-28-847.png, 
> image-2022-10-14-16-13-37-165.png
>
>
> master_1013_00dc222
> schema : ratis
> data : multiLeader
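> 3 replicas, 3C3D (3 ConfigNodes, 3 DataNodes)
> The benchmark (bm) write completed (all writes reported as successful), followed by a flush.
> Querying the data shows that {color:#DE350B}*the data is inconsistent across replicas*{color}.
> Query ip68 (final state: leader of this DataRegion[66]):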
> ./sbin/start-cli.sh -h 192.168.10.68 -e "select count(s_0) from 
> root.test.g_13.d_1013"
> 6 data points are missing.
>  !image-2022-10-14-16-04-28-847.png! 
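> Analysis of the data for device root.test.g_13.d_1013 on ip68/ip62/ip66:
> ip68: 94 points, 6 missing
> ip62: 100 points, correct
> ip66: 100 points, correct
> ip66 had been the leader (with relatively little data written to it directly). When ip66 
> synced this region's data to ip68, an ERROR occurred ({color:#DE350B}*question: if a sync failure is unavoidable, will the sync be retried later?*{color}):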
> 2022-10-14 10:55:02,593 [pool-96-IoTDB-LogDispatcher-DataRegion[66]-2] ERROR o.a.i.c.m.l.LogDispatcher$LogDispatcherThread:415 - Can not sync logs to peer Peer{groupId=DataRegion[66], endpoint=TEndPoint(ip:192.168.10.68, port:40010)} because
> java.io.IOException: Borrow client from pool for node TEndPoint(ip:192.168.10.68, port:40010) failed.
>         at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:61)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.sendBatchAsync(LogDispatcher.java:404)
>         at org.apache.iotdb.consensus.multileader.logdispatcher.LogDispatcher$LogDispatcherThread.run(LogDispatcher.java:289)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.NoSuchElementException: Timeout waiting for idle object, borrowMaxWaitMillis=10000
>         at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:453)
>         at org.apache.commons.pool2.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:350)
>         at org.apache.iotdb.commons.client.ClientManager.borrowClient(ClientManager.java:50)
>         ... 7 common frames omitted
> Also note that ip66 logged a Ratis error reporting a detected off-heap (direct) memory leak:
> 2022-10-14 10:39:26,022 [grpc-default-worker-ELG-3-40] ERROR o.a.r.t.i.n.u.ResourceLeakDetector:319 - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records:
> Created at:
>         org.apache.ratis.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:401)
>         org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
>         org.apache.ratis.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
>         org.apache.ratis.thirdparty.io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
>         org.apache.ratis.thirdparty.io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
>         org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
>         org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:780)
>         org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>         org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>         org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>         org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         java.lang.Thread.run(Thread.java:748)
> Test environment
> 1. 192.168.10.62/66/68: physical machines, 72 CPUs, 256 GB
> bm runs on ip64; see the attachment for its configuration
> ConfigNode 
> MAX_HEAP_SIZE="16G"
> MAX_DIRECT_MEMORY_SIZE="8G"
>  
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> schema_replication_factor=3
> data_replication_factor=3
> connection_timeout_ms=1200000
> DataNode
> MAX_HEAP_SIZE="192G"
> MAX_DIRECT_MEMORY_SIZE="32G"
> connection_timeout_ms=1200000
> max_waiting_time_when_insert_blocked=3600000
> query_timeout_threshold=36000000
> enable_auto_create_schema=false
> 2. bm write
> See the attachment for the configuration
>  !image-2022-10-14-16-13-37-165.png! 
> 3. Query, verify data correctness, analyze the results, and analyze the cluster logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
