slfan1989 commented on PR #7542:
URL: https://github.com/apache/ozone/pull/7542#issuecomment-2563230571

   > > I will conduct a deeper study of the Legacy Replication Manager and then 
follow up with you for further discussion.
   > 
   > The Legacy RM is set to be removed. I'd suggest getting onto the new RM 
which solves a lot of problems with the original design and see how it 
performs. We have scope to make it better if it is not working much faster.
   
   @sodonnel 
   
   Thank you for the suggestion! I reviewed the `ReplicationManager` Related 
code and Configurations and identified two key parameters:
   
   - hdds.scm.replication.inflight.limit.factor
   - hdds.scm.replication.datanode.replication.limit
   
   We have made appropriate adjustments to these parameters based on our 
cluster's situation, and they have proven helpful for us.
   
   However, I have encountered a new issue that may require further 
optimization. 
   
   I found that some machines are unable to enter the decommissioned state.  
Taking bigdata-ozone071 as an example, the analysis process is as follows:
   
   - Step1. Analyze the SCM logs.
   
   tail -f ozone-hadoop-scm-bigdata-ozonemaster.log|grep 
'sufficientlyReplicated'|grep 'bigdata-ozone071'
   
   ```
   2024-12-27 09:10:49,930 [scm20-DatanodeAdminManager-0] INFO 
org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: 
87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx) has 35604 
sufficientlyReplicated, 0 deleting, 
   65 underReplicated and 18 unclosed containers
   ```
   
   - Step2. Open the SCM DatanodeAdminMonitorImpl DEBUG logs.
   
   ```
   2024-12-26 19:48:49,834 [scm20-DatanodeAdminManager-0] DEBUG 
org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: 
   87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx) 
   has 65 underReplicated [#2880767, #1047176, #2882549, #196344, #1048593, 
#2884552, #1049945, #1049979, #1050457, #1050467, #1050603, #1050772, #2951936, 
#1051434, #1051630, #1051875, #1052169, #1052236, #200415, #1052722, #1052786, 
#3477663, #1053006, #1053322, #2953943, #2889201, #2954850, #1055567, #1055648, 
#1056624, #1056767, #1057345, #2892835, #1058334, #1058391, #1058539, #2567657, 
#81316, #2965464, #82312, #2966641, #2967125, #2772683, #86423, #89872, 
#2908872, #2976269, #2910782, #92979, #2978544, #162430, #2915684, #2983752, 
#2987173, #3316167, #2988603, #107475, #173256, #2992090, #2929304, #3001030, 
#2870577, #2870715, #3002129, #2875862] 
   and 18 unclosed [#3536258, #3539106, #3539454, #3541506, #3546006, #3547041, 
#3549110, #2511453, #3561382, #2454668, #3513040, #3515030, #3516597, #3516577, 
#3516647, #3523054, #3529943, #3533790] containers
   ```
   
   - Step3. SCM ozone admin container info 3533790
   
   I found that ReplicaIndex: 8 has 35 unhealthy replicas.
   
   ```
   Container id: 3533790
   Pipeline id: fc36f759-9ba2-426b-aa0e-31ef36c1e90b
   Write PipelineId: 7ae9e4bd-a60c-46ce-919f-8988f4b90771
   Write Pipeline State: CLOSED
   Container State: CLOSED
   Datanodes: [d15329a6-fa24-49b6-ab7a-f4b8ec879032/bigdata-ozone3385,
   0467a91b-9370-4ce8-a072-a43be3d357cd/bigdata-ozone3358,
   ba7af4ce-60b8-49ac-89c7-7566723d9c80/bigdata-ozone3040,
   e65cdf8b-56ee-4f33-992e-e9efaeeb83d1/bigdata-ozone3458,
   1d526c91-150c-45e4-938f-e1d0fef397d9/bigdata-ozone3185,
   b1001902-3e93-4ce0-9682-a95fc760c38b/bigdata-ozone3540,
   ......
   87f51602-740b-4658-aade-0d01525460b7/bigdata-ozone071]
   Replicas: [State: CLOSED; ReplicaIndex: 1; Origin: 
bd44548c-b261-4450-9132-ba356799cddd; Location: 
bd44548c-b261-4450-9132-ba356799cddd/bigdata-ozone2610,
   State: CLOSED; ReplicaIndex: 2; Origin: 
35fe9bbb-369d-44c5-b35e-c0cda545d64f; Location: 
35fe9bbb-369d-44c5-b35e-c0cda545d64f/bigdata-ozone3316,
   State: CLOSED; ReplicaIndex: 3; Origin: 
11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9; Location: 
11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9/bigdata-ozone3342,
   State: CLOSED; ReplicaIndex: 4; Origin: 
2e42b239-eda4-43fb-ad3b-4af5191fd496; Location: 
2e42b239-eda4-43fb-ad3b-4af5191fd496/bigdata-ozone2721,
   State: CLOSED; ReplicaIndex: 5; Origin: 
06b0bfdd-d7c4-427a-ab90-58e680af21d1; Location: 
06b0bfdd-d7c4-427a-ab90-58e680af21d1/bigdata-ozone3006,
   State: CLOSED; ReplicaIndex: 6; Origin: 
b50ce961-3425-43b6-b174-ece51b76f088; Location: 
b50ce961-3425-43b6-b174-ece51b76f088/bigdata-ozone3351,
   State: CLOSED; ReplicaIndex: 7; Origin: 
b2a2658e-d61c-4eab-a701-6defb5ce6f52; Location: 
b2a2658e-d61c-4eab-a701-6defb5ce6f52/bigdata-ozone3552,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
d15329a6-fa24-49b6-ab7a-f4b8ec879032; Location: 
d15329a6-fa24-49b6-ab7a-f4b8ec879032/bigdata-ozone3385,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
0467a91b-9370-4ce8-a072-a43be3d357cd; Location: 
0467a91b-9370-4ce8-a072-a43be3d357cd/bigdata-ozone3358,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
ba7af4ce-60b8-49ac-89c7-7566723d9c80; Location: 
ba7af4ce-60b8-49ac-89c7-7566723d9c80/bigdata-ozone3040,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
e65cdf8b-56ee-4f33-992e-e9efaeeb83d1; Location: 
e65cdf8b-56ee-4f33-992e-e9efaeeb83d1/bigdata-ozone3458,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
1d526c91-150c-45e4-938f-e1d0fef397d9; Location: 
1d526c91-150c-45e4-938f-e1d0fef397d9/bigdata-ozone3185,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
b1001902-3e93-4ce0-9682-a95fc760c38b; Location: 
b1001902-3e93-4ce0-9682-a95fc760c38b/bigdata-ozone3540,
   .......
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
87f51602-740b-4658-aade-0d01525460b7; Location: 
87f51602-740b-4658-aade-0d01525460b7/bigdata-ozone071,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
881d3593-d6d1-4010-8b41-d1205c0551e3; Location: 
881d3593-d6d1-4010-8b41-d1205c0551e3/bigdata-ozone3315,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
e3b3b130-9241-4ad7-beee-9a8015aeccd7; Location: 
e3b3b130-9241-4ad7-beee-9a8015aeccd7/bigdata-ozone2746,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
9d9b4a76-664d-4aa3-9b8e-60ed7016002b; Location: 
9d9b4a76-664d-4aa3-9b8e-60ed7016002b/bigdata-ozone3197,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
335ee024-79d4-4ec5-bd5c-46b4c90aa2d6; Location: 
335ee024-79d4-4ec5-bd5c-46b4c90aa2d6/bigdata-ozone3399,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
fef6e5df-6742-4615-827c-3ebf18136541; Location: 
fef6e5df-6742-4615-827c-3ebf18136541/bigdata-ozone3420,
   State: UNHEALTHY; ReplicaIndex: 8; Origin: 
c5cf4fb3-dc3c-4587-984c-8124d52f895a; Location: 
c5cf4fb3-dc3c-4587-984c-8124d52f895a/bigdata-ozone2780,
   State: CLOSED; ReplicaIndex: 9; Origin: 
d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c; Location: 
d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c/bigdata-ozone3080
   ```
   
   - Step4. Select a specific DN to continue tracking, and I will continue to 
check the issue with bigdata-ozone071.
   
   - dn auditlog
   
   ```
   2024-12-16 09:51:39,121 | INFO  | DNAudit | op=CREATE_CONTAINER | 
{containerID=3533790, containerType=KeyValueContainer} | ret=SUCCESS |
   ```
   
   - dn log
   
   ```
   2024-12-16 09:51:38,710 [nullContainerReplicationThread-1] INFO 
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask:
 IN_PROGRESS reconstructECContainersCommand: containerID=3533790, 
replication=rs-6-3-1024k, missingIndexes=[8], 
   
sources={1=bd44548c-b261-4450-9132-ba356799cddd(bigdata-ozone2610/xx.xx.xxx.xx),
 2=35fe9bbb-369d-44c5-b35e-c0cda545d64f(bigdata-ozone3316/xx.xx.xxx.xx), 
3=11af1de2-bf2e-49a6-bc78-37ff4bc0c8b9(bigdata-ozone3342/xx.xx.xxx.xx), 
4=2e42b239-eda4-43fb-ad3b-4af5191fd496(bigdata-ozone2721/xx.xx.xxx.xx), 
5=06b0bfdd-d7c4-427a-ab90-58e680af21d1(bigdata-ozone3006/xx.xx.xx.xx), 
6=b50ce961-3425-43b6-b174-ece51b76f088(bigdata-ozone3351/xx.xx.xx.xx), 
7=b2a2658e-d61c-4eab-a701-6defb5ce6f52(bigdata-ozone3552/10.93.41.17), 
9=d9fc6568-28ea-4070-b12c-ec1f7b3f1e0c(bigdata-ozone3080/xx.xx.xx.xx)}, 
targets={8=87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx)}
   2024-12-16 09:51:39,113 [nullContainerReplicationThread-1] ERROR 
org.apache.hadoop.hdds.scm.XceiverClientGrpc: Failed to execute command 
CreateContainer on the pipeline Pipeline[ Id: 
87f51602-740b-4658-aade-0d01525460b7, Nodes: 
87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx), 
excludedSet: 
87f51602-740b-4658-aade-0d01525460b7(bigdata-ozone071/xx.xx.xxx.xx), 
ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-12-16T09:51:39.070209248+08:00].
   2024-12-16 09:51:39,113 [nullContainerReplicationThread-1] WARN 
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator:
 Exception while reconstructing the container 3533790. Cleaning up all the 
recovering containers in the reconstruction process.
   java.io.IOException: java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: 
deadline exceeded after 0.029807841s. [closed=[], open=[[buffered_nanos=331949, 
remote_addr=xx.xx.xxx.xx/xx.xx.xxx.xx:9859]]]
           at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:443)
           at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:342)
           at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:169)
           at 
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
           at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:337)
           at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:318)
           at 
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createContainer(ContainerProtocolCalls.java:542)
           at 
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.createRecoveringContainer(ContainerProtocolCalls.java:495)
           at 
org.apache.hadoop.ozone.container.ec.reconstruction.ECContainerOperationClient.createRecoveringContainer(ECContainerOperationClient.java:178)
           at 
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:190)
           at 
org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
           at 
org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:369)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
           at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
           at java.base/java.lang.Thread.run(Thread.java:833)
   Caused by: java.util.concurrent.ExecutionException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: 
deadline exceeded after 0.029807841s. 
   [closed=[], open=[[buffered_nanos=331949, 
remote_addr=xx.xx.xxx.xx/xx.xx.xx.xx:9859]]]
           at 
java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
           at 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
           at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:412)
           ... 14 more
   ```
   
   - Step5.  Find the container 3533790 container.
   
   ll 
/data*/ozonedata/hddsdata/hdds/CID-7e95e103-a270-4948-a4a3-4f63a8a60d0c/current/containerDir*|grep
 "3533790"
   
   ```
   
/data5/ozonedata/hddsdata/hdds/CID-7e95e103-a270-4948-a4a3-4f63a8a60d0c/current/containerDir245:
   total 0
   drwxrwxr-x 4 hadoop hadoop 48 Jun 16  2024 125476
   .....
   drwxrwxr-x 4 hadoop hadoop 48 Dec 16 09:51 3533790
   ```
   
   - Step6. Check the status of the container 3533790.
   tree 3533790
   
   ```
   ./3533790
   ├── chunks
   └── metadata
       ├── 3533790.container
       └── db
           ├── block_data.data
           ├── deleted_blocks.data
           ├── delete_txns.data
           └── metadata.data
   ```
   
   I checked the contents of 3533790.container and confirmed that this 
container is unhealthy. The cause of 3533790 was a failure in restructuring the 
EC block, which resulted in an empty container. We identified the failure, but 
were unable to successfully clean up the container.
   
   The optimization measures still need to be added.
   
   cc: @adoroszlai 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to