[ 
https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354991#comment-17354991
 ] 

Bharat Viswanadham commented on HDDS-5263:
------------------------------------------

SCM went into safe mode and never came out of it after an SCM restart. 
|INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one datanode 
reported count is 1, required at least one datanode reported per pipeline count 
is 6|

However, at this time, Recon shows there are 6 open Ratis(3) pipelines and 10 
open Ratis(1) pipelines.
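
To make the condition in the log line above concrete, here is a minimal sketch of the comparison it describes. It is an illustration only, not the Ozone implementation: the class and method names are hypothetical, and the values 1 and 6 are the ones from the log message.

```java
// Hypothetical sketch; this is NOT the actual OneReplicaPipelineSafeModeRule code.
final class OneReplicaRuleCheck {
    private OneReplicaRuleCheck() { }

    /** Returns true when enough pipelines have at least one datanode reported. */
    static boolean isSatisfied(int reportedPipelines, int requiredPipelines) {
        return reportedPipelines >= requiredPipelines;
    }

    /** Number of further pipeline reports needed before the check passes. */
    static int missingReports(int reportedPipelines, int requiredPipelines) {
        return Math.max(0, requiredPipelines - reportedPipelines);
    }
}
```

With the logged values, `isSatisfied(1, 6)` is false and `missingReports(1, 6)` is 5, so the rule keeps SCM in safe mode until five more pipelines report.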

 


When SCM started, it had 6 pipelines in the open state; this count is read from 
the DB:
{code:java}
783833 2021-05-20 18:00:54,613 INFO org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 6, pipeline's with at least one datanode reported threshold count is 6
{code}
But once the SCM Ratis server starts, it replays the log from the 
TransactionInfo last applied index, and after that I see that all the pipelines 
are removed (probably because the pipelines had been closed).

Because this safe mode rule can never be validated successfully, SCM never came 
out of safe mode.

https://issues.apache.org/jira/browse/HDDS-4399 took care to consider only open 
pipelines. That works for non-HA, where updates are written to the DB 
immediately. But in HA we write to the DBTransactionBuffer, so pipelines may 
have been closed without the close being applied to the DB. When SCM is then 
restarted, PipelineManager is initialized first: it reads from the DB and gets a 
pipeline count of 6. SCM then replays its transactions, which removes those 
pipelines if the pipeline close happened before the restart. Because of this, 
the SCM safe mode rule cannot be validated successfully.

{code:java}
783875 2021-05-20 18:00:55,963 INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes:
d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6{ip: xxxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
ea53e24e-3d10-4d41-93c9-a568a1627cca{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
9416da18-1fc4-4cb3-8200-6a71698c808e{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED, leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e, CreationTimestamp2021-05-20T18:00:54.497Z] removed.

783882 2021-05-20 18:00:55,970 INFO org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[ Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes:
8fd99eff-7f50-4b56-ad03-1e796030268d
:
:
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED, leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca, CreationTimestamp2021-05-20T18:00:54.497Z] removed.
{code}
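
The startup ordering described above (read the pipeline count from the DB, then replay transactions that remove those pipelines) can be modelled with a small, self-contained sketch. This is a simplified illustration under stated assumptions, not Ozone code: the class `StaleThresholdSketch`, its methods, and the pipeline ids are all hypothetical, and the threshold is simply taken to be the number of pipelines loaded from the DB.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the restart ordering; none of these names are Ozone APIs.
final class StaleThresholdSketch {
    private final Set<String> knownPipelines = new HashSet<>();
    private final Set<String> reportedPipelines = new HashSet<>();
    private int threshold;

    /** Step 1: load pipelines from the (possibly stale) DB and fix the threshold. */
    void initFromDb(List<String> pipelinesInDb) {
        knownPipelines.addAll(pipelinesInDb);
        threshold = knownPipelines.size();
    }

    /** Step 2: log replay applies a pipeline removal that never reached the DB. */
    void replayRemoval(String pipelineId) {
        knownPipelines.remove(pipelineId);
        // The threshold captured in initFromDb() is deliberately not adjusted,
        // which is the situation this sketch is meant to illustrate.
    }

    /** Step 3: a datanode reports a pipeline; unknown pipelines are ignored. */
    void onPipelineReport(String pipelineId) {
        if (knownPipelines.contains(pipelineId)) {
            reportedPipelines.add(pipelineId);
        }
    }

    /** The rule passes only when the reported count reaches the threshold. */
    boolean ruleSatisfied() {
        return reportedPipelines.size() >= threshold;
    }

    int threshold() {
        return threshold;
    }

    int reportedCount() {
        return reportedPipelines.size();
    }
}
```

In this model, if six pipelines are loaded from the DB and all six are removed during replay, `ruleSatisfied()` stays false no matter how many reports arrive for them, because the threshold remains 6 while the set of pipelines that can be counted has shrunk.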


> SCM may stay in safe mode forever due to incorrect open pipeline count
> ----------------------------------------------------------------------
>
>                 Key: HDDS-5263
>                 URL: https://issues.apache.org/jira/browse/HDDS-5263
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: George Huang
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: SCM HA SafeMode .pdf
>
>
> After an unclean shutdown, SCM may never come out of the safe mode.
> Attached a document to explain the problem and the proposal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
