[
https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
George Huang updated HDDS-5263:
-------------------------------
Description:
SCM went into safe mode after a restart and never came out of it.
|INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one datanode
reported count is 1, required at least one datanode reported per pipeline count
is 6|
However, at this time, Recon shows 6 open Ratis(3) pipelines and 10 open
Ratis(1) pipelines.
When SCM started, it had 6 pipelines in the open state; this count is read from
the DB.
{code:java}
783833 2021-05-20 18:00:54,613 INFO
org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total
pipeline count is 6, pipeline's with at least one datanode reported
threshold count is 6
{code}
But once the SCM Ratis server starts, it replays logs from the Transactioninfo
last applied index, and after that I see all pipelines removed (possibly because
of pipeline close).
Because of this, the safe mode rule is never successfully validated.
https://issues.apache.org/jira/browse/HDDS-4399 took care to consider open
pipelines. That works for non-HA, where updates are written to the DB
immediately. But in HA we write to DBTransactionBuffer, so pipelines may have
been closed without the close being applied to the DB. If SCM is then
restarted, PipelineManager is initialized first, reads from the DB, and gets a
pipeline count of 6. SCM then replays its transactions, which removes those
pipelines if the pipeline close happened before the restart. Because of this,
the SCM safe mode rule cannot be successfully validated.
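The ordering problem described above can be illustrated with a small,
self-contained sketch. This is a toy model, not Ozone code: the class
SafeModeReplaySketch, the nested OneReplicaRule, and the pipeline names p1..p6
are all hypothetical, and it assumes that a report only counts for a pipeline
that still exists after replay. The point it shows is that a threshold fixed
from the DB before replay can no longer be reached once replay removes
pipelines.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the restart ordering; all names here are hypothetical. */
public class SafeModeReplaySketch {

  /** Simplified stand-in for a "one replica reported per pipeline" rule. */
  static final class OneReplicaRule {
    private final int threshold;                       // fixed at creation
    private final Set<String> reported = new HashSet<>();

    OneReplicaRule(int openPipelinesReadFromDb) {
      this.threshold = openPipelinesReadFromDb;
    }

    /** Assumption: a report counts only if the pipeline still exists. */
    void onPipelineReport(String pipelineId, Set<String> livePipelines) {
      if (livePipelines.contains(pipelineId)) {
        reported.add(pipelineId);
      }
    }

    int reportedCount() { return reported.size(); }
    int threshold() { return threshold; }
    boolean validate() { return reported.size() >= threshold; }
  }

  /** Runs the three steps in the order described in this issue. */
  static OneReplicaRule simulateRestart(List<String> pipelinesInDb,
                                        List<String> removedByReplay) {
    Set<String> live = new HashSet<>(pipelinesInDb);

    // Step 1: the rule takes its threshold from the (stale) DB contents.
    OneReplicaRule rule = new OneReplicaRule(live.size());

    // Step 2: log replay applies the pending close/remove transactions.
    live.removeAll(removedByReplay);

    // Step 3: datanodes report; removed pipelines can no longer count.
    for (String id : pipelinesInDb) {
      rule.onPipelineReport(id, live);
    }
    return rule;
  }

  public static void main(String[] args) {
    List<String> inDb = Arrays.asList("p1", "p2", "p3", "p4", "p5", "p6");
    List<String> removed = Arrays.asList("p1", "p2", "p3", "p4", "p5");
    OneReplicaRule rule = simulateRestart(inDb, removed);
    System.out.println("reported=" + rule.reportedCount()
        + ", required=" + rule.threshold()
        + ", exitSafeMode=" + rule.validate());
  }
}
```

With 6 pipelines in the DB and 5 of them removed during replay, the sketch ends
with 1 reported against 6 required, so validate() stays false. One possible
direction, under the same assumptions, would be to compute the threshold only
after replay has finished.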
783875 2021-05-20 18:00:55,963 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes:
d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6{ip: 172.26.32.227, host:
y127.l42scl.hortonworks.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}ea53e24e-3d10-4d41-93c9-a568a1627cca{ip:
172.26.32.232, host: y132.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}9416da18-1fc4-4cb3-8200-6a71698c808e{ip:
172.26.32.231, host: y131.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/THREE,
State:CLOSED, leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e,
CreationTimestamp2021-05-20T18:00:54.497Z] removed.
783882 2021-05-20 18:00:55,970 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes:
8fd99eff-7f50-4b56-ad03-1e796030268d{ip: 172.26.32.226, host:
y126.l42scl.hortonworks.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}3b0677b5-f418-483d-882c-312cdfd3e4a7{ip:
172.26.32.233, host: y133.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}9416da18-1fc4-4cb3-8200-6a71698c808e{ip:
172.26.32.231, host: y131.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/THREE,
State:CLOSED, leaderId:, CreationTimestamp2021-05-20T17:36:27.284Z] removed.
783883 2021-05-20 18:00:55,970 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: bbe2659d-5c04-4bcd-a3d0-7c40e544d80b, Nodes:
3b0677b5-f418-483d-882c-312cdfd3e4a7{ip: 172.26.32.233, host:
y133.l42scl.hortonworks.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6{ip:
172.26.32.227, host: y127.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}9416da18-1fc4-4cb3-8200-6a71698c808e{ip:
172.26.32.231, host: y131.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/THREE,
State:CLOSED, leaderId:, CreationTimestamp2021-05-20T17:36:27.292Z] removed.
783876 2021-05-20 18:00:55,963 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: b2baefae-118a-4a11-bbe2-6f534220fae4, Nodes:
d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6{ip: 172.26.32.227, host:
y127.l42scl.hortonworks.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}9416da18-1fc4-4cb3-8200-6a71698c808e{ip:
172.26.32.231, host: y131.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}ea53e24e-3d10-4d41-93c9-a568a1627cca{ip:
172.26.32.232, host: y132.l42scl.hortonworks.com, ports: [REPLICATION=9886,
RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859],
networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/THREE,
State:CLOSED, leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca,
CreationTimestamp2021-05-20T18:00:54.497Z] removed.
was:
SCM went into safe mode after a restart and never came out of it.
|INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one datanode
reported count is 1, required at least one datanode reported per pipeline count
is 6|
However, at this time, Recon shows 6 open Ratis(3) pipelines and 10 open
Ratis(1) pipelines.
> SCM may stay in safe mode forever
> ---------------------------------
>
> Key: HDDS-5263
> URL: https://issues.apache.org/jira/browse/HDDS-5263
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM HA
> Reporter: George Huang
> Assignee: Bharat Viswanadham
> Priority: Major