[
https://issues.apache.org/jira/browse/HDDS-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354991#comment-17354991
]
Bharat Viswanadham commented on HDDS-5263:
------------------------------------------
SCM went into safe mode and never came out of it after an SCM restart.
|INFO|SCMSafeModeManager|SCM in safe mode. Pipelines with at least one datanode
reported count is 1, required at least one datanode reported per pipeline count
is 6|
However, at this time, Recon shows there are 6 open Ratis(3) pipelines and 10
open Ratis(1) pipelines.
{code:java}
When SCM started, it had 6 pipelines in the open state; we read this from the DB.
{code:java}
783833 2021-05-20 18:00:54,613 INFO
org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total
pipeline count is 6, pipeline's with at least one datanode reported
threshold count is 6
{code}
But once the SCM Ratis server starts, it replays logs from the TransactionInfo
last-applied index, and after that I see that all pipelines are removed
(possibly due to pipeline close).
Because this SafeMode rule is not successfully validated, SCM never came out of
safe mode.
https://issues.apache.org/jira/browse/HDDS-4399 took care to consider open
pipelines. That works for non-HA, where updates are written to the DB
immediately. In HA, however, we write to the DBTransactionBuffer, so pipelines
may have been closed without the close being applied to the DB. When SCM is
then restarted, PipelineManager is initialized first, reads from the DB, and
gets a pipeline count of 6; SCM then replays its transactions, which removes
those pipelines if the pipeline close happened before the restart. Because of
this, the SCM safe mode rule can never be validated successfully.
783875 2021-05-20 18:00:55,963 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: c79a2082-9cac-4bcf-b303-9beaf84e5998, Nodes:
d8f40fc5-ea38-4fd2-a588-4aaf9ac544d6
{ip: xxxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}
ea53e24e-3d10-4d41-93c9-a568a1627cca
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}
9416da18-1fc4-4cb3-8200-6a71698c808e
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED,
leaderId:9416da18-1fc4-4cb3-8200-6a71698c808e,
CreationTimestamp2021-05-20T18:00:54.497Z] removed.
783882 2021-05-20 18:00:55,970 INFO
org.apache.hadoop.hdds.scm.pipeline.PipelineStateManager: Pipeline Pipeline[
Id: e1b21d65-e80f-4ade-8e78-9bd956183a7c, Nodes:
8fd99eff-7f50-4b56-ad03-1e796030268d
:
:
{ip: xxx.xx.xx.xxx, host: xxx.com, ports: [REPLICATION=9886, RATIS=9858,
RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation:
/default, certSerialId: null, persistedOpState: IN_SERVICE,
persistedOpStateExpiryEpochSec: 0}
, ReplicationConfig: RATIS/THREE, State:CLOSED,
leaderId:ea53e24e-3d10-4d41-93c9-a568a1627cca,
CreationTimestamp2021-05-20T18:00:54.497Z] removed.
{code}
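The restart sequence described in this comment can be reproduced with a small
toy model. This is only a sketch, not Ozone code: the class
SafeModeReplaySketch, its Result type and the simulate method are hypothetical
names made up for illustration. It assumes the rule requires a report for every
pipeline counted at startup (the log above shows a threshold of 6 for 6
pipelines), and it removes five of the six pipelines during replay purely so
that the numbers match the "1 of 6" safe mode message; it makes no claim about
which pipelines were actually removed in the incident.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the restart ordering problem. Hypothetical, not Ozone code. */
final class SafeModeReplaySketch {

  /** Outcome of one simulated restart. */
  static final class Result {
    final int requiredCount;
    final int reportedCount;
    final boolean ruleSatisfied;

    Result(int requiredCount, int reportedCount) {
      this.requiredCount = requiredCount;
      this.reportedCount = reportedCount;
      this.ruleSatisfied = reportedCount >= requiredCount;
    }
  }

  /**
   * @param openInDb open pipelines found in the DB at startup
   * @param closedInBufferOnly pipeline closes that were only in the
   *        transaction buffer before the restart (never flushed to the DB)
   */
  static Result simulate(int openInDb, int closedInBufferOnly) {
    // Step 1: the pipeline manager loads its view from the (stale) DB.
    List<String> ids = new ArrayList<>();
    Set<String> openPipelines = new HashSet<>();
    for (int i = 0; i < openInDb; i++) {
      String id = "pipeline-" + i;
      ids.add(id);
      openPipelines.add(id);
    }

    // Step 2: the safe mode rule captures its threshold from that view.
    // Assumption: every counted pipeline must report at least one datanode.
    int required = openPipelines.size();

    // Step 3: log replay applies the closes that never reached the DB,
    // so those pipelines are removed after the threshold was fixed.
    int toRemove = Math.min(closedInBufferOnly, openInDb);
    for (int i = 0; i < toRemove; i++) {
      openPipelines.remove(ids.get(i));
    }

    // Step 4: reports can only arrive for pipelines that still exist.
    int reported = openPipelines.size();
    return new Result(required, reported);
  }

  public static void main(String[] args) {
    Result r = simulate(6, 5);
    // Prints: required=6 reported=1 satisfied=false
    System.out.println("required=" + r.requiredCount
        + " reported=" + r.reportedCount
        + " satisfied=" + r.ruleSatisfied);
  }
}
```

In this model the rule can only pass when nothing is removed during replay
(simulate(6, 0)), which corresponds to the non-HA case where the DB is already
up to date when the threshold is computed.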
> SCM may stay in safe mode forever due to incorrect open pipeline count
> ----------------------------------------------------------------------
>
> Key: HDDS-5263
> URL: https://issues.apache.org/jira/browse/HDDS-5263
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM HA
> Reporter: George Huang
> Assignee: Bharat Viswanadham
> Priority: Major
> Labels: pull-request-available
> Attachments: SCM HA SafeMode .pdf
>
>
> After an unclean shutdown, SCM may never come out of the safe mode.
> Attached a document to explain the problem and the proposal.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)