[ 
https://issues.apache.org/jira/browse/HDDS-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashikant Banerjee updated HDDS-3314:
--------------------------------------
    Description: 
Configuration set before running the command:

"ozone.scm.stale.node.interval": "2m",
 "ozone.scm.dead.node.interval": "4m",
 "hdds.scm.replication.thread.interval": "12s",
 "ozone.scm.container.size": "1GB"

 

Steps taken:

1) Write a key (smaller than the block size).

2) Shut down two of the datanodes that hold replicas of the container.

3) Try to query the container info.

The container info command failed:

 

 
{noformat}
ozone debug chunkinfo <KeyUri> 
Failed to execute command cmdType: ReadContainer
{noformat}
 

SCM log for that time range:
{noformat}
2020-04-01 10:09:29,665 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for [email protected] (auth:KERBEROS)
2020-04-01 10:09:29,706 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for [email protected] (auth:KERBEROS) for 
protocol=interface 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
2020-04-01 10:09:55,283 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for dn/[email protected] 
(auth:KERBEROS)
2020-04-01 10:09:55,287 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for 
dn/[email protected] (auth:KERBEROS) 
for protocol=interface 
org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol
2020-04-01 10:09:55,474 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Starting Replication 
Monitor Thread.
2020-04-01 10:09:55,486 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 10 milliseconds for processing 33 containers.
2020-04-01 10:10:07,488 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 2 milliseconds for processing 33 containers.
2020-04-01 10:10:17,996 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for dn/[email protected] 
(auth:KERBEROS)
2020-04-01 10:10:18,001 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for 
dn/[email protected] (auth:KERBEROS) 
for protocol=interface 
org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol
2020-04-01 10:10:19,491 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 3 milliseconds for processing 33 containers.
2020-04-01 10:10:31,494 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 2 milliseconds for processing 33 containers.
2020-04-01 10:10:43,495 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 1 milliseconds for processing 33 containers.
2020-04-01 10:10:47,987 ERROR 
org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline 
action CLOSE for Pipeline[ Id: 763bd379-a703-4dc0-85c5-bf385cdc0b18, Nodes: 
92f73ec3-9ed8-41c8-9103-c4c1b2b365e1{ip: 172.27.120.0, host: 
quasar-fjgcwr-1.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: null}b60097cf-7dff-44dc-800f-3500dda636f6{ip: 172.27.123.128, 
host: quasar-fjgcwr-4.quasar-fjgcwr.root.hwx.site, networkLocation: 
/default-rack, certSerialId: null}ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 
172.27.12.195, host: quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:b60097cf-7dff-44dc-800f-3500dda636f6, 
CreationTimestamp2020-04-01T10:04:47.723688Z] from datanode 
ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 172.27.12.195, host: 
quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: 12651664310640168}. Reason : ea2322d9-8ede-4f48-a72d-693e809d2b95 
is in candidate state for 61616ms
2020-04-01 10:10:47,988 INFO 
org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying 
pipeline:Pipeline[ Id: 763bd379-a703-4dc0-85c5-bf385cdc0b18, Nodes: 
92f73ec3-9ed8-41c8-9103-c4c1b2b365e1{ip: 172.27.120.0, host: 
quasar-fjgcwr-1.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: null}b60097cf-7dff-44dc-800f-3500dda636f6{ip: 172.27.123.128, 
host: quasar-fjgcwr-4.quasar-fjgcwr.root.hwx.site, networkLocation: 
/default-rack, certSerialId: null}ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 
172.27.12.195, host: quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:b60097cf-7dff-44dc-800f-3500dda636f6, 
CreationTimestamp2020-04-01T10:04:47.723688Z]{noformat}
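
One thing that can be read off the log above is whether the Replication Monitor thread kept to its configured 12s interval during this time range. The following is only a minimal Python sketch of that check: the timestamps are copied by hand from the "Replication Monitor Thread took ..." lines, and the variable names and the 0.5s tolerance are illustrative assumptions, not part of Ozone.

```python
from datetime import datetime

# Timestamps of the "Replication Monitor Thread took ..." lines in the SCM log above.
monitor_runs = [
    "2020-04-01 10:09:55,486",
    "2020-04-01 10:10:07,488",
    "2020-04-01 10:10:19,491",
    "2020-04-01 10:10:31,494",
    "2020-04-01 10:10:43,495",
]

# Value of hdds.scm.replication.thread.interval from the config above ("12s").
configured_interval_s = 12.0

# Assumed tolerance for scheduling jitter; chosen arbitrarily for this sketch.
tolerance_s = 0.5

# Parse the log timestamps (milliseconds follow the comma).
times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f") for ts in monitor_runs]

# Gap in seconds between each pair of consecutive runs.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

for gap in gaps:
    print(f"gap between runs: {gap:.3f}s")

# True if every gap is within the tolerance of the configured interval.
on_schedule = all(abs(gap - configured_interval_s) < tolerance_s for gap in gaps)
print("on schedule:", on_schedule)
```

If the gaps stay close to 12 seconds, the replication thread itself was running normally, which suggests looking at the ReadContainer path rather than at ReplicationManager scheduling.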
 

  was:
Configuration set before running the command:

"ozone.scm.stale.node.interval": "2m",
"ozone.scm.dead.node.interval": "4m",
"hdds.scm.replication.thread.interval": "12s",
"ozone.scm.container.size": "1GB"

 

Steps taken:

1) Write a key (smaller than the block size).

2) Shut down two of the datanodes that hold replicas of the container.

3) Try to query the container info.

The container info command failed:

 

 
{noformat}
ozone scmcli container info 33 | egrep 'Container|Datanodes'
Failed to execute command cmdType: ReadContainer
{noformat}
 

SCM log for that time range:
{noformat}
2020-04-01 10:09:29,665 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for [email protected] (auth:KERBEROS)
2020-04-01 10:09:29,706 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for [email protected] (auth:KERBEROS) for 
protocol=interface 
org.apache.hadoop.hdds.scm.protocol.StorageContainerLocationProtocol
2020-04-01 10:09:55,283 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for dn/[email protected] 
(auth:KERBEROS)
2020-04-01 10:09:55,287 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for 
dn/[email protected] (auth:KERBEROS) 
for protocol=interface 
org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol
2020-04-01 10:09:55,474 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Starting Replication 
Monitor Thread.
2020-04-01 10:09:55,486 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 10 milliseconds for processing 33 containers.
2020-04-01 10:10:07,488 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 2 milliseconds for processing 33 containers.
2020-04-01 10:10:17,996 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for dn/[email protected] 
(auth:KERBEROS)
2020-04-01 10:10:18,001 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for 
dn/[email protected] (auth:KERBEROS) 
for protocol=interface 
org.apache.hadoop.ozone.protocol.StorageContainerDatanodeProtocol
2020-04-01 10:10:19,491 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 3 milliseconds for processing 33 containers.
2020-04-01 10:10:31,494 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 2 milliseconds for processing 33 containers.
2020-04-01 10:10:43,495 INFO 
org.apache.hadoop.hdds.scm.container.ReplicationManager: Replication Monitor 
Thread took 1 milliseconds for processing 33 containers.
2020-04-01 10:10:47,987 ERROR 
org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline 
action CLOSE for Pipeline[ Id: 763bd379-a703-4dc0-85c5-bf385cdc0b18, Nodes: 
92f73ec3-9ed8-41c8-9103-c4c1b2b365e1{ip: 172.27.120.0, host: 
quasar-fjgcwr-1.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: null}b60097cf-7dff-44dc-800f-3500dda636f6{ip: 172.27.123.128, 
host: quasar-fjgcwr-4.quasar-fjgcwr.root.hwx.site, networkLocation: 
/default-rack, certSerialId: null}ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 
172.27.12.195, host: quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:b60097cf-7dff-44dc-800f-3500dda636f6, 
CreationTimestamp2020-04-01T10:04:47.723688Z] from datanode 
ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 172.27.12.195, host: 
quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: 12651664310640168}. Reason : ea2322d9-8ede-4f48-a72d-693e809d2b95 
is in candidate state for 61616ms
2020-04-01 10:10:47,988 INFO 
org.apache.hadoop.hdds.scm.pipeline.SCMPipelineManager: Destroying 
pipeline:Pipeline[ Id: 763bd379-a703-4dc0-85c5-bf385cdc0b18, Nodes: 
92f73ec3-9ed8-41c8-9103-c4c1b2b365e1{ip: 172.27.120.0, host: 
quasar-fjgcwr-1.quasar-fjgcwr.root.hwx.site, networkLocation: /default-rack, 
certSerialId: null}b60097cf-7dff-44dc-800f-3500dda636f6{ip: 172.27.123.128, 
host: quasar-fjgcwr-4.quasar-fjgcwr.root.hwx.site, networkLocation: 
/default-rack, certSerialId: null}ea2322d9-8ede-4f48-a72d-693e809d2b95{ip: 
172.27.12.195, host: quasar-fjgcwr-7.quasar-fjgcwr.root.hwx.site, 
networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, 
State:OPEN, leaderId:b60097cf-7dff-44dc-800f-3500dda636f6, 
CreationTimestamp2020-04-01T10:04:47.723688Z]{noformat}
 


> Fix ContainerOperationClient#readContainer to use Grpc Client to read from 
> datanode
> -----------------------------------------------------------------------------------
>
>                 Key: HDDS-3314
>                 URL: https://issues.apache.org/jira/browse/HDDS-3314
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: SCM, SCM Client
>            Reporter: Nilotpal Nandi
>            Assignee: Sadanand Shenoy
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h


