[jira] [Comment Edited] (HDFS-12132) Both two NameNodes become Standby because the ZKFC exception

2017-07-13 Thread Yang Jiandan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086799#comment-16086799
 ] 

Yang Jiandan edited comment on HDFS-12132 at 7/14/17 4:36 AM:
--

Thank you for your response. I understand what you mean: ZKFC made NN2 become 
standby as well, so the whole HDFS became unavailable, which in turn made YARN 
and HBase unavailable. The consequences are too serious. A better solution 
would be for ZKFC to throw an exception and exit; the most important thing is 
to keep the Active NameNode's state unchanged.


was (Author: yangjiandan):
I understand what you mean. ZKFC let NN2 also become standby, and the whole 
HDFS is not available, which leads to Yarn and HBase are also not available. so 
the consequences are too serious. The better solution is that the ZKFC throws 
exception and exits, the most important is keeping the Active NameNode state 
unchanged.

> Both two NameNodes become Standby because the ZKFC exception
> 
>
> Key: HDFS-12132
> URL: https://issues.apache.org/jira/browse/HDFS-12132
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover
>Affects Versions: 2.8.1
>Reporter: Yang Jiandan
>
> When rolling-upgrading from Hadoop 2.6.5 to Hadoop 2.8.0, the Active NameNode 
> became Standby because of a ZKFC exception while the Standby NameNode stayed 
> Standby, leaving HDFS unavailable. The exception-handling logic in ZKFC seems 
> problematic; ZKFC should guarantee that there is an active NameNode.
> Before upgrading, the cluster was deployed with HA: NN1 was active and NN2 
> was standby.
> The configuration before upgrading was as follows:
> {code:java}
> dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
> dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
> {code}
> After upgrading, we added the configuration for the separate service RPC and 
> lifeline addresses:
> {code:java}
> dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
> dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
> dfs.namenode.servicerpc-address.nameservice.nn1 nn1:8021
> dfs.namenode.servicerpc-address.nameservice.nn2 nn2:8021
> dfs.namenode.lifeline.rpc-address.nameservice.nn1 nn1:8022
> dfs.namenode.lifeline.rpc-address.nameservice.nn2 nn2:8022
> {code}
> The upgrade steps were as follows:
> 1. Upgrade NN2: restart the NameNode process on NN2
> 2. Upgrade NN1: restart the NameNode process on NN1; NN2 then becomes active 
> and NN1 becomes standby
> 3. Restart both ZKFCs, on NN1 and NN2
> After the ZKFCs were restarted, the Active NameNode became Standby and the 
> Standby NameNode did not become Active. Both ZKFCs kept repeating the 
> following loop and threw the same exception over and over.
> {code:java}
> createLockNodeAsync()    // creates the lock znode successfully
> becomeActive()           // returns false
> terminateConnection()    // deletes the EPHEMERAL znode 
> 'ActiveStandbyElectorLock'
> sleepFor(sleepTime)
> {code}
> After running 'hdfs zkfc -formatZK', ZKFC went back to normal.
> The ZKFC exception log is:
> {code:java}
> 2017-07-11 18:49:44,311 WARN [main-EventThread] 
> org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of 
> election
> java.lang.RuntimeException: Mismatched address stored in ZK for NameNode at 
> nn2/xx.xxx.xx.xxx:8022: Stored protobuf was nameserviceId: “nameservice”
> namenodeId: "nn2"
> hostname: “nn2_hostname”
> port: 8020
> zkfcPort: 8019
> , address from our own configuration for this NameNode was 
> nn2_hostname/xx.xxx.xx.xxx:8021
> at 
> org.apache.hadoop.hdfs.tools.DFSZKFailoverController.dataToTarget(DFSZKFailoverController.java:87)
> at 
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:506)
> at 
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
> at 
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:895)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:985)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:882)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:467)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-07-11 18:49:44,311 INFO [main-EventThread] 
> org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2017-07-11 18:49:44,311 INFO [main-EventThread] 
> org.apache.zookeeper.ZooKeeper: Session: 0x15c3ada0ec319aa closed
> {code}
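To make the failure mode above easier to follow, here is a minimal, 
self-contained sketch of the address check that the stack trace points at 
(DFSZKFailoverController#dataToTarget). It is not the Hadoop implementation: 
the StoredNodeInfo class and the checkStoredAddress method are invented 
stand-ins for the protobuf that ZKFC stores in ZooKeeper and for the real 
comparison logic. The point is only that a stored port of 8020 (the pre-upgrade 
RPC address) versus a locally configured service RPC port of 8021 makes the 
check throw, so becomeActive() keeps failing and the elector keeps looping as 
quoted above.

{code:java}
// Illustrative sketch only -- invented names, not the Hadoop code. The real
// check lives in DFSZKFailoverController#dataToTarget and compares the
// protobuf stored in ZooKeeper with the locally configured NameNode address.
final class StoredNodeInfo {               // stand-in for the data kept in ZK
    final String nameserviceId, namenodeId, hostname;
    final int port;
    StoredNodeInfo(String ns, String nn, String host, int port) {
        this.nameserviceId = ns;
        this.namenodeId = nn;
        this.hostname = host;
        this.port = port;
    }
}

public class AddressMismatchSketch {
    static void checkStoredAddress(StoredNodeInfo stored,
                                   String configuredHost, int configuredPort) {
        // Before the upgrade ZooKeeper recorded nn2:8020; after the upgrade
        // the local configuration resolves the new service RPC port 8021,
        // so the comparison fails and a RuntimeException is thrown.
        if (!stored.hostname.equals(configuredHost)
                || stored.port != configuredPort) {
            throw new RuntimeException("Mismatched address stored in ZK for "
                + stored.namenodeId + ": stored " + stored.hostname + ":"
                + stored.port + ", configured " + configuredHost + ":"
                + configuredPort);
        }
    }

    public static void main(String[] args) {
        StoredNodeInfo stored =
            new StoredNodeInfo("nameservice", "nn2", "nn2_hostname", 8020);
        // Throws: becomeActive()/fencing fails, the elector gives up the lock
        // znode, sleeps, and re-enters the loop quoted above.
        checkStoredAddress(stored, "nn2_hostname", 8021);
    }
}
{code}

Running 'hdfs zkfc -formatZK' clears the stale data in ZooKeeper, which is 
consistent with the observation above that reformatting ZK brought ZKFC back 
to normal.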



[jira] [Created] (HDFS-12132) Both two NameNodes become Standby because the ZKFC exception

2017-07-13 Thread Yang Jiandan (JIRA)
Yang Jiandan created HDFS-12132:
---

 Summary: Both two NameNodes become Standby because the ZKFC 
exception
 Key: HDFS-12132
 URL: https://issues.apache.org/jira/browse/HDFS-12132
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover
Affects Versions: 2.8.1
Reporter: Yang Jiandan


Both NameNodes became Standby because of a ZKFC exception when rolling-upgrading 
from Hadoop 2.6.5 to Hadoop 2.8.0, leaving HDFS unavailable. The 
exception-handling logic in ZKFC seems problematic; ZKFC should guarantee that 
there is an active NameNode.

Before upgrading, the cluster was deployed with HA: NN1 was active and NN2 was 
standby.
The configuration before upgrading was as follows:

{code:java}
dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
{code}

After upgrading, we added the configuration for the separate service RPC and 
lifeline addresses:
{code:java}
dfs.namenode.rpc-address.nameservice.nn1 nn1:8020
dfs.namenode.rpc-address.nameservice.nn2 nn2:8020
dfs.namenode.servicerpc-address.nameservice.nn1 nn1:8021
dfs.namenode.servicerpc-address.nameservice.nn2 nn2:8021
dfs.namenode.lifeline.rpc-address.nameservice.nn1 nn1:8022
dfs.namenode.lifeline.rpc-address.nameservice.nn2 nn2:8022
{code}
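For reference, here is a minimal sketch of how these keys fit together, using 
only org.apache.hadoop.conf.Configuration set/get. The resolveServiceAddress 
helper and its fallback-to-rpc-address behavior are simplified assumptions of 
mine, mirroring what the exception below implies: once a servicerpc address is 
configured, the locally resolved address becomes nn2:8021 while ZooKeeper still 
holds nn2:8020.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: resolveServiceAddress is a simplified, hypothetical helper,
// not a Hadoop API. It mimics the assumption that the service RPC address is
// preferred over the client RPC address once it is configured.
public class ServiceRpcConfigSketch {
    static String resolveServiceAddress(Configuration conf, String ns, String nn) {
        String serviceAddr = conf.get("dfs.namenode.servicerpc-address." + ns + "." + nn);
        return serviceAddr != null
            ? serviceAddr
            : conf.get("dfs.namenode.rpc-address." + ns + "." + nn);
    }

    public static void main(String[] args) {
        Configuration before = new Configuration(false);
        before.set("dfs.namenode.rpc-address.nameservice.nn2", "nn2:8020");

        Configuration after = new Configuration(false);
        after.set("dfs.namenode.rpc-address.nameservice.nn2", "nn2:8020");
        after.set("dfs.namenode.servicerpc-address.nameservice.nn2", "nn2:8021");
        after.set("dfs.namenode.lifeline.rpc-address.nameservice.nn2", "nn2:8022");

        // Before the upgrade everything resolved to nn2:8020; afterwards the
        // service address is nn2:8021, while ZooKeeper still remembers 8020.
        System.out.println(resolveServiceAddress(before, "nameservice", "nn2")); // nn2:8020
        System.out.println(resolveServiceAddress(after, "nameservice", "nn2"));  // nn2:8021
    }
}
{code}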

The upgrade steps were as follows:
1. Upgrade NN2: restart the NameNode process on NN2
2. Upgrade NN1: restart the NameNode process on NN1; NN2 then becomes active 
and NN1 becomes standby
3. Restart both ZKFCs, on NN1 and NN2

After the ZKFCs were restarted, both ZKFCs threw the same exception and both 
NameNodes became Standby. The exception log is:

{code:java}
2017-07-11 18:49:44,311 WARN [main-EventThread] 
org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of 
election
java.lang.RuntimeException: Mismatched address stored in ZK for NameNode at 
nn2/xx.xxx.xx.xxx:8022: Stored protobuf was nameserviceId: “nameservice”
namenodeId: "nn2"
hostname: “nn2_hostname”
port: 8020
zkfcPort: 8019
, address from our own configuration for this NameNode was 
nn2_hostname/xx.xxx.xx.xxx:8021
at 
org.apache.hadoop.hdfs.tools.DFSZKFailoverController.dataToTarget(DFSZKFailoverController.java:87)
at 
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:506)
at 
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at 
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:895)
at 
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:985)
at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:882)
at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:467)
at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-07-11 18:49:44,311 INFO [main-EventThread] 
org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2017-07-11 18:49:44,311 INFO [main-EventThread] org.apache.zookeeper.ZooKeeper: 
Session: 0x15c3ada0ec319aa closed
{code}




[jira] [Created] (HDFS-7145) DFSInputStream does not return when reading

2014-09-25 Thread Yang Jiandan (JIRA)
Yang Jiandan created HDFS-7145:
--

 Summary: DFSInputStream does not return when reading
 Key: HDFS-7145
 URL: https://issues.apache.org/jira/browse/HDFS-7145
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 2.5.0
Reporter: Yang Jiandan
Priority: Critical


We found that DFSInputStream#read does not return when HBase handlers read 
files from HDFS; all handlers are stuck in 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(). The jstack 
output is as follows:
RS_PARALLEL_SEEK-hadoop474:60020-9 prio=10 tid=0x7f7350be nid=0x1572 
runnable [0x5a9de000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked 0x00039ad6e730 (a sun.nio.ch.Util$2)
- locked 0x00039ad6e320 (a java.util.Collections$UnmodifiableSet)
- locked 0x0002bf480738 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:395)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1023)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:966)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1293)
at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1223)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1430)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1312)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:392)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:532)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:553)
at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:237)
at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
at 
org.apache.hadoop.hbase.regionserver.handler.ParallelSeekHandler.process(ParallelSeekHandler.java:57)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

I read the HDFS source code and found:
1. NioInetPeer#in and NioInetPeer#out are created with a default timeout of 0:

  NioInetPeer(Socket socket) throws IOException {
    this.socket = socket;
    this.in = new SocketInputStream(socket.getChannel(), 0);
    this.out = new SocketOutputStream(socket.getChannel(), 0);
    this.isLocal = socket.getInetAddress().equals(socket.getLocalAddress());
  }

which results in SocketIOWithTimeout#timeout = 0.
2. BlockReaderPeer#peer does not set a read timeout or a write timeout.
Together these lead to 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(timeout=0), which 
never returns.
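To make the timeout = 0 behavior concrete, here is a small, self-contained 
java.nio example; it uses no Hadoop classes, only the standard Selector API. A 
positive timeout makes select() return once the timeout elapses, while a zero 
timeout means "block indefinitely", which matches the hang in 
SelectorPool.select() shown in the jstack above.

{code:java}
import java.nio.channels.Selector;

// Plain JDK illustration of the select timeout semantics described above.
public class SelectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Selector selector = Selector.open()) {
            long start = System.currentTimeMillis();
            // With a positive timeout, select() returns after ~500 ms even
            // though no channel is registered or ready.
            selector.select(500);
            System.out.println("select(500) returned after "
                + (System.currentTimeMillis() - start) + " ms");

            // A zero timeout means "block indefinitely": the call below would
            // never return, which is the behavior the report describes for
            // SocketIOWithTimeout$SelectorPool.select() when timeout == 0.
            // selector.select(0);
        }
    }
}
{code}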



[jira] [Updated] (HDFS-7145) DFSInputStream does not return when reading

2014-09-25 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-7145:
---
Description: 
We found that DFSInputStream#read does not return when HBase handlers read 
files from HDFS; all handlers are stuck in 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(). The jstack 
output is as follows:
RS_PARALLEL_SEEK-hadoop474:60020-9 prio=10 tid=0x7f7350be nid=0x1572 
runnable [0x5a9de000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked 0x00039ad6e730 (a sun.nio.ch.Util$2)
- locked 0x00039ad6e320 (a java.util.Collections$UnmodifiableSet)
- locked 0x0002bf480738 (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:395)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1023)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:966)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1293)
at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1223)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1430)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1312)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:392)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:532)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:553)
at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:237)
at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
at 
org.apache.hadoop.hbase.regionserver.handler.ParallelSeekHandler.process(ParallelSeekHandler.java:57)
at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

I read the HDFS source code and found:
1. NioInetPeer#in and NioInetPeer#out are created with a default timeout of 0, 
which propagates down to SocketIOWithTimeout#timeout:
{code:java}
  NioInetPeer(Socket socket) throws IOException {
    this.socket = socket;
    this.in = new SocketInputStream(socket.getChannel(), 0);
    this.out = new SocketOutputStream(socket.getChannel(), 0);
    this.isLocal = socket.getInetAddress().equals(socket.getLocalAddress());
  }

  public SocketInputStream(ReadableByteChannel channel, long timeout)
      throws IOException {
    SocketIOWithTimeout.checkChannelValidity(channel);
    reader = new Reader(channel, timeout);
  }

  Reader(ReadableByteChannel channel, long timeout) throws IOException {
    super((SelectableChannel) channel, timeout);
    this.channel = channel;
  }

  SocketIOWithTimeout(SelectableChannel channel, long timeout)
      throws IOException {
    checkChannelValidity(channel);

    this.channel = channel;
    this.timeout = timeout;
    // Set non-blocking
    channel.configureBlocking(false);
  }
{code}
and 

[jira] [Updated] (HDFS-7145) DFSInputStream does not return when reading

2014-09-25 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-7145:
---
Attachment: HDFS-7145.patch

 DFSInputStream does not return when reading
 ---

 Key: HDFS-7145
 URL: https://issues.apache.org/jira/browse/HDFS-7145
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 2.5.0
Reporter: Yang Jiandan
Priority: Critical
 Attachments: HDFS-7145.patch


 We found that DFSInputStream#read does not return when hbase handlers read 
 files from hdfs, and all handlers are in the 
 org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(). jstack is as 
 follows:
 RS_PARALLEL_SEEK-hadoop474:60020-9 prio=10 tid=0x7f7350be nid=0x1572 runnable [0x5a9de000]
    java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
 at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
 at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
 at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
 - locked 0x00039ad6e730 (a sun.nio.ch.Util$2)
 - locked 0x00039ad6e320 (a java.util.Collections$UnmodifiableSet)
 - locked 0x0002bf480738 (a sun.nio.ch.EPollSelectorImpl)
 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
 at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
 at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
 at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
 at java.io.FilterInputStream.read(FilterInputStream.java:83)
 at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
 at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:395)
 at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
 at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
 at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
 at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1023)
 at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:966)
 at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1293)
 at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1223)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1430)
 at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1312)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:392)
 at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:532)
 at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:553)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:237)
 at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
 at org.apache.hadoop.hbase.regionserver.handler.ParallelSeekHandler.process(ParallelSeekHandler.java:57)
 at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 I read the HDFS source code and discovered:
 1. NioInetPeer#in and NioInetPeer#out are created with a default timeout of 0
 {code:java}
   NioInetPeer(Socket socket) throws IOException {
     this.socket = socket;
     this.in = new SocketInputStream(socket.getChannel(), 0);
     this.out = new SocketOutputStream(socket.getChannel(), 0);
     this.isLocal = socket.getInetAddress().equals(socket.getLocalAddress());
   }
   public SocketInputStream(ReadableByteChannel channel, long timeout)
       throws IOException {
 

[jira] [Commented] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-25 Thread Yang Jiandan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147574#comment-14147574
 ] 

Yang Jiandan commented on HDFS-6999:


@Jianshi Huang your problem is not the same as ours, although the symptom is the 
same: all HBase handlers are blocked. The good news is that we hit the same issue 
and have resolved it; I have now attached a patch file to HDFS-7145. 
Details are in https://issues.apache.org/jira/browse/HDFS-7145

 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop; 
 the following while loop never breaks.
 {code:java}
 while (buf.remaining() > 0) {
   int n = ch.read(buf);
   if (n < 0) {
     throw new IOException("Premature EOF reading from " + ch);
   }
 }
 {code}
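The sketch below is a hypothetical defensive rewrite of that loop, not the patch attached to HDFS-7145: it shows why the thread spins (ch.read() keeps returning 0 once the peer stalls mid-packet, so n < 0 is never reached) and one way to at least turn the spin into a diagnosable failure. The zero-read limit is an arbitrary assumption; the actual fix discussed above is to give the underlying socket a real timeout.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical defensive variant (not the Hadoop patch). If read() keeps
// returning 0 because the peer stalled mid-packet, the original loop spins
// at 100% CPU forever; bounding consecutive zero-byte reads at least turns
// the spin into an exception that can be logged and investigated.
final class ReadFullyWithProgressCheck {
  private static final int MAX_CONSECUTIVE_ZERO_READS = 1_000_000; // arbitrary

  static void readChannelFully(ReadableByteChannel ch, ByteBuffer buf) throws IOException {
    int zeroReads = 0;
    while (buf.remaining() > 0) {
      int n = ch.read(buf);
      if (n < 0) {
        throw new IOException("Premature EOF reading from " + ch);
      }
      if (n == 0) {
        if (++zeroReads > MAX_CONSECUTIVE_ZERO_READS) {
          throw new IOException("No progress reading from " + ch
              + ", peer appears to have stalled mid-packet");
        }
      } else {
        zeroReads = 0; // made progress, reset the counter
      }
    }
  }
}
{code}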



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-08 Thread Yang Jiandan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14125275#comment-14125275
 ] 

Yang Jiandan commented on HDFS-6999:


We can't reproduce it reliably and don't know the exact combination of conditions 
yet. In our configuration dfs.datanode.transferTo.allowed=true, so we suspect that 
BlockSender may, for some reason, send only the packet header and not the data 
part of the packet.

 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop; 
 the following while loop never breaks.
 {code:java}
 while (buf.remaining() > 0) {
   int n = ch.read(buf);
   if (n < 0) {
     throw new IOException("Premature EOF reading from " + ch);
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)
Yang Jiandan created HDFS-6999:
--

 Summary: PacketReceiver#readChannelFully is in an infinite loop
 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-6999:
---
Description: In our cluster, we found that an HBase handler may never return 
when it reads an HDFS file using RemoteBlockReader2, and the handler thread 
occupies 100% CPU. We found this is because 

 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122300#comment-14122300
 ] 

Yang Jiandan commented on HDFS-6999:


The stack is:
regionserver60020-largeCompactions-1409055324582 daemon prio=10 tid=0x01080800 nid=0x2c7c runnable [0x601cb000]
   java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.net.unix.DomainSocket.readByteBufferDirect0(Native Method)
at org.apache.hadoop.net.unix.DomainSocket.access$400(DomainSocket.java:45)
at org.apache.hadoop.net.unix.DomainSocket$DomainChannel.read(DomainSocket.java:628)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:173)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
- locked 0x00047c41f7e0 (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:682)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:738)
- locked 0x0004aaceca60 (a org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:795)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
- locked 0x0004aaceca60 (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:563)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1215)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1430)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1312)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:392)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.readNextDataBlock(HFileReaderV2.java:643)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.next(HFileReaderV2.java:757)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:136)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:507)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:217)
at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:76)
at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1086)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1480)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

   Locked ownable synchronizers:
- 0x00049e162b60 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- 0x0005974a84f0 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- 0x00065e45cf58 (a java.util.concurrent.ThreadPoolExecutor$Worker)

 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-6999:
---
Description: In our cluster, we found that an HBase handler may never return 
when it reads an HDFS file using RemoteBlockReader2, and the handler thread 
occupies 100% CPU. We found this is because PacketReceiver#readChannelFully is in an 
infinite loop.  (was: In our cluster, we found that an HBase handler may never 
return when it reads an HDFS file using RemoteBlockReader2, and the handler thread 
occupies 100% CPU. We found this is because )

 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-6999:
---
Description: 
In our cluster, we found that an HBase handler may never return when it reads an 
HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop; the 
following while loop never breaks.

while (buf.remaining() > 0) {
  int n = ch.read(buf);
  if (n < 0) {
    throw new IOException("Premature EOF reading from " + ch);
  }
}

  was:
In our cluster, we found that an HBase handler may never return when it reads an 
HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop; the 
following while loop never breaks.
while (buf.remaining() > 0) {
  int n = ch.read(buf);
  if (n < 0) {
    throw new IOException("Premature EOF reading from " + ch);
  }
}


 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop; 
 the following while loop never breaks.
 while (buf.remaining() > 0) {
   int n = ch.read(buf);
   if (n < 0) {
     throw new IOException("Premature EOF reading from " + ch);
   }
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-6999:
---
Description: 
In our cluster, we found that an HBase handler may never return when it reads an 
HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop; the 
following while loop never breaks.
while (buf.remaining() > 0) {
  int n = ch.read(buf);
  if (n < 0) {
    throw new IOException("Premature EOF reading from " + ch);
  }
}

  was:In our cluster, we found that an HBase handler may never return when it reads 
an HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop.


 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop; 
 the following while loop never breaks.
 while (buf.remaining() > 0) {
   int n = ch.read(buf);
   if (n < 0) {
     throw new IOException("Premature EOF reading from " + ch);
   }
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6999) PacketReceiver#readChannelFully is in an infinite loop

2014-09-04 Thread Yang Jiandan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jiandan updated HDFS-6999:
---
Description: 
In our cluster, we found that an HBase handler may never return when it reads an 
HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop; the 
following while loop never breaks.
{code:java}
while (buf.remaining() > 0) {
  int n = ch.read(buf);
  if (n < 0) {
    throw new IOException("Premature EOF reading from " + ch);
  }
}
{code}

  was:
In our cluster, we found that an HBase handler may never return when it reads an 
HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
found this is because PacketReceiver#readChannelFully is in an infinite loop; the 
following while loop never breaks.

while (buf.remaining() > 0) {
  int n = ch.read(buf);
  if (n < 0) {
    throw new IOException("Premature EOF reading from " + ch);
  }
}


 PacketReceiver#readChannelFully is in an infinite loop
 --

 Key: HDFS-6999
 URL: https://issues.apache.org/jira/browse/HDFS-6999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 2.4.1
Reporter: Yang Jiandan
Priority: Critical

 In our cluster, we found that an HBase handler may never return when it reads an 
 HDFS file using RemoteBlockReader2, and the handler thread occupies 100% CPU. We 
 found this is because PacketReceiver#readChannelFully is in an infinite loop; 
 the following while loop never breaks.
 {code:java}
 while (buf.remaining() > 0) {
   int n = ch.read(buf);
   if (n < 0) {
     throw new IOException("Premature EOF reading from " + ch);
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-3718) Datanode won't shutdown because of runaway DataBlockScanner thread

2013-08-18 Thread Yang Jiandan (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13743494#comment-13743494
 ] 

Yang Jiandan commented on HDFS-3718:


Why is the interrupt missed?
I found that every place which throws InterruptedException also calls 
Thread.interrupt() to set the interrupt flag.
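For reference, here is a small stand-alone sketch (my own illustration, not DataBlockScanner itself, and only one common explanation) of how an interrupt can still get lost: Thread.sleep() clears the interrupt status when it throws InterruptedException, so unless the catch block restores it with Thread.currentThread().interrupt(), a later isInterrupted() check sees false and the loop keeps printing its message.

{code:java}
// Stand-alone demo of a "lost" interrupt; not Hadoop code.
public class LostInterruptDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread scanner = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        System.out.println("Starting a new period");
        try {
          Thread.sleep(5_000);
        } catch (InterruptedException e) {
          // BUG: sleep() has already cleared the interrupt status, and we do
          // not restore it, so the while condition above never becomes true.
          // Correct handling would be: Thread.currentThread().interrupt();
        }
      }
    });
    scanner.setDaemon(true);   // let the JVM exit despite the bug
    scanner.start();
    Thread.sleep(1_000);
    scanner.interrupt();       // analogous to calling shutdown() on the scanner
    scanner.join(10_000);
    System.out.println("scanner still alive: " + scanner.isAlive()); // prints true
  }
}
{code}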

 Datanode won't shutdown because of runaway DataBlockScanner thread
 --

 Key: HDFS-3718
 URL: https://issues.apache.org/jira/browse/HDFS-3718
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.0.1-alpha
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Critical
 Fix For: 0.23.3, 2.0.2-alpha

 Attachments: hdfs-3718.patch.txt


 Datanode sometimes does not shut down because the block pool scanner thread 
 keeps running. It prints out "Starting a new period" every five seconds, even 
 after {{shutdown()}} is called.  Somehow the interrupt is missed.
 {{DataBlockScanner}} will also terminate if {{datanode.shouldRun}} is false, 
 but in {{DataNode#shutdown}}, {{DataBlockScanner#shutdown()}} is invoked 
 before it is set to false.
 Is there any reason why {{datanode.shouldRun}} is set to false later? 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira