[jira] [Commented] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-09 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044171#comment-16044171
 ] 

JiangJiafu commented on ZOOKEEPER-2800:
---

I believe this issue is the same as ZOOKEEPER-2355. Thank you for the reminder, 
[~rakeshr]. I will apply the patch provided in ZOOKEEPER-2355 and see whether 
the problem happens again. I hope the patch works.

> zookeeper ephemeral node not deleted after server restart and consistency is 
> not hold
> -
>
> Key: ZOOKEEPER-2800
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2800
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.11
> Environment: Centos6.5 java8
>Reporter: JiangJiafu
>Priority: Critical
> Attachments: zoo.cfg, zookeeper2.out, zookeeper3.out, zookeeper.out
>
>
> I deployed a ZooKeeper cluster with three nodes:
> ofs_zk1:30.0.0.72
> ofs_zk2:30.0.0.73
> ofs_zk3:30.0.0.99
> On 2017-06-02, I used the C ZooKeeper client to create some ephemeral sequential nodes:
> /adm_election/rolemgr/rolemgr08,
> /adm_election/rolemgr/rolemgr11,
> /adm_election/rolemgr/rolemgr12,
> with session timeout 2 ms.
> Then I restarted ofs_zk1 and ofs_zk2.
> On 2017-06-05, I found that these ephemeral nodes still existed on ofs_zk1.
> I can see the nodes with the zkCli.sh get command on ofs_zk1,
> but these nodes do not exist on ofs_zk2 and ofs_zk3.
> Isn't this odd?
> I have uploaded the whole deployment directory of the three nodes to:
> https://pan.baidu.com/s/1miohiCo
> The log is written to log/zookeeper.out.
> The log of ofs_zk3 is too large, so I only include the first 1000 lines.
> Since I noticed this problem a little late, some snapshots and logs may have been deleted.
> I hope someone can help find the cause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2355) Ephemeral node is never deleted if follower fails while reading the proposal packet

2017-06-09 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044155#comment-16044155
 ] 

JiangJiafu commented on ZOOKEEPER-2355:
---

Can this bug be fixed in 3.4.11? As far as I know, consistency is the most 
important property of ZooKeeper, so I think this bug deserves higher priority than 
many others. 
I hope it can be fixed soon.

> Ephemeral node is never deleted if follower fails while reading the proposal 
> packet
> ---
>
> Key: ZOOKEEPER-2355
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum, server
>Reporter: Mohammad Arshad
>Assignee: Mohammad Arshad
>Priority: Critical
> Fix For: 3.5.4, 3.6.0
>
> Attachments: ZOOKEEPER-2355-01.patch, ZOOKEEPER-2355-02.patch, 
> ZOOKEEPER-2355-03.patch, ZOOKEEPER-2355-04.patch, ZOOKEEPER-2355-05.patch
>
>
> A ZooKeeper ephemeral node is never deleted if a follower fails while reading the 
> proposal packet.
> The scenario is as follows:
> # Configure a three-node ZooKeeper cluster, let's say the nodes are A, B and C; 
> start all of them, and assume A is the leader and B and C are followers.
> # Connect to any of the servers and create ephemeral node /e1.
> # Close the session; ephemeral node /e1 will be scheduled for deletion.
> # While Follower B is receiving the delete proposal, make it fail with 
> {{SocketTimeoutException}}. This is done only to reproduce the scenario; 
> in a production environment it happens because of a network fault.
> # Remove the fault and check that the faulted follower is connected to the 
> quorum again.
> # Connect to any of the servers and create the same ephemeral node /e1; the 
> create succeeds.
> # Close the session; ephemeral node /e1 will be scheduled for deletion.
> # {color:red}/e1 is not deleted from the faulted Follower B. It should have 
> been deleted, as it was created again with another session.{color}
> # {color:green}/e1 is deleted from Leader A and the other Follower C.{color}
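
The scenario above can be mimicked with a small toy model. This is only a sketch under stated assumptions: the Replica class, its methods and the session ids 1 and 2 are hypothetical and are not ZooKeeper's real data tree or protocol code. It assumes that a create on an existing path keeps the existing node and its owner, and that closing a session removes only the ephemeral nodes recorded for that session. Under those assumptions, a follower that misses the first session close ends up keeping /e1 forever.

```java
import java.util.HashMap;
import java.util.Map;

public final class EphemeralDivergenceSketch {

    /** Toy replica (hypothetical, not ZooKeeper code): ephemeral path -> owning session id. */
    static final class Replica {
        final Map<String, Long> ephemerals = new HashMap<>();

        /** Assumption: creating an existing path leaves the old node and its owner untouched. */
        void create(String path, long session) {
            ephemerals.putIfAbsent(path, session);
        }

        /** Assumption: closing a session removes only the nodes owned by that session. */
        void closeSession(long session) {
            ephemerals.values().removeIf(owner -> owner == session);
        }

        boolean exists(String path) {
            return ephemerals.containsKey(path);
        }
    }

    /** Replays the reported steps and returns the replicas in the order A, B, C. */
    static Replica[] run() {
        Replica a = new Replica();
        Replica b = new Replica();
        Replica c = new Replica();
        Replica[] all = {a, b, c};

        // Session 1 creates /e1 on every replica.
        for (Replica r : all) r.create("/e1", 1L);

        // Session 1 closes; B "fails while reading the proposal" and never applies it.
        a.closeSession(1L);
        c.closeSession(1L);

        // Session 2 creates /e1 again and then closes; every replica applies both steps.
        for (Replica r : all) r.create("/e1", 2L);
        for (Replica r : all) r.closeSession(2L);
        return all;
    }

    public static void main(String[] args) {
        String[] names = {"A", "B", "C"};
        Replica[] replicas = run();
        for (int i = 0; i < replicas.length; i++) {
            System.out.println(names[i] + " has /e1: " + replicas[i].exists("/e1"));
        }
    }
}
```

In this model, B still holds /e1 under the stale owner from session 1, so the close of session 2 does not touch it, while A and C end up without the node.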





[jira] [Commented] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-09 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044091#comment-16044091
 ] 

JiangJiafu commented on ZOOKEEPER-2800:
---

I found that the first time the follower tried to reconnect to the leader, it 
sent peerLastZxid 0x13748 to the leader and began to sync the log from 
0x13749, but the sync failed because the network disconnected. The second time 
the follower tried to reconnect, it sent peerLastZxid 0x1385c 
to the leader. Therefore, the log range 0x13749 ~ 0x1385c is missing.
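
As a rough illustration of the gap described above, the following sketch computes how many transactions fall in the missing range. The class and method names are hypothetical. It assumes the usual zxid layout (epoch in the high 32 bits, counter in the low 32 bits) and that both reported values belong to the same epoch, so that counter arithmetic is meaningful. The two zxid values are the ones quoted in the comment.

```java
public final class ZxidGapSketch {

    /** High 32 bits of a zxid: the leader epoch. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits of a zxid: the per-epoch transaction counter. */
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /**
     * Number of zxids in (firstPeerLastZxid, secondPeerLastZxid]: transactions the
     * follower claims to have on its second reconnect although the first sync,
     * which should have delivered them, was interrupted.
     */
    static long skippedCount(long firstPeerLastZxid, long secondPeerLastZxid) {
        if (epoch(firstPeerLastZxid) != epoch(secondPeerLastZxid)) {
            throw new IllegalArgumentException("zxids are from different epochs");
        }
        return Math.max(0L, counter(secondPeerLastZxid) - counter(firstPeerLastZxid));
    }

    public static void main(String[] args) {
        long first = 0x13748L;   // peerLastZxid sent on the first reconnect
        long second = 0x1385cL;  // peerLastZxid sent on the second reconnect
        System.out.printf("skipped zxids: 0x%x .. 0x%x (%d transactions)%n",
                first + 1, second, skippedCount(first, second));
    }
}
```

If the follower's own log really contains nothing in that range, every one of those transactions (including any closeSession among them) would never be applied on that server.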





[jira] [Commented] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-09 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044030#comment-16044030
 ] 

JiangJiafu commented on ZOOKEEPER-2800:
---

I had a quick look at ZOOKEEPER-2355, and I am not sure these are the same problem.
But from the log I can see that zk1 (the problem node) did lose its connection to the 
leader while writing data, and then many transactions were lost too (including the 
closeSession transaction).



[jira] [Commented] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-09 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043987#comment-16043987
 ] 

JiangJiafu commented on ZOOKEEPER-2800:
---

In the recent environment, I found that zk3 (the leader) found the node had expired, 
and then zk2 and zk3 deleted the node, but the transaction was not applied on zk1!



[jira] [Commented] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-08 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043968#comment-16043968
 ] 

JiangJiafu commented on ZOOKEEPER-2800:
---

I think this must be a bug, because the problem happened again in my environment.



[jira] [Updated] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-07 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2800:
--
Attachment: zookeeper3.out

zookeeper log of ofs_zk3 



[jira] [Updated] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-07 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2800:
--
Attachment: zookeeper2.out



[jira] [Updated] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-07 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2800:
--
Attachment: zookeeper.out



[jira] [Updated] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-07 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2800:
--
Attachment: zoo.cfg



[jira] [Updated] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-05 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2800:
--
Description: 
I deployed a ZooKeeper cluster with three nodes:

ofs_zk1:30.0.0.72
ofs_zk2:30.0.0.73
ofs_zk3:30.0.0.99

On 2017-06-02, I used the C ZooKeeper client to create some ephemeral sequential nodes:
/adm_election/rolemgr/rolemgr08,
/adm_election/rolemgr/rolemgr11,
/adm_election/rolemgr/rolemgr12,

with session timeout 2 ms.

Then I restarted ofs_zk1 and ofs_zk2.


On 2017-06-05, I found that these ephemeral nodes still existed on ofs_zk1.
I can see the nodes with the zkCli.sh get command on ofs_zk1,
but these nodes do not exist on ofs_zk2 and ofs_zk3.
Isn't this odd?


I have uploaded the whole deployment directory of the three nodes to:
https://pan.baidu.com/s/1miohiCo
The log is written to log/zookeeper.out.

The log of ofs_zk3 is too large, so I only include the first 1000 lines.

Since I noticed this problem a little late, some snapshots and logs may have been deleted.
I hope someone can help find the cause.


  was:
I deployed a ZooKeeper cluster with three nodes:

ofs_zk1:30.0.0.72
ofs_zk2:30.0.0.73
ofs_zk3:30.0.0.99

On 2017-06-02, I used the C ZooKeeper client to create some ephemeral sequential nodes:
/adm_election/rolemgr/rolemgr08,
/adm_election/rolemgr/rolemgr11,
/adm_election/rolemgr/rolemgr12,

with session timeout 2 ms.

Then I restarted ofs_zk1 and ofs_zk2.


On 2017-06-05, I found that the ephemeral nodes still existed on ofs_zk1.
I can see the nodes with the zkCli.sh get command on ofs_zk1,
but these nodes do not exist on ofs_zk2 and ofs_zk3.
Isn't this odd?


I have uploaded the whole deployment directory of the three nodes to:
https://pan.baidu.com/s/1miohiCo
The log is written to log/zookeeper.out.

The log of ofs_zk3 is too large, so I only include the first 1000 lines.

Since I noticed this problem a little late, some snapshots and logs may have been deleted.
I hope someone can help find the cause.





[jira] [Created] (ZOOKEEPER-2800) zookeeper ephemeral node not deleted after server restart and consistency is not hold

2017-06-05 Thread JiangJiafu (JIRA)
JiangJiafu created ZOOKEEPER-2800:
-

 Summary: zookeeper ephemeral node not deleted after server restart 
and consistency is not hold
 Key: ZOOKEEPER-2800
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2800
 Project: ZooKeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.4.11
 Environment: Centos6.5 java8
Reporter: JiangJiafu
Priority: Critical


I deployed a ZooKeeper cluster with three nodes:

ofs_zk1:30.0.0.72
ofs_zk2:30.0.0.73
ofs_zk3:30.0.0.99

On 2017-06-02, I used the C ZooKeeper client to create some ephemeral sequential nodes:
/adm_election/rolemgr/rolemgr08,
/adm_election/rolemgr/rolemgr11,
/adm_election/rolemgr/rolemgr12,

with session timeout 2 ms.

Then I restarted ofs_zk1 and ofs_zk2.


On 2017-06-05, I found that the ephemeral nodes still existed on ofs_zk1.
I can see the nodes with the zkCli.sh get command on ofs_zk1,
but these nodes do not exist on ofs_zk2 and ofs_zk3.
Isn't this odd?


I have uploaded the whole deployment directory of the three nodes to:
https://pan.baidu.com/s/1miohiCo
The log is written to log/zookeeper.out.

The log of ofs_zk3 is too large, so I only include the first 1000 lines.

Since I noticed this problem a little late, some snapshots and logs may have been deleted.
I hope someone can help find the cause.






[jira] [Commented] (ZOOKEEPER-2691) recreateSocketAddresses may recreate the unreachable IP address

2017-05-21 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16018735#comment-16018735
 ] 

JiangJiafu commented on ZOOKEEPER-2691:
---

Can this patch be merged?

> recreateSocketAddresses may recreate the unreachable IP address
> ---
>
> Key: ZOOKEEPER-2691
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2691
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.5.0, 3.5.1, 3.5.2, 3.4.11
> Environment: Centos6.5
> Java8
> ZooKeeper3.4.8
>Reporter: JiangJiafu
>Priority: Minor
>
> QuorumPeer$QuorumServer.recreateSocketAddresses() is used to resolve the 
> hostname to a new IP address (InetAddress) when any exception happens on the 
> socket. It is very useful when a hostname can be resolved to more than 
> one IP address.
> The problem is that the Java API InetAddress.getByName(String hostname) 
> always returns the first IP address when the hostname can be resolved to more 
> than one IP address, and the first IP address may be unreachable forever. For 
> example, suppose a machine has two network interfaces, eth0 with ip1 and eth1 
> with ip2, and the mapping between the hostname and the IP addresses is 
> set in /etc/hosts. When I take eth0 down with the command "ifdown eth0", 
> InetAddress.getByName(String hostname) still returns ip1, which is 
> unreachable forever.
> So I think it would be better to check each IP address with 
> InetAddress.isReachable(int) and choose a reachable one. 
> I have modified the ZooKeeper source code and tested the new code in my own 
> environment, and it works very well when I take down some network 
> interfaces using the "ifdown" command.
> The original code is:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = InetAddress.getByName(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0) {
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0) {
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> {code}
> After my modification:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = getReachableAddress(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0) {
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0) {
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> public InetAddress getReachableAddress(String hostname) throws UnknownHostException {
>     InetAddress[] addresses = InetAddress.getAllByName(hostname);
>     for (InetAddress 
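
The quoted patch is cut off in this archive, so the reporter's actual loop is not shown. The following is only a sketch of the idea described in the issue (try every resolved address and prefer a reachable one), not the submitted patch. The class name, the pickReachable helper, the timeout parameter and the fallback to the first address are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Predicate;

public final class ReachableAddressSketch {

    /**
     * Returns the first candidate accepted by the reachability check.
     * Assumption: if none is reachable, fall back to the first candidate,
     * which matches what a plain getByName lookup would have returned.
     */
    static InetAddress pickReachable(InetAddress[] candidates, Predicate<InetAddress> reachable) {
        for (InetAddress candidate : candidates) {
            if (reachable.test(candidate)) {
                return candidate;
            }
        }
        return candidates[0];
    }

    /** Resolves all addresses for the hostname and probes each with isReachable. */
    static InetAddress getReachableAddress(String hostname, int timeoutMs)
            throws UnknownHostException {
        InetAddress[] addresses = InetAddress.getAllByName(hostname);
        return pickReachable(addresses, address -> {
            try {
                return address.isReachable(timeoutMs);
            } catch (IOException e) {
                return false; // treat a probe error as "not reachable"
            }
        });
    }

    /** Builds an address from a literal IPv4 quad without any DNS lookup. */
    static InetAddress ipv4(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e); // cannot happen for a 4-byte array
        }
    }

    public static void main(String[] args) {
        // Offline demonstration: pretend only the second address answers.
        InetAddress ip1 = ipv4(30, 0, 0, 72);
        InetAddress ip2 = ipv4(30, 0, 0, 73);
        InetAddress chosen = pickReachable(new InetAddress[] {ip1, ip2}, ip2::equals);
        System.out.println("chosen: " + chosen.getHostAddress());
    }
}
```

One caveat with this approach: isReachable typically relies on an ICMP echo or a TCP connection to the echo port, so a firewall can make a healthy peer look unreachable, and each probe can block for up to the timeout.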

[jira] [Created] (ZOOKEEPER-2788) The define of MAX_CONNECTION_ATTEMPTS in QuorumCnxManager.java seems useless, should it be removed?

2017-05-21 Thread JiangJiafu (JIRA)
JiangJiafu created ZOOKEEPER-2788:
-

 Summary: The define of MAX_CONNECTION_ATTEMPTS in 
QuorumCnxManager.java seems useless, should it be removed?
 Key: ZOOKEEPER-2788
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2788
 Project: ZooKeeper
  Issue Type: Improvement
  Components: leaderElection, quorum
Affects Versions: 3.4.10, 3.4.11
Reporter: JiangJiafu
Priority: Minor


The definition of MAX_CONNECTION_ATTEMPTS in QuorumCnxManager.java seems useless. 
Should it be removed?





[jira] [Commented] (ZOOKEEPER-2783) follower disconnects and cannot reconnect

2017-05-17 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16015134#comment-16015134
 ] 

JiangJiafu commented on ZOOKEEPER-2783:
---

I am not sure: is this problem the same as ZOOKEEPER-2701?

> follower disconnects and cannot reconnect
> -
>
> Key: ZOOKEEPER-2783
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2783
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.10
> Environment: centos 7, AWS EC2
>Reporter: Ben Sherman
> Attachments: fail3.log, fail5.log
>
>
> We have a 5 node cluster running 3.4.10 (we saw this in .8 and .9 as well), 
> and sometimes, a node gets a read timeout, drops all the connections and 
> tries to re-establish itself to the quorum.  It can usually do this in a few 
> seconds, but last night it took almost 15 minutes to reconnect.
> These are 5 servers in AWS, and we've tried tuning the timeouts, but they are 
> exceeding any reasonable timeout and still failing.
> In the attached logs, 5 is a follower, 3 is the leader.  5 loses connectivity 
> at 11:21:34.  3 sees the disconnect at the same moment.
> 5 tries to re-establish the quorum, but cannot do it until the connections to 
> the other servers expire at 11:37:02.  After the connections are 
> re-established, 5 connects immediately.
> At 11:41:08, the operator restarted the server, and it reconnected normally.
> I suspect there is a problem with stale connections to the rest of the quorum 
> - the other services on this box were fine (monitoring, puppet) and able to 
> establish new connections with no problems.
> I posed this problem to the zookeeper-users list and was asked to open a 
> ticket.





[jira] [Commented] (ZOOKEEPER-1167) C api lacks synchronous version of sync() call.

2017-05-16 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16013402#comment-16013402
 ] 

JiangJiafu commented on ZOOKEEPER-1167:
---

I have read all the comments above, but I don't get the point. 
In what kind of scenario will this bug cause a problem? It seems that this bug 
is not going to be fixed in the 3.4.x line. Why?

> C api lacks synchronous version of sync() call.
> ---
>
> Key: ZOOKEEPER-1167
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1167
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: c client
>Affects Versions: 3.3.3, 3.4.3, 3.5.0
>Reporter: Nicholas Harteau
>Assignee: Marshall McMullen
> Fix For: 3.5.4, 3.6.0
>
> Attachments: ZOOKEEPER-1167.patch
>
>
> Reading through the source, the C API implements zoo_async() which is the 
> zookeeper sync() method implemented in the multithreaded/asynchronous C API.  
> It doesn't implement anything equivalent in the non-multithreaded API.
> I'm not sure if this was oversight or intentional, but it means that the 
> non-multithreaded API can't guarantee consistent client views on critical 
> reads.
> The zkperl bindings depend on the synchronous, non-multithreaded API so also 
> can't call sync() currently.





[jira] [Updated] (ZOOKEEPER-2691) recreateSocketAddresses may recreate the unreachable IP address

2017-05-16 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2691:
--
Affects Version/s: 3.4.11

> recreateSocketAddresses may recreate the unreachable IP address
> ---
>
> Key: ZOOKEEPER-2691
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2691
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.5.0, 3.5.1, 3.5.2, 3.4.11
> Environment: Centos6.5
> Java8
> ZooKeeper3.4.8
>Reporter: JiangJiafu
>Priority: Minor
>
> QuorumPeer$QuorumServer.recreateSocketAddresses() is used to resolve the 
> hostname to a new IP address (InetAddress) when any exception happens on the 
> socket. This is very useful when a hostname can resolve to more than one 
> IP address.
> The problem is that the Java API InetAddress.getByName(String hostname) 
> always returns the first IP address when the hostname resolves to more than 
> one, and that first IP address may be unreachable forever. For example, 
> suppose a machine has two network interfaces, eth0 with ip1 and eth1 with 
> ip2, and the mapping between the hostname and both IP addresses is set in 
> /etc/hosts. When I take eth0 down with the command "ifdown eth0", 
> InetAddress.getByName(String hostname) still returns ip1, which stays 
> unreachable.
> So I think it would be better to check each IP address with 
> InetAddress.isReachable(int) and choose a reachable one. 
> I have modified the ZooKeeper source code and tested the new code in my own 
> environment, and it works well when I take down some network interfaces 
> with the "ifdown" command.
> The original code is:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = InetAddress.getByName(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> {code}
> After my modification:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = getReachableAddress(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> public InetAddress getReachableAddress(String hostname) throws 
> UnknownHostException {
> InetAddress[] addresses = InetAddress.getAllByName(hostname);
> for (InetAddress a : addresses) {
> 

[jira] [Updated] (ZOOKEEPER-2701) Timeout for RecvWorker is too long

2017-05-12 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2701:
--
Affects Version/s: 3.4.9
   3.4.10

> Timeout for RecvWorker is too long
> --
>
> Key: ZOOKEEPER-2701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.8, 3.4.9, 3.4.10
> Environment: Centos6.5
> ZooKeeper 3.4.8
>Reporter: JiangJiafu
>Priority: Minor
>
> Environment:
> I deployed ZooKeeper on a cluster of three nodes. Each node has three network 
> interfaces (eth0, eth1, eth2).
> Hostnames are used instead of IP addresses in zoo.cfg, and 
> quorumListenOnAllIPs=true.
> Problem:
> I started three ZooKeeper servers (node A, node B, and node C) one by one. 
> When the leader election finished, node B was the leader. 
> Then I shut down one network interface of node A with the command "ifdown 
> eth0". The ZooKeeper server on node A lost its connections to node B and 
> node C. In my test, it took about 20 minutes for the ZooKeeper server on 
> node A to notice the failure and call 
> QuorumServer.recreateSocketAddresses to re-resolve the hostname.
> I read the source code and found the following:
> {code:title=QuorumCnxManager.java:|borderStyle=solid}
> class RecvWorker extends ZooKeeperThread {
>     Long sid;
>     Socket sock;
>     volatile boolean running = true;
>     final DataInputStream din;
>     final SendWorker sw;
>
>     RecvWorker(Socket sock, DataInputStream din, Long sid, SendWorker sw) {
>         super("RecvWorker:" + sid);
>         this.sid = sid;
>         this.sock = sock;
>         this.sw = sw;
>         this.din = din;
>         try {
>             // OK to wait until socket disconnects while reading.
>             sock.setSoTimeout(0);
>         } catch (IOException e) {
>             LOG.error("Error while accessing socket for " + sid, e);
>             closeSocket(sock);
>             running = false;
>         }
>     }
>     ...
> }
> {code}
> I notice that the socket timeout (SO_TIMEOUT) is set to 0 in the RecvWorker 
> constructor, which means a read blocks indefinitely. I think this is 
> reasonable when the IP address of a ZooKeeper server never changes, but 
> since the IP address of a ZooKeeper server may change, it may be better to 
> set a finite timeout here.
> I am not sure whether this is really a problem. 





[jira] [Commented] (ZOOKEEPER-2774) Ephemeral znode will not be removed when sesstion timeout, if the system time of ZooKeeper node changes unexpectedly.

2017-05-11 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16007441#comment-16007441
 ] 

JiangJiafu commented on ZOOKEEPER-2774:
---

OK.

> Ephemeral znode will not be removed when sesstion timeout, if the system time 
> of ZooKeeper node changes unexpectedly.
> -
>
> Key: ZOOKEEPER-2774
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2774
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.8, 3.4.9, 3.4.10
> Environment: Centos6.5
>Reporter: JiangJiafu
>
> 1. Deploy a ZooKeeper cluster with one node.
> 2. Create an ephemeral znode.
> 3. Set the system time of the ZooKeeper node back to an earlier point.
> 4. Disconnect the client from the ZooKeeper server.
> The ephemeral znode then continues to exist for a long time, even after the 
> session timeout has passed.
> I read the ZooKeeper source code and found this code in 
> SessionTrackerImpl.java:
> {code:title=SessionTrackerImpl.java|borderStyle=solid}
> @Override
> synchronized public void run() {
>     try {
>         while (running) {
>             currentTime = System.currentTimeMillis();
>             if (nextExpirationTime > currentTime) {
>                 this.wait(nextExpirationTime - currentTime);
>                 continue;
>             }
>             SessionSet set;
>             set = sessionSets.remove(nextExpirationTime);
>             if (set != null) {
>                 for (SessionImpl s : set.sessions) {
>                     setSessionClosing(s.sessionId);
>                     expirer.expire(s);
>                 }
>             }
>             nextExpirationTime += expirationInterval;
>         }
>     } catch (InterruptedException e) {
>         handleException(this.getName(), e);
>     }
>     LOG.info("SessionTrackerImpl exited loop!");
> }
> {code}
> I think it may be better to use System.nanoTime() rather than 
> System.currentTimeMillis(), because the latter can be changed manually or 
> automatically by an NTP client. 





[jira] [Updated] (ZOOKEEPER-2691) recreateSocketAddresses may recreate the unreachable IP address

2017-05-11 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2691:
--
Affects Version/s: 3.4.9
   3.4.10
   3.5.0
   3.5.1
   3.5.2

> recreateSocketAddresses may recreate the unreachable IP address
> ---
>
> Key: ZOOKEEPER-2691
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2691
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.5.0, 3.5.1, 3.5.2
> Environment: Centos6.5
> Java8
> ZooKeeper3.4.8
>Reporter: JiangJiafu
>Priority: Minor
>
> QuorumPeer$QuorumServer.recreateSocketAddresses() is used to resolve the 
> hostname to a new IP address (InetAddress) when any exception happens on the 
> socket. This is very useful when a hostname can resolve to more than one 
> IP address.
> The problem is that the Java API InetAddress.getByName(String hostname) 
> always returns the first IP address when the hostname resolves to more than 
> one, and that first IP address may be unreachable forever. For example, 
> suppose a machine has two network interfaces, eth0 with ip1 and eth1 with 
> ip2, and the mapping between the hostname and both IP addresses is set in 
> /etc/hosts. When I take eth0 down with the command "ifdown eth0", 
> InetAddress.getByName(String hostname) still returns ip1, which stays 
> unreachable.
> So I think it would be better to check each IP address with 
> InetAddress.isReachable(int) and choose a reachable one. 
> I have modified the ZooKeeper source code and tested the new code in my own 
> environment, and it works well when I take down some network interfaces 
> with the "ifdown" command.
> The original code is:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = InetAddress.getByName(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> {code}
> After my modification:
> {code:title=QuorumPeer.java|borderStyle=solid}
> public void recreateSocketAddresses() {
>     InetAddress address = null;
>     try {
>         address = getReachableAddress(this.hostname);
>         LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
>         this.addr = new InetSocketAddress(address, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = new InetSocketAddress(address, this.electionPort);
>         }
>     } catch (UnknownHostException ex) {
>         LOG.warn("Failed to resolve address: {}", this.hostname, ex);
>         // Have we succeeded in the past?
>         if (this.addr != null) {
>             // Yes, previously the lookup succeeded. Leave things as they are
>             return;
>         }
>         // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
>         this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
>         if (this.electionPort > 0){
>             this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
>                                                                    this.electionPort);
>         }
>     }
> }
> public InetAddress getReachableAddress(String hostname) throws 
> UnknownHostException {
> InetAddress[] 

[jira] [Commented] (ZOOKEEPER-2774) Ephemeral znode will not be removed when sesstion timeout, if the system time of ZooKeeper node changes unexpectedly.

2017-05-10 Thread JiangJiafu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005821#comment-16005821
 ] 

JiangJiafu commented on ZOOKEEPER-2774:
---

Is this problem planned to be fixed in 3.4.X?

> Ephemeral znode will not be removed when sesstion timeout, if the system time 
> of ZooKeeper node changes unexpectedly.
> -
>
> Key: ZOOKEEPER-2774
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2774
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.8, 3.4.9, 3.4.10
> Environment: Centos6.5
>Reporter: JiangJiafu
>
> 1. Deploy a ZooKeeper cluster with one node.
> 2. Create an ephemeral znode.
> 3. Set the system time of the ZooKeeper node back to an earlier point.
> 4. Disconnect the client from the ZooKeeper server.
> The ephemeral znode then continues to exist for a long time, even after the 
> session timeout has passed.
> I read the ZooKeeper source code and found this code in 
> SessionTrackerImpl.java:
> {code:title=SessionTrackerImpl.java|borderStyle=solid}
> @Override
> synchronized public void run() {
>     try {
>         while (running) {
>             currentTime = System.currentTimeMillis();
>             if (nextExpirationTime > currentTime) {
>                 this.wait(nextExpirationTime - currentTime);
>                 continue;
>             }
>             SessionSet set;
>             set = sessionSets.remove(nextExpirationTime);
>             if (set != null) {
>                 for (SessionImpl s : set.sessions) {
>                     setSessionClosing(s.sessionId);
>                     expirer.expire(s);
>                 }
>             }
>             nextExpirationTime += expirationInterval;
>         }
>     } catch (InterruptedException e) {
>         handleException(this.getName(), e);
>     }
>     LOG.info("SessionTrackerImpl exited loop!");
> }
> {code}
> I think it may be better to use System.nanoTime() rather than 
> System.currentTimeMillis(), because the latter can be changed manually or 
> automatically by an NTP client. 
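
To make the effect of a backward clock step concrete, here is a small worked sketch of the wait computed by the quoted loop. The class name ExpiryWaitDemo and all numeric values are hypothetical and chosen only for illustration; the method mirrors the expression `nextExpirationTime - currentTime` and nothing else from ZooKeeper.

```java
// Hypothetical illustration; not ZooKeeper code.
final class ExpiryWaitDemo {
    private ExpiryWaitDemo() {}

    // Mirrors the quoted loop: wait only while the deadline lies in the future.
    static long waitMillis(long nextExpirationTime, long currentTime) {
        return nextExpirationTime > currentTime
                ? nextExpirationTime - currentTime
                : 0L;
    }

    public static void main(String[] args) {
        long nextExpirationTime = 1_000_000L;      // assumed wall-clock deadline (ms)
        long beforeStep = 999_000L;                // one second before the deadline
        long afterStep = beforeStep - 3_600_000L;  // clock set back by one hour

        // With a steady clock the loop sleeps for 1000 ms.
        System.out.println(waitMillis(nextExpirationTime, beforeStep));
        // After the step the same deadline is 3601000 ms away.
        System.out.println(waitMillis(nextExpirationTime, afterStep));
    }
}
```

Under these assumed numbers, a one-hour backward step turns a one-second wait into roughly an hour, which matches the reported symptom of the ephemeral znode outliving its session timeout.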





[jira] [Created] (ZOOKEEPER-2774) Ephemeral znode will not be removed when sesstion timeout, if the system time of ZooKeeper node changes unexpectedly.

2017-05-03 Thread JiangJiafu (JIRA)
JiangJiafu created ZOOKEEPER-2774:
-

 Summary: Ephemeral znode will not be removed when sesstion 
timeout, if the system time of ZooKeeper node changes unexpectedly.
 Key: ZOOKEEPER-2774
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2774
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.10, 3.4.9, 3.4.8
 Environment: Centos6.5
Reporter: JiangJiafu


1. Deploy a ZooKeeper cluster with one node.
2. Create an ephemeral znode.
3. Set the system time of the ZooKeeper node back to an earlier point.
4. Disconnect the client from the ZooKeeper server.

The ephemeral znode then continues to exist for a long time, even after the 
session timeout has passed.

I read the ZooKeeper source code and found this code in 
SessionTrackerImpl.java:
{code:title=SessionTrackerImpl.java|borderStyle=solid}
@Override
synchronized public void run() {
    try {
        while (running) {
            currentTime = System.currentTimeMillis();
            if (nextExpirationTime > currentTime) {
                this.wait(nextExpirationTime - currentTime);
                continue;
            }
            SessionSet set;
            set = sessionSets.remove(nextExpirationTime);
            if (set != null) {
                for (SessionImpl s : set.sessions) {
                    setSessionClosing(s.sessionId);
                    expirer.expire(s);
                }
            }
            nextExpirationTime += expirationInterval;
        }
    } catch (InterruptedException e) {
        handleException(this.getName(), e);
    }
    LOG.info("SessionTrackerImpl exited loop!");
}
{code}

I think it may be better to use System.nanoTime() rather than 
System.currentTimeMillis(), because the latter can be changed manually or 
automatically by an NTP client. 
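
A minimal sketch of what the System.nanoTime() suggestion could look like. The helper class MonotonicTime and its method names are hypothetical and are not taken from the code quoted above; the sketch only shows how a millisecond reading can be derived from a clock that is unaffected by wall-clock changes.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of the SessionTrackerImpl code quoted above.
final class MonotonicTime {
    private MonotonicTime() {}

    // Milliseconds measured from an arbitrary fixed origin. Only the difference
    // between two readings is meaningful; the value is unrelated to the date.
    static long currentElapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    // Remaining time before a deadline taken on the same monotonic scale,
    // never negative.
    static long millisUntil(long deadlineMillis, long nowMillis) {
        return Math.max(0L, deadlineMillis - nowMillis);
    }
}
```

With such a helper, both the deadline and the current time would be read from the same monotonic source, so setting the system clock backwards would not lengthen the computed wait. Deadlines built this way cannot be compared with wall-clock timestamps.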





[jira] [Created] (ZOOKEEPER-2701) Timeout for RecvWorker is too long

2017-02-19 Thread JiangJiafu (JIRA)
JiangJiafu created ZOOKEEPER-2701:
-

 Summary: Timeout for RecvWorker is too long
 Key: ZOOKEEPER-2701
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.8
 Environment: Centos6.5
ZooKeeper 3.4.8
Reporter: JiangJiafu
Priority: Minor


Environment:
I deployed ZooKeeper on a cluster of three nodes. Each node has three network 
interfaces (eth0, eth1, eth2).

Hostnames are used instead of IP addresses in zoo.cfg, and 
quorumListenOnAllIPs=true.

Problem:
I started three ZooKeeper servers (node A, node B, and node C) one by one. 
When the leader election finished, node B was the leader. 
Then I shut down one network interface of node A with the command "ifdown 
eth0". The ZooKeeper server on node A lost its connections to node B and 
node C. In my test, it took about 20 minutes for the ZooKeeper server on 
node A to notice the failure and call QuorumServer.recreateSocketAddresses 
to re-resolve the hostname.

I read the source code and found the following:

{code:title=QuorumCnxManager.java:|borderStyle=solid}
class RecvWorker extends ZooKeeperThread {
    Long sid;
    Socket sock;
    volatile boolean running = true;
    final DataInputStream din;
    final SendWorker sw;

    RecvWorker(Socket sock, DataInputStream din, Long sid, SendWorker sw) {
        super("RecvWorker:" + sid);
        this.sid = sid;
        this.sock = sock;
        this.sw = sw;
        this.din = din;
        try {
            // OK to wait until socket disconnects while reading.
            sock.setSoTimeout(0);
        } catch (IOException e) {
            LOG.error("Error while accessing socket for " + sid, e);
            closeSocket(sock);
            running = false;
        }
    }
    ...
}
{code}


I notice that the socket timeout (SO_TIMEOUT) is set to 0 in the RecvWorker 
constructor, which means a read blocks indefinitely. I think this is 
reasonable when the IP address of a ZooKeeper server never changes, but 
since the IP address of a ZooKeeper server may change, it may be better to 
set a finite timeout here.

I am not sure whether this is really a problem. 
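
As a rough illustration of the difference a finite timeout makes, the sketch below opens a loopback connection whose peer never writes and shows that a blocking read then fails with SocketTimeoutException instead of waiting forever. The class name ReadTimeoutDemo and the 200 ms value are assumptions for the demo; the issue does not say what timeout would suit quorum connections.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo; not ZooKeeper code.
final class ReadTimeoutDemo {
    private ReadTimeoutDemo() {}

    // Returns true if a read from a silent peer gives up after timeoutMillis.
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // A value of 0 would mean "block until data arrives or the socket closes".
            client.setSoTimeout(timeoutMillis);
            DataInputStream din = new DataInputStream(client.getInputStream());
            try {
                din.readInt();   // the accepted side never writes anything
                return false;
            } catch (SocketTimeoutException e) {
                return true;     // the read was abandoned after the timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

A receiver that sets a finite timeout has to decide what to do when it fires, for example retry the read or close the socket and reconnect; the sketch leaves that choice open.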







[jira] [Updated] (ZOOKEEPER-2691) recreateSocketAddresses may recreate the unreachable IP address

2017-02-10 Thread JiangJiafu (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangJiafu updated ZOOKEEPER-2691:
--
Description: 
QuorumPeer$QuorumServer.recreateSocketAddresses() is used to resolve the 
hostname to a new IP address (InetAddress) when any exception happens on the 
socket. This is very useful when a hostname can resolve to more than one 
IP address.
The problem is that the Java API InetAddress.getByName(String hostname) always 
returns the first IP address when the hostname resolves to more than one, and 
that first IP address may be unreachable forever. For example, suppose a 
machine has two network interfaces, eth0 with ip1 and eth1 with ip2, and the 
mapping between the hostname and both IP addresses is set in /etc/hosts. When 
I take eth0 down with the command "ifdown eth0", 
InetAddress.getByName(String hostname) still returns ip1, which stays 
unreachable.

So I think it would be better to check each IP address with 
InetAddress.isReachable(int) and choose a reachable one. 


I have modified the ZooKeeper source code and tested the new code in my own 
environment, and it works well when I take down some network interfaces 
with the "ifdown" command.

The original code is:
{code:title=QuorumPeer.java|borderStyle=solid}
public void recreateSocketAddresses() {
    InetAddress address = null;
    try {
        address = InetAddress.getByName(this.hostname);
        LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
        this.addr = new InetSocketAddress(address, this.port);
        if (this.electionPort > 0){
            this.electionAddr = new InetSocketAddress(address, this.electionPort);
        }
    } catch (UnknownHostException ex) {
        LOG.warn("Failed to resolve address: {}", this.hostname, ex);
        // Have we succeeded in the past?
        if (this.addr != null) {
            // Yes, previously the lookup succeeded. Leave things as they are
            return;
        }
        // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
        this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
        if (this.electionPort > 0){
            this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
                                                                   this.electionPort);
        }
    }
}
{code}

After my modification:
{code:title=QuorumPeer.java|borderStyle=solid}
public void recreateSocketAddresses() {
    InetAddress address = null;
    try {
        address = getReachableAddress(this.hostname);
        LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
        this.addr = new InetSocketAddress(address, this.port);
        if (this.electionPort > 0){
            this.electionAddr = new InetSocketAddress(address, this.electionPort);
        }
    } catch (UnknownHostException ex) {
        LOG.warn("Failed to resolve address: {}", this.hostname, ex);
        // Have we succeeded in the past?
        if (this.addr != null) {
            // Yes, previously the lookup succeeded. Leave things as they are
            return;
        }
        // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
        this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
        if (this.electionPort > 0){
            this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
                                                                   this.electionPort);
        }
    }
}

public InetAddress getReachableAddress(String hostname) throws UnknownHostException {
    InetAddress[] addresses = InetAddress.getAllByName(hostname);
    for (InetAddress a : addresses) {
        try {
            if (a.isReachable(5000)) {
                return a;
            }
        } catch (IOException e) {
            LOG.warn("IP address {} is unreachable", a);
        }
    }
    // All the IP addresses are unreachable; just return the first one.
    return addresses[0];
}
{code}

  was:
The QuorumPeer$QuorumServer.recreateSocketAddress()  is used to resolved the 
hostname to a new IP address(InetAddress) when any exception happens to the 
socket. It will be very useful when a hostname can be resolved to more than one 
IP address.
But the problem is Java API InetAddress.getByName(String 

[jira] [Created] (ZOOKEEPER-2691) recreateSocketAddresses may recreate the unreachable IP address

2017-02-10 Thread JiangJiafu (JIRA)
JiangJiafu created ZOOKEEPER-2691:
-

 Summary: recreateSocketAddresses may recreate the unreachable IP 
address
 Key: ZOOKEEPER-2691
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2691
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.8
 Environment: Centos6.5
Java8
ZooKeeper3.4.8
Reporter: JiangJiafu
Priority: Minor


QuorumPeer$QuorumServer.recreateSocketAddresses() is used to resolve the 
hostname to a new IP address (InetAddress) when any exception happens on the 
socket. This is very useful when a hostname can resolve to more than one 
IP address.
The problem is that the Java API InetAddress.getByName(String hostname) always 
returns the first IP address when the hostname resolves to more than one, and 
that first IP address may be unreachable forever. So I think it would be 
better to check each IP address with InetAddress.isReachable(int) and choose 
a reachable one. 
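
To inspect every address a hostname maps to, rather than only the one InetAddress.getByName returns, a small sketch using InetAddress.getAllByName can help. The class name ResolveAllDemo is hypothetical, and "localhost" is only a default for the demo; on a multi-homed machine one would pass the hostname configured in zoo.cfg.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical demo; not ZooKeeper code.
final class ResolveAllDemo {
    private ResolveAllDemo() {}

    // All addresses the resolver returns for the hostname, in resolver order.
    static InetAddress[] resolveAll(String hostname) {
        try {
            return InetAddress.getAllByName(hostname);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot resolve " + hostname, e);
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        for (InetAddress a : resolveAll(host)) {
            System.out.println(a.getHostAddress());
        }
    }
}
```

The printed list depends on the local resolver configuration, such as the entries in /etc/hosts.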

I have modified the ZooKeeper source code and tested the new code in my own 
environment, and it works well when I take down some network interfaces 
with the "ifdown" command.

The original code is:
{quote}
public void recreateSocketAddresses() {
    InetAddress address = null;
    try {
        address = InetAddress.getByName(this.hostname);
        LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
        this.addr = new InetSocketAddress(address, this.port);
        if (this.electionPort > 0){
            this.electionAddr = new InetSocketAddress(address, this.electionPort);
        }
    } catch (UnknownHostException ex) {
        LOG.warn("Failed to resolve address: {}", this.hostname, ex);
        // Have we succeeded in the past?
        if (this.addr != null) {
            // Yes, previously the lookup succeeded. Leave things as they are
            return;
        }
        // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
        this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
        if (this.electionPort > 0){
            this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
                                                                   this.electionPort);
        }
    }
}
{quote}

After my modification:
{quote}
public void recreateSocketAddresses() {
    InetAddress address = null;
    try {
        address = getReachableAddress(this.hostname);
        LOG.info("Resolved hostname: {} to address: {}", this.hostname, address);
        this.addr = new InetSocketAddress(address, this.port);
        if (this.electionPort > 0){
            this.electionAddr = new InetSocketAddress(address, this.electionPort);
        }
    } catch (UnknownHostException ex) {
        LOG.warn("Failed to resolve address: {}", this.hostname, ex);
        // Have we succeeded in the past?
        if (this.addr != null) {
            // Yes, previously the lookup succeeded. Leave things as they are
            return;
        }
        // The hostname has never resolved. Create our InetSocketAddress(es) as unresolved
        this.addr = InetSocketAddress.createUnresolved(this.hostname, this.port);
        if (this.electionPort > 0){
            this.electionAddr = InetSocketAddress.createUnresolved(this.hostname,
                                                                   this.electionPort);
        }
    }
}

public InetAddress getReachableAddress(String hostname) throws UnknownHostException {
    InetAddress[] addresses = InetAddress.getAllByName(hostname);
    for (InetAddress a : addresses) {
        try {
            if (a.isReachable(5000)) {
                return a;
            }
        } catch (IOException e) {
            LOG.warn("IP address {} is unreachable", a);
        }
    }
    // All the IP addresses are unreachable; just return the first one.
    return addresses[0];
}
{quote}
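
One caveat with isReachable: according to its Javadoc, a typical implementation uses ICMP echo requests when it has the privilege and otherwise tries a TCP connection to port 7, so a positive answer does not show that the quorum port itself accepts connections. The sketch below is a different technique from the patch above, offered only as an illustration: it probes the target port directly with a bounded TCP connect. The class PortProbe and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical alternative to isReachable(); not the patch proposed above.
final class PortProbe {
    private PortProbe() {}

    // True if a TCP connection to address:port succeeds within timeoutMillis.
    static boolean canConnect(InetAddress address, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(address, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // First address that accepts a connection on the port. Falls back to the
    // first address, the same fallback getReachableAddress uses.
    static InetAddress firstConnectable(InetAddress[] addresses, int port, int timeoutMillis) {
        for (InetAddress a : addresses) {
            if (canConnect(a, port, timeoutMillis)) {
                return a;
            }
        }
        return addresses[0];
    }

    // Usage demo: probe a listener opened on the loopback interface.
    static boolean probeLocalListener() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return canConnect(server.getInetAddress(), server.getLocalPort(), 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("loopback listener connectable: " + probeLocalListener());
    }
}
```

A port probe opens a real connection to the peer, so whether the remote side tolerates such short-lived connections is something to check before using this approach.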


