[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499893#comment-13499893 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1477:
---------------------------------------------------

FWIW I have the same issue with zkCli on Fedora 18 / JDK 1.7.0_09-icedtea.

Test failures with Java 7 on Mac OS X
-------------------------------------

Key: ZOOKEEPER-1477
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1477
Project: ZooKeeper
Issue Type: Bug
Components: server, tests
Affects Versions: 3.4.3
Environment: Mac OS X Lion (10.7.4)
  Java version:
  java version 1.7.0_04
  Java(TM) SE Runtime Environment (build 1.7.0_04-b21)
  Java HotSpot(TM) 64-Bit Server VM (build 23.0-b21, mixed mode)
Reporter: Diwaker Gupta
Fix For: 3.4.6
Attachments: with-ZK-1550.txt

I downloaded ZK 3.4.3 sources and ran {{ant test}}. Many of the tests failed, including ZooKeeperTest. A common symptom was spurious {{ConnectionLossException}}:

{code}
2012-06-01 12:01:23,420 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@54] - TEST METHOD FAILED testDeleteRecursiveAsync
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1246)
    at org.apache.zookeeper.ZooKeeperTest.testDeleteRecursiveAsync(ZooKeeperTest.java:77)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ... (snipped)
{code}

As background, I was actually investigating some non-deterministic failures when using Netflix's Curator with Java 7 (see https://github.com/Netflix/curator/issues/79). After a while, I figured I should establish a clean ZK baseline first and realized it is actually a ZK issue, not a Curator issue.

We are trying to migrate to Java 7 but this is a blocking issue for us right now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (ZOOKEEPER-1477) Test failures with Java 7 on Mac OS X
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499895#comment-13499895 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1477:
---------------------------------------------------

(On 3.4.4)
[jira] [Created] (ZOOKEEPER-1585) make dist for src/c broken in trunk
Raul Gutierrez Segales created ZOOKEEPER-1585:
-------------------------------------------------

Summary: make dist for src/c broken in trunk
Key: ZOOKEEPER-1585
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1585
Project: ZooKeeper
Issue Type: Bug
Components: c client
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
[jira] [Updated] (ZOOKEEPER-1585) make dist for src/c broken in trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1585:
-------------------------------------------------

Description: make dist from trunk is failing because of a wrong reference to src/zookeeper_log.h (which exists in include/).
[jira] [Updated] (ZOOKEEPER-1585) make dist for src/c broken in trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1585:
-------------------------------------------------

Attachment: 0001-ZOOKEEPER-1585.-make-dist-for-src-c-broken-in-trunk.patch
[jira] [Updated] (ZOOKEEPER-1585) make dist for src/c broken in trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1585:
-------------------------------------------------

Attachment: ZOOKEEPER-1585.patch

sorry, previous patch was created via git diff which obviously doesn't work. This one is created via git-svn-diff which *should* work.
[jira] [Updated] (ZOOKEEPER-1585) make dist for src/c broken in trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1585:
-------------------------------------------------

Attachment: (was: 0001-ZOOKEEPER-1585.-make-dist-for-src-c-broken-in-trunk.patch)
[jira] [Created] (ZOOKEEPER-1586) tarballs for zkfuse don't compile out of tree
Raul Gutierrez Segales created ZOOKEEPER-1586:
-------------------------------------------------

Summary: tarballs for zkfuse don't compile out of tree
Key: ZOOKEEPER-1586
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1586
Project: ZooKeeper
Issue Type: Bug
Components: contrib-zkfuse
Affects Versions: 3.5.0
Reporter: Raul Gutierrez Segales
[jira] [Updated] (ZOOKEEPER-1586) tarballs for zkfuse don't compile out of tree
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1586:
-------------------------------------------------

Attachment: ZOOKEEPER-1586.patch
[jira] [Commented] (ZOOKEEPER-1519) Zookeeper Async calls can reference free()'d memory
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587340#comment-13587340 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1519:
---------------------------------------------------

Does sizeof *(void *) work?

Zookeeper Async calls can reference free()'d memory
---------------------------------------------------

Key: ZOOKEEPER-1519
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1519
Project: ZooKeeper
Issue Type: Bug
Components: c client
Affects Versions: 3.3.3, 3.3.6
Environment: Ubuntu 11.10, Ubuntu packaged Zookeeper 3.3.3 with some backported fixes.
Reporter: Mark Gius
Attachments: zookeeper-1519.patch

zoo_acreate() and zoo_aset() take a char * argument for data and prepare a call to zookeeper. This char * doesn't seem to be duplicated at any point, making it possible that the caller of the asynchronous function might potentially free() the char * argument before the zookeeper library completes its request. This is unlikely to present a real problem unless the freed memory is re-used before zookeeper consumes it. I've been unable to reproduce this issue using pure C as a result.

However, ZKPython is a whole different story. Consider this snippet:

    ok = zookeeper.acreate(handle, path, json.dumps(value), acl, flags, callback)
    assert ok == zookeeper.OK

In this snippet, json.dumps() allocates a string which is passed into the acreate(). When acreate() returns, the zookeeper request has been constructed with a pointer to the string allocated by json.dumps(). Also when acreate() returns, that string is now referenced by 0 things (ZKPython doesn't bump the refcount) and the string is eligible for garbage collection and re-use. The Zookeeper request now has a pointer to dangerous freed memory.

I've been seeing odd behavior in our development environments for some time now, where it appeared as though two separate JSON payloads had been joined together. Python has been allocating a new JSON string in the middle of the old string that an incomplete zookeeper async call had not yet processed.

I am not sure if this is a behavior that should be documented, or if the C binding implementation needs to be updated to create copies of the data payload provided for aset and acreate.
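One possible caller-side workaround for the lifetime problem described in the report is to hold a strong reference to the payload until the completion callback has run. The sketch below is only an illustration and is not ZKPython code: `PayloadKeeper` is a hypothetical helper, the async function is passed in as a parameter (using the argument order from the snippet in the report) so the example runs without a ZooKeeper server, and the `ok=0` default assumes the binding reports success as 0. Copying the buffer inside the C binding would likely be the more robust fix.

```python
import itertools
import json


class PayloadKeeper:
    """Keeps payload strings alive until their async completion fires.

    Hypothetical helper for illustration; not part of ZKPython.
    """

    def __init__(self, acreate, ok=0):
        # acreate: callable(handle, path, data, acl, flags, callback) -> int
        self._acreate = acreate
        self._ok = ok  # assumed success return code
        self._pending = {}
        self._ids = itertools.count()

    def create(self, handle, path, value, acl, flags, callback):
        data = json.dumps(value)
        token = next(self._ids)
        # Strong reference: the string cannot be collected while pending.
        self._pending[token] = data

        def wrapped(*args):
            try:
                callback(*args)
            finally:
                # Release only after the library has reported completion.
                self._pending.pop(token, None)

        rc = self._acreate(handle, path, data, acl, flags, wrapped)
        if rc != self._ok:
            # The request was never queued, so no callback will arrive.
            self._pending.pop(token, None)
        return rc

    def pending_count(self):
        return len(self._pending)
```

With the real binding one would construct it as `PayloadKeeper(zookeeper.acreate)` and call `create()` in place of the direct `acreate()` call; that wiring is untested here.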
[jira] [Commented] (ZOOKEEPER-1519) Zookeeper Async calls can reference free()'d memory
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587351#comment-13587351 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1519:
---------------------------------------------------

Don't think so: http://fpaste.org/iwjf/
[jira] [Commented] (ZOOKEEPER-1552) Enable sync request processor in Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591454#comment-13591454 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1552:
---------------------------------------------------

Small nit: typo in patch (recieved as INFORM packet).

Enable sync request processor in Observer
-----------------------------------------

Key: ZOOKEEPER-1552
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1552
Project: ZooKeeper
Issue Type: Improvement
Components: quorum, server
Affects Versions: 3.4.3
Reporter: Thawan Kooburat
Assignee: Thawan Kooburat
Fix For: 3.5.0
Attachments: ZOOKEEPER-1552.patch, ZOOKEEPER-1552.patch

Observer doesn't forward its txns to SyncRequestProcessor, so it never persists the txns onto disk or periodically creates snapshots. This increases the start-up time, since it will get the entire snapshot if the observer has been running for a long time.
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13684025#comment-13684025 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1147:
---------------------------------------------------

[~thawan]: how is this meant to work with a rolling update when enabling local sessions? If the leader doesn't have local sessions enabled then all writes from local sessions will fail with SessionExpired (because they'll be unknown to the leader) - right?

The only way I could get a rolling update to work is with (something like) this:

http://www.itevenworks.net/~rgs/patches/0001-Add-support-to-enable-disable-sessions-validations.patch

I.e.: adding a way to temporarily disable session validation whilst you are enabling local sessions on the cluster. We should add some documentation about the right way to do this. Thoughts? It would be nice to get this merged.

Add support for local sessions
------------------------------

Key: ZOOKEEPER-1147
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1147
Project: ZooKeeper
Issue Type: Improvement
Components: server
Affects Versions: 3.3.3
Reporter: Vishal Kathuria
Assignee: Thawan Kooburat
Labels: api-change, scaling
Fix For: 3.5.0
Attachments: ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch, ZOOKEEPER-1147.patch
Original Estimate: 840h
Remaining Estimate: 840h

This improvement is in the bucket of making ZooKeeper work at a large scale. We are planning on having about 1 million clients connect to a ZooKeeper ensemble through a set of 50-100 observers. The majority of these clients are read-only, i.e. they do not do any updates or create ephemeral nodes.

In ZooKeeper today, the client creates a session and the session creation is handled like any other update. In the above use case, the session create/drop workload can easily overwhelm an ensemble.

The following is a proposal for a local session, to support a larger number of connections.

1. The idea is to introduce a new type of session - local session. A local session doesn't have the full functionality of a normal session.
2. Local sessions cannot create ephemeral nodes.
3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good.
4. When a local session connects, the session info is only maintained on the zookeeper server (in this case, an observer) that it is connected to. The leader is not aware of the creation of such a session and there is no state written to disk.
5. The pings and expiration are handled by the server that the session is connected to.

With the above changes, we can make ZooKeeper scale to a much larger number of clients without making the core ensemble a bottleneck.

In terms of API, there are two options that are being considered:

1. Let the client specify at connect time which kind of session they want.
2. All sessions connect as local sessions and automatically get promoted to global sessions when they do an operation that requires a global session (e.g. creating an ephemeral node).

Chubby took the approach of lazily promoting all sessions to global, but I don't think that would work in our case, where we want to keep sessions which never create ephemeral nodes as always local. Option 2 would make it more broadly usable but option 1 would be easier to implement.

We are thinking of implementing option 1 as the first cut. There would be a client flag, IsLocalSession (much like the current readOnly flag) that would be used to determine whether to create a local session or a global session.
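To make the proposal's rules easier to follow, here is a toy model of option 1 (the client picks the session type at connect time). It is only a sketch under stated assumptions, not ZooKeeper's implementation or the attached patch: the `Leader` and `Observer` classes, their methods, and the exceptions raised are all hypothetical.

```python
import itertools


class Leader:
    """Tracks only global sessions; it never hears about local ones."""

    def __init__(self):
        self.global_sessions = set()

    def register(self, session_id):
        self.global_sessions.add(session_id)


class Observer:
    """Toy observer that keeps local session state purely in memory."""

    def __init__(self, leader):
        self.leader = leader
        self.local_sessions = set()
        self._ids = itertools.count(1)

    def connect(self, is_local_session):
        session_id = next(self._ids)
        if is_local_session:
            # Point 4: state lives only on this server; the leader is not told.
            self.local_sessions.add(session_id)
        else:
            # A normal session is created through the leader like any update.
            self.leader.register(session_id)
        return session_id

    def create_node(self, session_id, path, ephemeral=False):
        if session_id in self.local_sessions:
            if ephemeral:
                # Point 2: local sessions cannot create ephemeral nodes.
                raise PermissionError("local session cannot create ephemeral nodes")
            return path
        if session_id in self.leader.global_sessions:
            return path
        raise LookupError("unknown session")

    def disconnect(self, session_id):
        # Point 3: once lost, a local session cannot be re-established.
        self.local_sessions.discard(session_id)
```

The model also shows why the rolling-update question in the comment matters: a leader that only consults `global_sessions` has no record of a local session's id.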
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697996#comment-13697996 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1147:
---------------------------------------------------

fwiw, we are using this patch in prod at Twitter so it would be awesome to have this merged. Besides what I mentioned in my previous comment (having a way to do rolling upgrades to enable local sessions) is there anything else that's left to get this merged?
[jira] [Commented] (ZOOKEEPER-1721) Ability to run without writing to disk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703403#comment-13703403 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1721:
---------------------------------------------------

In Linux this is as easy as setting dataDir and dataLogDir to somewhere inside /dev/shm (possibly other platforms support something similar). Not sure it's worth supporting this with code as it might unnecessarily complicate other sections.

Ability to run without writing to disk
--------------------------------------

Key: ZOOKEEPER-1721
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1721
Project: ZooKeeper
Issue Type: New Feature
Components: server
Affects Versions: 3.4.5
Reporter: Radim Kolar

I use zookeeper for cluster synchronization. We have no need for keeping persistent state across zookeeper restarts. As a performance enhancement, it would be good to have the possibility to run without writing snapshots and logs.
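As a rough illustration of the /dev/shm suggestion, the sketch below renders a minimal zoo.cfg whose dataDir and dataLogDir both sit under a tmpfs path. The function name, the `/dev/shm/zookeeper` base directory, and the `data`/`datalog` subdirectory names are assumptions made for the example; only the four config keys are standard ZooKeeper settings. Anything stored this way is lost on reboot.

```python
import posixpath


def render_in_memory_config(base="/dev/shm/zookeeper", client_port=2181, tick_time=2000):
    """Return zoo.cfg text that keeps snapshots and txn logs on tmpfs.

    The base directory and subdirectory names are illustrative only.
    """
    lines = [
        f"tickTime={tick_time}",
        f"clientPort={client_port}",
        # Snapshots go here.
        f"dataDir={posixpath.join(base, 'data')}",
        # Transaction logs go here.
        f"dataLogDir={posixpath.join(base, 'datalog')}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_in_memory_config(), end="")
```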
[jira] [Updated] (ZOOKEEPER-1274) Support Child watcher to be displayed with 4 letter zookeeper commands i.e, wchs,wchp,wchc
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1274:
-------------------------------------------------

Attachment: 0001-ZOOKEEPER-1274.-Display-child-watches-info-in-watch-.patch

Support Child watcher to be displayed with 4 letter zookeeper commands i.e, wchs,wchp,wchc
-------------------------------------------------------------------------------------------

Key: ZOOKEEPER-1274
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1274
Project: ZooKeeper
Issue Type: Bug
Components: server
Environment: Zookeeper Server
Reporter: amith
Assignee: Raul Gutierrez Segales
Attachments: 0001-ZOOKEEPER-1274.-Display-child-watches-info-in-watch-.patch

Currently only data watchers (created by exists() and getData()) are displayed by the wchs, wchp, and wchc four-letter commands. It would be useful to get the information related to child watchers (getChildren()) from the four-letter words as well.
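For readers unfamiliar with the four-letter words mentioned in the issue, this is a minimal sketch of how a client can issue one such as `wchs`: open a TCP connection to the server's client port, send the four ASCII characters, and read until the server closes the connection. The helper names are hypothetical, and the output format of each command is not modelled here.

```python
import socket


def send_four_letter_word(sock, command):
    """Send a four-letter command on a connected socket and return the reply."""
    if len(command) != 4:
        raise ValueError("command must be exactly four characters")
    sock.sendall(command.encode("ascii"))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            # The server closes the connection once it has replied.
            break
        chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def four_letter_word(host, port, command, timeout=5.0):
    """Connect to host:port, run one four-letter command, return its output."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return send_four_letter_word(sock, command)
```

Against a running server this would be called as `four_letter_word("localhost", 2181, "wchs")`, assuming the default client port.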
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1607:
-------------------------------------------------

Attachment: 0001-RFC-Don-t-tear-down-an-Observer-when-we-lose-connect.patch

Read-only Observer
------------------

Key: ZOOKEEPER-1607
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607
Project: ZooKeeper
Issue Type: Improvement
Components: server
Affects Versions: 3.4.3
Reporter: Thawan Kooburat
Attachments: 0001-RFC-Don-t-tear-down-an-Observer-when-we-lose-connect.patch

This feature reuses some of the mechanisms already provided by ReadOnlyZooKeeper (ZOOKEEPER-704), but is implemented in a different way.

Goal: read-only clients should be able to connect to the observer or continue to read data from the observer even when there is an outage of the underlying quorum. This means that it is possible for the observer to provide 100% read uptime for read-only local sessions (ZOOKEEPER-1147).

Implementation: The observer doesn't tear itself down when it loses its connection with the leader. It only closes the connections associated with non-read-only sessions and global sessions, so the client can try another observer if this is a temporary failure. During the outage, the observer switches to read-only mode. All pending and future write requests will get a NOT_READONLY error. The read-only state transition is sent to all sessions on that observer. The observer only accepts new connections from read-only clients.

When the observer is able to reconnect to the leader, it sends a state transition (CONNECTED_STATE) to all current sessions. If it is able to synchronize with the leader using DIFF, the stream of txns is sent through the commit processor instead of being applied to the DataTree directly, to prevent a race condition with in-flight read requests (see ZOOKEEPER-1505). The client will receive watch events correctly and can start issuing write requests. However, if the observer is getting a snapshot, it needs to drop all connections since it cannot fire watches correctly.
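The state transitions described for the read-only observer can be summarised with a small model. This is a sketch under stated assumptions rather than the patch's code: `Session`, `ObserverModel`, the method names, and the `"DIFF"`/`"SNAP"` string values are hypothetical; only the NOT_READONLY error name comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    session_id: int
    read_only: bool
    local: bool


class ObserverModel:
    """Toy model of an observer that survives losing its leader."""

    def __init__(self):
        self.read_only_mode = False
        self.sessions = {}

    def accept(self, session):
        # During an outage only read-only clients may connect.
        if self.read_only_mode and not session.read_only:
            return False
        self.sessions[session.session_id] = session
        return True

    def on_leader_lost(self):
        self.read_only_mode = True
        # Drop non-read-only sessions and global sessions; keep the rest.
        self.sessions = {
            sid: s for sid, s in self.sessions.items() if s.read_only and s.local
        }

    def write(self, session_id):
        if session_id not in self.sessions:
            raise KeyError(session_id)
        return "NOT_READONLY" if self.read_only_mode else "OK"

    def on_leader_reconnected(self, sync_mode):
        self.read_only_mode = False
        if sync_mode == "SNAP":
            # Watches cannot be fired correctly after a snapshot sync,
            # so every connection is dropped.
            self.sessions.clear()
```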
[jira] [Commented] (ZOOKEEPER-1147) Add support for local sessions
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762695#comment-13762695 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1147:
---------------------------------------------------

ping? any progress to get this merged?
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: persistent-read-only-for-observers.patch
New version of the patch with tests. Also, this is generated with git diff -p so it should be Hadoop QA friendly.
Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Attachments: persistent-read-only-for-observers.patch
This feature reuses some of the mechanisms already provided by ReadOnlyZooKeeper (ZOOKEEPER-704), but implements them in a different way.
Goal: read-only clients should be able to connect to the observer, or continue to read data from it, even when there is an outage of the underlying quorum. This means that the observer can provide 100% read uptime for read-only local sessions (ZOOKEEPER-1147).
Implementation: The observer doesn't tear itself down when it loses its connection to the leader. It only closes the connections associated with non-read-only sessions and global sessions, so those clients can try another observer if this is a temporary failure. During the outage, the observer switches to read-only mode. All pending and future write requests get a NOT_READONLY error. The read-only state transition is sent to all sessions on that observer. The observer only accepts new connections from read-only clients.
When the observer is able to reconnect to the leader, it sends a state transition (CONNECTED_STATE) to all current sessions. If it is able to synchronize with the leader using DIFF, the stream of txns is sent through the commit processor instead of being applied to the DataTree directly, to prevent a race condition with in-flight read requests (see ZOOKEEPER-1505). The client will receive watch events correctly and can start issuing write requests. However, if the observer is getting a snapshot, it needs to drop all connections since it cannot fire watches correctly.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
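The mode switch described in the ZOOKEEPER-1607 issue text can be pictured as a tiny state machine. The following is only a sketch under assumptions: the class, enum and method names are hypothetical and do not come from the attached patch or from the ZooKeeper code base. It models just two of the rules stated in the description (writes are rejected with NOT_READONLY during the outage, and only read-only clients may open new connections).

```java
// Hypothetical sketch; none of these names exist in ZooKeeper itself.
final class ReadOnlyObserverSketch {
    enum Mode { CONNECTED, READ_ONLY }
    enum Outcome { OK, NOT_READONLY }

    private Mode mode = Mode.CONNECTED;

    /** The observer stays up and switches to read-only mode. */
    void onLeaderLost() { mode = Mode.READ_ONLY; }

    /** Back in sync with the leader: writes are allowed again. */
    void onLeaderResynced() { mode = Mode.CONNECTED; }

    /** During an outage only read-only clients may open new connections. */
    boolean acceptsNewConnection(boolean clientIsReadOnly) {
        return mode == Mode.CONNECTED || clientIsReadOnly;
    }

    /** Reads are always served; writes fail while the quorum is unreachable. */
    Outcome handle(boolean isWrite) {
        return (isWrite && mode == Mode.READ_ONLY) ? Outcome.NOT_READONLY : Outcome.OK;
    }
}
```

The sketch deliberately leaves out session handling, watches and the DIFF/snapshot distinction, which are the harder parts of the proposal.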
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: (was: 0001-RFC-Don-t-tear-down-an-Observer-when-we-lose-connect.patch) Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Attachments: persistent-read-only-for-observers.patch
[jira] [Commented] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13766740#comment-13766740 ] Raul Gutierrez Segales commented on ZOOKEEPER-1607: --- Arrrg I guess dependent patches aren't applied :( Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Attachments: persistent-read-only-for-observers.patch
[jira] [Commented] (ZOOKEEPER-1681) ZooKeeper 3.4.x can optionally use netty for nio but the pom does not declare the dep as optional
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13775636#comment-13775636 ] Raul Gutierrez Segales commented on ZOOKEEPER-1681: --- I guess https://issues.apache.org/jira/browse/ZOOKEEPER-1763 would help.
ZooKeeper 3.4.x can optionally use netty for nio but the pom does not declare the dep as optional - Key: ZOOKEEPER-1681 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1681 Project: ZooKeeper Issue Type: Improvement Affects Versions: 3.4.0, 3.4.1, 3.4.2, 3.4.4, 3.4.5 Reporter: John Sirois Fix For: 3.5.0
For example in [3.4.5|http://search.maven.org/remotecontent?filepath=org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.pom] we see:
{code}
$ curl -sS http://search.maven.org/remotecontent?filepath=org/apache/zookeeper/zookeeper/3.4.5/zookeeper-3.4.5.pom | grep -B1 -A4 org.jboss.netty
<dependency>
  <groupId>org.jboss.netty</groupId>
  <artifactId>netty</artifactId>
  <version>3.2.2.Final</version>
  <scope>compile</scope>
</dependency>
{code}
As a consumer I can depend on zookeeper with an exclude for org.jboss.netty#netty or I can let my transitive dep resolver pick a winner. This might be fine, except for those who might be using a more modern netty published under the newish io.netty groupId. With this twist you get both org.jboss.netty#netty;foo and io.netty#netty;bar on your classpath, and runtime errors ensue from incompatibilities unless you add an exclude against zookeeper (and clearly don't enable the zk netty nio handling). I propose that this is a pom bug, although this is debatable. Clearly, as currently packaged, zookeeper needs netty to compile, but I'd argue that since it does not need netty to run, either the scope should be provided or optional, or a zookeeper-netty lib should be broken out as an optional dependency, and this new dep published by zookeeper can have a proper compile dependency on netty.
[jira] [Commented] (ZOOKEEPER-1638) Redundant zk.getZKDatabase().clear();
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13788293#comment-13788293 ] Raul Gutierrez Segales commented on ZOOKEEPER-1638: --- [~neilb]: did you upload the right patch? I think you wanted:
{noformat}
- // clear our own database and read
+ // db is clear as part of deserializeSnapshot()
- zk.getZKDatabase().clear();
{noformat}
Redundant zk.getZKDatabase().clear(); - Key: ZOOKEEPER-1638 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1638 Project: ZooKeeper Issue Type: Improvement Reporter: Alexander Shraer Assignee: neil bhakta Priority: Trivial Labels: newbie Fix For: 3.5.0 Attachments: ZOOKEEPER-1638.patch
Learner.syncWithLeader calls zk.getZKDatabase().clear() right before zk.getZKDatabase().deserializeSnapshot(leaderIs); Then the first thing deserializeSnapshot does is another clear(). Suggest to remove the clear() in syncWithLeader.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1784) Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1784: -- Attachment: ZOOKEEPER-1784.patch
Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus Key: ZOOKEEPER-1784 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1784 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.5.0 Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Attachments: ZOOKEEPER-1784.patch
If you look at Learner#syncWithLeader:
{noformat}
while (self.isRunning()) {
    readPacket(qp);
    switch(qp.getType()) {
    ...
    case Leader.INFORM:
    case Leader.INFORMANDACTIVATE:
        PacketInFlight packet = new PacketInFlight();
        packet.hdr = new TxnHeader();
        if (qp.getType() == Leader.COMMITANDACTIVATE) {
{noformat}
I guess qp.getType() == Leader.COMMITANDACTIVATE is a typo that should read qp.getType() == Leader.INFORMANDACTIVATE. Assigning to Alexander for now since this is part of ZOOKEEPER-107.
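To see why the quoted condition looks like dead code, here is a small stand-alone sketch. It is an illustration only: the class and method names are hypothetical, and the three constant values are placeholders chosen to be distinct, not values read from Leader. Inside a case that matches only INFORM or INFORMANDACTIVATE, a comparison against COMMITANDACTIVATE can never succeed.

```java
// Hypothetical sketch of the switch quoted in ZOOKEEPER-1784.
final class InformAndActivateSketch {
    // Placeholder values; only their distinctness matters here.
    static final int INFORM = 8;
    static final int COMMITANDACTIVATE = 9;
    static final int INFORMANDACTIVATE = 19;

    /** The check as quoted: unreachable as true inside this case. */
    static boolean activatesAsQuoted(int type) {
        switch (type) {
            case INFORM:
            case INFORMANDACTIVATE:
                return type == COMMITANDACTIVATE;
            default:
                return false;
        }
    }

    /** The check as the report suggests it should read. */
    static boolean activatesAsProposed(int type) {
        switch (type) {
            case INFORM:
            case INFORMANDACTIVATE:
                return type == INFORMANDACTIVATE;
            default:
                return false;
        }
    }
}
```

With the proposed comparison, an INFORMANDACTIVATE packet takes the activation branch while a plain INFORM packet does not.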
[jira] [Assigned] (ZOOKEEPER-1784) Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales reassigned ZOOKEEPER-1784: - Assignee: Raul Gutierrez Segales (was: Alexander Shraer) Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus Key: ZOOKEEPER-1784 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1784 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.5.0 Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Attachments: ZOOKEEPER-1784.patch
[jira] [Commented] (ZOOKEEPER-1784) Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789799#comment-13789799 ] Raul Gutierrez Segales commented on ZOOKEEPER-1784: --- [~shralex]: so that code path, processing INFORMANDACTIVATE, doesn't have (it seems) a corresponding test case. Should we add one or extend an existing one to cover it? Logic to process INFORMANDACTIVATE packets in syncWithLeader seems bogus Key: ZOOKEEPER-1784 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1784 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.5.0 Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Attachments: ZOOKEEPER-1784.patch
[jira] [Commented] (ZOOKEEPER-1586) tarballs for zkfuse don't compile out of tree
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789827#comment-13789827 ] Raul Gutierrez Segales commented on ZOOKEEPER-1586: --- Yes, I believe it is. Actually there are two issues: a) the tarball is created with missing source files and b) (more importantly) the BUILD path for the C libs is wrong. Should be:
{noformat}
-AC_CHECK_LIB(zookeeper_mt, main, [ZOOKEEPER_LD=-L${ZOOKEEPER_PATH}/.libs -lzookeeper_mt],,[-L${ZOOKEEPER_PATH}/.libs])
+ZOOKEEPER_BUILD_PATH=${BUILD_PATH}/../../../build/c
+AC_CHECK_LIB(zookeeper_mt, main, [ZOOKEEPER_LD=-L${ZOOKEEPER_BUILD_PATH}/.libs -lzookeeper_mt],,[-L${ZOOKEEPER_BUILD_PATH}/.libs])
{noformat}
And the third thing would be the configure.ac doesn't state that boost is needed. I'll update the patch.
tarballs for zkfuse don't compile out of tree - Key: ZOOKEEPER-1586 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1586 Project: ZooKeeper Issue Type: Bug Components: contrib-zkfuse Affects Versions: 3.5.0 Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Attachments: ZOOKEEPER-1586.patch
[jira] [Assigned] (ZOOKEEPER-1019) zkfuse doesn't list dependency on boost in README
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales reassigned ZOOKEEPER-1019: - Assignee: Raul Gutierrez Segales zkfuse doesn't list dependency on boost in README - Key: ZOOKEEPER-1019 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1019 Project: ZooKeeper Issue Type: Improvement Components: contrib Affects Versions: 3.4.0 Reporter: Karel Vervaeke Assignee: Raul Gutierrez Segales Original Estimate: 5m Remaining Estimate: 5m The README.txt under contrib/fuse doesn't list boost under Development build libraries -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1019) zkfuse doesn't list dependency on boost in README
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13789923#comment-13789923 ] Raul Gutierrez Segales commented on ZOOKEEPER-1019: --- [~phunt]: I haven't seen the crashes. I'll upload a patch that adds:
{noformat}
AC_CHECK_LIB([boost], [main], [], [AC_MSG_ERROR(We need boost to build zkfuse)])
{noformat}
or:
{noformat}
AC_CHECK_HEADERS([boost/shared_ptr.hpp boost/shared_array.hpp boost/date_time/gregorian/gregorian.hpp],,AC_MSG_ERROR([boost library headers not found. Please install boost library.]))
{noformat}
or something similar to configure.ac (as well as updating the README).
zkfuse doesn't list dependency on boost in README - Key: ZOOKEEPER-1019 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1019 Project: ZooKeeper Issue Type: Improvement Components: contrib Affects Versions: 3.4.0 Reporter: Karel Vervaeke Assignee: Raul Gutierrez Segales Original Estimate: 5m Remaining Estimate: 5m The README.txt under contrib/fuse doesn't list boost under Development build libraries
[jira] [Updated] (ZOOKEEPER-1019) zkfuse doesn't list dependency on boost in README
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1019: -- Attachment: ZOOKEEPER-1019
This patch fixes a couple of things:
- adds checks for log4cxx
- adds checks for boost's headers (the ones zkfuse uses)
- sets the right path to Zk C libs build which isn't src/c (it's build/c) (fixes ZOOKEEPER-1586)
- updates the README to mention boost
zkfuse doesn't list dependency on boost in README - Key: ZOOKEEPER-1019 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1019 Project: ZooKeeper Issue Type: Improvement Components: contrib Affects Versions: 3.4.0 Reporter: Karel Vervaeke Assignee: Raul Gutierrez Segales Attachments: ZOOKEEPER-1019 Original Estimate: 5m Remaining Estimate: 5m The README.txt under contrib/fuse doesn't list boost under Development build libraries
[jira] [Commented] (ZOOKEEPER-1019) zkfuse doesn't list dependency on boost in README
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790086#comment-13790086 ] Raul Gutierrez Segales commented on ZOOKEEPER-1019: --- [~phunt]: ack - patch is go. I will rebase/update ZOOKEEPER-1586 to include the remaining bits on top of this one, thanks. zkfuse doesn't list dependency on boost in README - Key: ZOOKEEPER-1019 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1019 Project: ZooKeeper Issue Type: Improvement Components: contrib Affects Versions: 3.4.0 Reporter: Karel Vervaeke Assignee: Raul Gutierrez Segales Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1019 Original Estimate: 5m Remaining Estimate: 5m The README.txt under contrib/fuse doesn't list boost under Development build libraries -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1787) Add support enabling local session in rolling upgrade
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790692#comment-13790692 ] Raul Gutierrez Segales commented on ZOOKEEPER-1787: --- [~thawan]: would you like me to rebase the patch I proposed in ZOOKEEPER-1147 (with the comments you suggested) or do you have a better/different approach? Add support enabling local session in rolling upgrade - Key: ZOOKEEPER-1787 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1787 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Thawan Kooburat Priority: Minor Currently, local sessions need to be enabled by stopping the entire ensemble. If a rolling upgrade is used, all write requests from a local session will fail with a session-moved error until local sessions are enabled on the leader.
[jira] [Created] (ZOOKEEPER-1788) Support clientID field on connection requests
Raul Gutierrez Segales created ZOOKEEPER-1788: - Summary: Support clientID field on connection requests Key: ZOOKEEPER-1788 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1788 Project: ZooKeeper Issue Type: Improvement Reporter: Raul Gutierrez Segales Priority: Minor
I suspect it's very common for deployments to have a wide variety of client libraries (different versions/languages) connecting to a given cluster. It would be handy to have a way to identify clients via a clientID (akin to HTTP's User-Agent header). This could be implemented in ZooKeeperServer#processConnectRequest [1] and be fully backwards compatible. The clientID could then be kept with the corresponding ServerCnxn instance and be used for better logging (or stats exposed through 4-letter commands). The corresponding client-side change would be to expose an API to set the clientID on each connection handler (and by default it could be something like zk java $version, zk c $version, etc). Thoughts?
[1] https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L797
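One way to picture the backwards-compatible part of the ZOOKEEPER-1788 proposal is an optional trailing field: old clients simply stop after the fields they already send, and the server falls back to a default. The sketch below assumes a made-up wire layout (an int timeout optionally followed by a UTF string); it is not ZooKeeper's actual ConnectRequest encoding, and every name in it is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout: int timeout, optionally followed by a UTF clientID.
final class ClientIdSketch {
    static final String UNKNOWN = "unknown";

    /** Encodes a request; pass null to imitate a client without clientID support. */
    static byte[] encode(int timeoutMs, String clientIdOrNull) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(timeoutMs);
            if (clientIdOrNull != null) {
                out.writeUTF(clientIdOrNull); // newer clients append the extra field
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the clientID if present, otherwise returns a default. */
    static String readClientId(byte[] request) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(request));
            in.readInt(); // field that every client sends
            return in.available() > 0 ? in.readUTF() : UNKNOWN;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The design choice here is that absence of the field is not an error, which is what keeps older clients working unchanged.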
[jira] [Created] (ZOOKEEPER-1789) 3.4.x observer causes NPE on 3.5.0 (trunk) participants
Raul Gutierrez Segales created ZOOKEEPER-1789: - Summary: 3.4.x observer causes NPE on 3.5.0 (trunk) participants Key: ZOOKEEPER-1789 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1789 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Alexander Shraer
(Assigning to Alex because this was introduced by ZOOKEEPER-107, but I will upload a patch as well.)
I have a 5-participant cluster running what will be 3.5.0 (i.e., trunk as of today) and an observer running 3.4 (trunk from the 3.4 branch). When the observer tries to establish a connection to the participants I get:
{noformat}
Thread Thread[10.40.78.121:3888,5,main] died
java.lang.NullPointerException
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:240)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:552)
{noformat}
Looking at QuorumCnxManager.java:240:
{noformat}
if (protocolVersion >= 0) { // this is a server id and not a protocol version
    sid = protocolVersion;
    electionAddr = self.getVotingView().get(sid).electionAddr;
} else {
{noformat}
and self.getVotingView().get(sid) will be null for Observers. So this block should cover that case.
-- This message was sent by Atlassian JIRA (v6.1#6144)
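The failure mode in ZOOKEEPER-1789 is a map lookup whose result is dereferenced without a null check. A minimal sketch of a guarded lookup follows. It is a simplification under assumptions: the voting view is modelled as a plain map from server id to election address, and the class and method names are hypothetical, not the fix that was eventually committed.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the voting view only contains participants,
// so an observer's sid has no entry and the lookup returns null.
final class VotingViewLookupSketch {
    /** Returns the election address for sid, or empty for non-voting peers. */
    static Optional<InetSocketAddress> electionAddr(Map<Long, InetSocketAddress> votingView, long sid) {
        return Optional.ofNullable(votingView.get(sid));
    }
}
```

A caller can then decide explicitly what to do when the connecting peer is an observer, instead of failing with a NullPointerException in the listener thread.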
[jira] [Commented] (ZOOKEEPER-1633) Introduce a protocol version to connection initiation message
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792124#comment-13792124 ] Raul Gutierrez Segales commented on ZOOKEEPER-1633: --- Are observers mandated to have -1 as their sid? (lots of test cases in the code base (and deployments!) use something > 0). Plus the docs don't indicate that -1 should be used: http://zookeeper.apache.org/doc/r3.3.1/zookeeperObservers.html.
Introduce a protocol version to connection initiation message - Key: ZOOKEEPER-1633 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1633 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Alexander Shraer Assignee: Alexander Shraer Fix For: 3.4.6 Attachments: ZOOKEEPER-1633.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-ver2.patch, ZOOKEEPER-1633-ver3.patch
Currently the first message a server sends to another server includes just one field - the server's id (long). This is in QuorumCnxManager.java. This makes changes to the information passed during this initial connection very difficult. This patch will change the first field of the message to be a protocol version (a negative number that can't be a server id). The second field will be the server id. The third field is the number of bytes in the remainder of the message. A 3.4 server will read the first field as before, but if this is a negative number it will read the second field to find the server id, and then remove the remainder of the message from the stream. This will not affect 3.4, since 3.4 and earlier servers send just the server id (so the code in the patch will not run unless there is a server > 3.4 trying to connect). This will, however, provide the necessary flexibility for future releases as well as an upgrade path from 3.4.
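The three-field layout in the ZOOKEEPER-1633 description (negative protocol version, server id, length of the remainder) can be sketched as a small reader. This is only an illustration of the description, not the patch itself: the class name, helper methods, the sample version value and the size cap are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical reader for the initial server-to-server message.
final class InitialMessageSketch {
    static final int MAX_REMAINDER = 1 << 20; // arbitrary cap for the sketch

    static long readServerId(byte[] message) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(message));
            long first = in.readLong();
            if (first >= 0) {
                return first;              // legacy format: the only field is the server id
            }
            long sid = in.readLong();      // versioned format: second field is the server id
            int remaining = in.readInt();  // third field: bytes left in the message
            if (remaining < 0 || remaining > MAX_REMAINDER) {
                throw new IOException("bad remainder length: " + remaining);
            }
            in.readFully(new byte[remaining]); // drop what this reader does not understand
            return sid;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a versioned message; pass a negative version. */
    static byte[] versioned(long version, long sid, byte[] rest) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeLong(version);
            out.writeLong(sid);
            out.writeInt(rest.length);
            out.write(rest);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a legacy message that carries only the server id. */
    static byte[] legacy(long sid) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeLong(sid);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because a valid server id is never negative in this scheme, the sign of the first field is enough to tell the two formats apart.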
[jira] [Commented] (ZOOKEEPER-1633) Introduce a protocol version to connection initiation message
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792126#comment-13792126 ] Raul Gutierrez Segales commented on ZOOKEEPER-1633: --- Ah - never mind. Alex clarified this in ZOOKEEPER-1789. Introduce a protocol version to connection initiation message - Key: ZOOKEEPER-1633 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1633 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Alexander Shraer Assignee: Alexander Shraer Fix For: 3.4.6 Attachments: ZOOKEEPER-1633.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-ver2.patch, ZOOKEEPER-1633-ver3.patch
[jira] [Commented] (ZOOKEEPER-1633) Introduce a protocol version to connection initiation message
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13792141#comment-13792141 ] Raul Gutierrez Segales commented on ZOOKEEPER-1633: --- Now that I come to think of it, I might have seen it in prod. Introduce a protocol version to connection initiation message - Key: ZOOKEEPER-1633 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1633 Project: ZooKeeper Issue Type: Bug Components: server Reporter: Alexander Shraer Assignee: Alexander Shraer Fix For: 3.4.6 Attachments: ZOOKEEPER-1633.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-v4.patch, ZOOKEEPER-1633-ver2.patch, ZOOKEEPER-1633-ver3.patch
[jira] [Created] (ZOOKEEPER-1792) Observers don't need to keep the an in-memory copy of last commited proposals
Raul Gutierrez Segales created ZOOKEEPER-1792: - Summary: Observers don't need to keep the an in-memory copy of last commited proposals Key: ZOOKEEPER-1792 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1792 Project: ZooKeeper Issue Type: Improvement Reporter: Raul Gutierrez Segales Priority: Minor
In FinalRequestProcessor.java#processRequest we have:
{noformat}
if (request.isQuorum()) {
    zks.getZKDatabase().addCommittedProposal(request);
}
{noformat}
but this is only useful to the leader since committed proposals are only used from LearnerHandler to sync up followers. I presume followers do need it as they might become a leader at any point. But observers have no need for them, so we could probably special case this for them and optimize the path for them.
-- This message was sent by Atlassian JIRA (v6.1#6144)
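The special case suggested in ZOOKEEPER-1792 boils down to one extra condition on the server's role. The sketch below shows that decision in isolation; the Role enum and the method name are hypothetical and stand in for however the real server would know that it is an observer.

```java
// Hypothetical sketch of the proposed special case.
final class CommittedProposalPolicySketch {
    enum Role { LEADER, FOLLOWER, OBSERVER }

    /**
     * Leaders use the committed-proposal log to sync learners, and followers
     * may become leaders, so both keep it. Observers never lead, so they skip it.
     */
    static boolean shouldKeep(Role role, boolean isQuorumRequest) {
        return isQuorumRequest && role != Role.OBSERVER;
    }
}
```

The existing behaviour for leaders and followers is unchanged; only the observer path avoids the extra in-memory copy.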
[jira] [Commented] (ZOOKEEPER-1795) unable to build c client on ubuntu
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13794400#comment-13794400 ] Raul Gutierrez Segales commented on ZOOKEEPER-1795: --- FWIW, this happens in Fedora 19 too. This patch works for me:
{noformat}
diff --git a/src/c/tests/TestReconfigServer.cc b/src/c/tests/TestReconfigServer.cc
index 90bf6f6..a847b37 100644
--- a/src/c/tests/TestReconfigServer.cc
+++ b/src/c/tests/TestReconfigServer.cc
@@ -16,6 +16,7 @@
  */
 #include <algorithm>
 #include <cppunit/extensions/HelperMacros.h>
+#include <unistd.h>
 #include "zookeeper.h"
 
 #include "Util.h"
diff --git a/src/c/tests/ZooKeeperQuorumServer.cc b/src/c/tests/ZooKeeperQuorumServer.cc
index f8049d2..23392cd 100644
--- a/src/c/tests/ZooKeeperQuorumServer.cc
+++ b/src/c/tests/ZooKeeperQuorumServer.cc
@@ -21,6 +21,7 @@
 #include <cstdlib>
 #include <fstream>
 #include <sstream>
+#include <unistd.h>
 
 ZooKeeperQuorumServer::
 ZooKeeperQuorumServer(uint32_t id, uint32_t numServers) :
{noformat}
unable to build c client on ubuntu -- Key: ZOOKEEPER-1795 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1795 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.5.0 Reporter: Patrick Hunt Priority: Blocker Fix For: 3.5.0
Seems there is an issue for Ubuntu (I'm on 13.04), however I'm only seeing it on trunk and not branch 34
{noformat}
make check
make zktest-st zktest-mt
make[1]: Entering directory `/home/phunt/dev/svn/svn-zookeeper/src/c'
g++ -DHAVE_CONFIG_H -I. -I./include -I./tests -I./generated -DUSE_STATIC_LIB -DZKSERVER_CMD=\"./tests/zkServer.sh\" -DZOO_IPV6_ENABLED -g -O2 -MT zktest_st-TestReconfigServer.o -MD -MP -MF .deps/zktest_st-TestReconfigServer.Tpo -c -o zktest_st-TestReconfigServer.o `test -f 'tests/TestReconfigServer.cc' || echo './'`tests/TestReconfigServer.cc
tests/TestReconfigServer.cc: In member function 'bool TestReconfigServer::waitForConnected(zhandle_t*, uint32_t)':
tests/TestReconfigServer.cc:128:16: error: 'sleep' was not declared in this scope
make[1]: *** [zktest_st-TestReconfigServer.o] Error 1
make[1]: Leaving directory `/home/phunt/dev/svn/svn-zookeeper/src/c'
make: *** [check-am] Error 2
{noformat}
I have
{noformat}
g++ --version
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
{noformat}
[jira] [Created] (ZOOKEEPER-1796) Move common code from {Follower, Observer}ZooKeeperServer into LearnerZooKeeperServer
Raul Gutierrez Segales created ZOOKEEPER-1796: - Summary: Move common code from {Follower, Observer}ZooKeeperServer into LearnerZooKeeperServer Key: ZOOKEEPER-1796 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1796 Project: ZooKeeper Issue Type: Improvement Reporter: Raul Gutierrez Segales Priority: Trivial Since ZOOKEEPER-1552 we are enabling syncProcessor in Observers, so we should have a proper shutdown() method there. Since FollowerZooKeeperServer already has one, which does the same thing that we need, move that to LearnerZooKeeperServer along with some related instance variables. -- This message was sent by Atlassian JIRA (v6.1#6144)
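The refactoring proposed in ZOOKEEPER-1796 is a plain pull-up of a method and its fields into the shared base class. Here is a minimal sketch of that shape, using hypothetical stand-in classes rather than the real FollowerZooKeeperServer, ObserverZooKeeperServer and LearnerZooKeeperServer, which carry far more state.

```java
// Hypothetical stand-ins for the learner server hierarchy.
abstract class LearnerServerSketch {
    interface Processor { void shutdown(); }

    protected Processor syncProcessor; // now shared by followers and observers

    /** Single shutdown path inherited by both subclasses. */
    void shutdown() {
        if (syncProcessor != null) {
            syncProcessor.shutdown();
        }
    }
}

final class FollowerServerSketch extends LearnerServerSketch {
    FollowerServerSketch(Processor p) { syncProcessor = p; }
}

final class ObserverServerSketch extends LearnerServerSketch {
    ObserverServerSketch(Processor p) { syncProcessor = p; }
}
```

With the method in the base class, an observer that enables its sync processor gets the same shutdown behaviour as a follower without duplicating code.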
[jira] [Updated] (ZOOKEEPER-1796) Move common code from {Follower, Observer}ZooKeeperServer into LearnerZooKeeperServer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1796: -- Attachment: ZOOKEEPER-1796.patch Move common code from {Follower, Observer}ZooKeeperServer into LearnerZooKeeperServer - Key: ZOOKEEPER-1796 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1796 Project: ZooKeeper Issue Type: Improvement Reporter: Raul Gutierrez Segales Priority: Trivial Attachments: ZOOKEEPER-1796.patch Since ZOOKEEPER-1552 we are enabling syncProcessor in Observers, so we should have a proper shutdown() method there. Since FollowerZooKeeperServer already has one, which does the same thing that we need, move that to LearnerZooKeeperServer along with some related instance variables. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1795) unable to build c client on ubuntu
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1795: -- Attachment: ZOOKEEPER-1795.patch unistd.h is needed for sleep(). unable to build c client on ubuntu -- Key: ZOOKEEPER-1795 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1795 Project: ZooKeeper Issue Type: Bug Components: c client Affects Versions: 3.5.0 Reporter: Patrick Hunt Priority: Blocker Fix For: 3.5.0 Attachments: ZOOKEEPER-1795.patch
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: (was: persistent-read-only-for-observers.patch) Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Raul Gutierrez Segales Fix For: 3.5.0 This feature reuses some of the mechanisms already provided by ReadOnlyZooKeeper (ZOOKEEPER-704) but is implemented in a different way. Goal: read-only clients should be able to connect to the observer, or continue to read data from the observer, even when there is an outage of the underlying quorum. This means that it is possible for the observer to provide 100% read uptime for read-only local sessions (ZOOKEEPER-1147). Implementation: The observer doesn't tear itself down when it loses its connection with the leader. It only closes the connections associated with non-read-only sessions and global sessions, so the client can try another observer if this is a temporary failure. During the outage, the observer switches to read-only mode. All pending and future write requests get a NOT_READONLY error. The read-only state transition is sent to all sessions on that observer. The observer only accepts new connections from read-only clients. When the observer is able to reconnect to the leader, it sends a state transition (CONNECTED_STATE) to all current sessions. If it is able to synchronize with the leader using DIFF, the stream of txns is sent through the commit processor instead of being applied to the DataTree directly, to prevent a race condition with in-flight read requests (see ZOOKEEPER-1505). The client will receive watch events correctly and can start issuing write requests. However, if the observer is getting a snapshot, it needs to drop all connections since it cannot fire watches correctly.
-- This message was sent by Atlassian JIRA (v6.1#6144)
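The mode transitions in the ZOOKEEPER-1607 description can be sketched as a small state machine. This is a hypothetical model written only from the prose above, not code from the patch: the class ReadOnlyObserverSketch, its enums and its method names are all invented for illustration, and the real implementation also has to deal with watches, the commit processor and session state notifications, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the described behaviour; not ZooKeeper code.
class ReadOnlyObserverSketch {
    enum Mode { CONNECTED, READ_ONLY }
    enum SyncType { DIFF, SNAP }
    enum WriteResult { FORWARDED, NOT_READONLY }

    static final class Session {
        final boolean readOnly;
        final boolean local;

        Session(boolean readOnly, boolean local) {
            this.readOnly = readOnly;
            this.local = local;
        }
    }

    private Mode mode = Mode.CONNECTED;
    private final List<Session> sessions = new ArrayList<>();

    // While read-only, only read-only clients may open new connections.
    boolean accept(Session s) {
        if (mode == Mode.READ_ONLY && !s.readOnly) {
            return false;
        }
        sessions.add(s);
        return true;
    }

    // Losing the leader keeps the observer up but drops every session
    // that is not both read-only and local.
    void onLeaderLost() {
        mode = Mode.READ_ONLY;
        sessions.removeIf(s -> !s.readOnly || !s.local);
    }

    // Writes are rejected during the outage.
    WriteResult write() {
        return mode == Mode.READ_ONLY ? WriteResult.NOT_READONLY : WriteResult.FORWARDED;
    }

    // A DIFF sync keeps existing sessions; a snapshot sync drops them all,
    // since watches could not be fired correctly.
    void onLeaderReconnected(SyncType sync) {
        if (sync == SyncType.SNAP) {
            sessions.clear();
        }
        mode = Mode.CONNECTED;
    }

    Mode mode() {
        return mode;
    }

    int sessionCount() {
        return sessions.size();
    }
}
```

Keeping the sessions across a DIFF sync and clearing them on SNAP mirrors the last two sentences of the description; whether read-only global sessions should be accepted during the outage is not stated there, so the sketch simply checks the read-only flag.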
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: ZOOKEEPER-1607.patch Chatted with Thawan about this and this still probably has to change but I wanted to go ahead and post this for any interested passerby (since local sessions support has been merged). Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Raul Gutierrez Segales Fix For: 3.5.0 Attachments: ZOOKEEPER-1607.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: (was: ZOOKEEPER-1607.patch) Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Raul Gutierrez Segales Fix For: 3.5.0 Attachments: ZOOKEEPER-1607.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1607) Read-only Observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1607: -- Attachment: ZOOKEEPER-1607.patch The prev patch had remaining bits and pieces of an internal patch to keep stats using Twitter's stats-util - soz. Read-only Observer -- Key: ZOOKEEPER-1607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1607 Project: ZooKeeper Issue Type: Improvement Components: server Affects Versions: 3.4.3 Reporter: Thawan Kooburat Assignee: Raul Gutierrez Segales Fix For: 3.5.0 Attachments: ZOOKEEPER-1607.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807102#comment-13807102 ] Raul Gutierrez Segales commented on ZOOKEEPER-1732: --- [~fpj], [~abranzyck]: did you guys test this patch when joining a cluster of servers running without this patch (i.e.: trunk, only without this patch)? After rolling the first 2 followers - in a 5 member ensemble - the 3rd follower fails to join with this: {noformat} 2013-10-28 18:43:18,134 - INFO [WorkerReceiver[myid=4]] - Notification: 4 (n.leader), 0x890415 (n.zxid), 0x6 (n.round), LOOKING (n.state), 4 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version) 2013-10-28 18:43:18,134 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING (n.state), 0 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version) 2013-10-28 18:43:18,135 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x6 (n.round), LEADING (n.state), 2 (n.sid), 0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version) 2013-10-28 18:43:18,135 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x6 (n.round), FOLLOWING (n.state), 3 (n.sid), 0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version) 2013-10-28 18:43:18,136 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING (n.state), 1 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version) {noformat} I am guessing IGNOREVALUE (0x) as the round value is causing issues? What was the expected behavior here (i.e.: when dealing with cluster members without this patch during an upgrade)? 
ZooKeeper server unable to join established ensemble Key: ZOOKEEPER-1732 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.4.5 Environment: Windows 7, Java 1.7 Reporter: Germán Blanco Assignee: Germán Blanco Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch I have a test in which I do a rolling restart of three ZooKeeper servers, and it was failing from time to time. I ran the tests in a loop until the failure showed up, and it seems that at some point one of the servers is unable to join the ensemble formed by the other two. -- This message was sent by Atlassian JIRA (v6.1#6144)
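The guess in the comment above, that a sentinel round value causes trouble in a half-upgraded ensemble, can be illustrated with a small tally. This is a sketch under stated assumptions, not ZooKeeper's actual vote counting: RoundSentinelSketch, SketchVote and the IGNORE constant are hypothetical, and the assumption being illustrated is that votes are matched by exact equality including the round. The four votes mirror the FOLLOWING/LEADING notifications in the log: two servers report a sentinel round and two report round 0x6, all for leader 2.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration; names and the sentinel value are assumptions.
class RoundSentinelSketch {
    static final long IGNORE = -1L; // stand-in for a "don't care" round value

    static final class SketchVote {
        final long leader;
        final long round;

        SketchVote(long leader, long round) {
            this.leader = leader;
            this.round = round;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SketchVote)) {
                return false;
            }
            SketchVote other = (SketchVote) o;
            return leader == other.leader && round == other.round;
        }

        @Override
        public int hashCode() {
            return Objects.hash(leader, round);
        }
    }

    // Counts votes that match the candidate exactly, round included.
    static int supporters(Map<Long, SketchVote> votes, SketchVote candidate) {
        int count = 0;
        for (SketchVote v : votes.values()) {
            if (v.equals(candidate)) {
                count++;
            }
        }
        return count;
    }

    // Counts votes that agree on the leader only.
    static int supportersIgnoringRound(Map<Long, SketchVote> votes, long leader) {
        int count = 0;
        for (SketchVote v : votes.values()) {
            if (v.leader == leader) {
                count++;
            }
        }
        return count;
    }

    static boolean hasQuorum(int supporters, int ensembleSize) {
        return supporters > ensembleSize / 2;
    }

    // Two servers report the sentinel round, two report the real round 6.
    static Map<Long, SketchVote> mixedEnsemble() {
        Map<Long, SketchVote> votes = new HashMap<>();
        votes.put(0L, new SketchVote(2L, IGNORE));
        votes.put(1L, new SketchVote(2L, IGNORE));
        votes.put(2L, new SketchVote(2L, 6L));
        votes.put(3L, new SketchVote(2L, 6L));
        return votes;
    }
}
```

Under exact matching, neither (leader 2, sentinel) nor (leader 2, round 6) reaches three supporters out of five, so a joining server would never see a quorum even though all four peers agree on the leader. Matching on the leader alone gives four supporters.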
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807565#comment-13807565 ] Raul Gutierrez Segales commented on ZOOKEEPER-1732: --- What's wrong with the round values? i.e.: the two new servers have IGNOREVALUE (sounds correct, right?) and the older followers have the current round value (i.e.: 0x6). I thought the problem would be here:
{noformat}
 * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 */
outofelection.put(n.sid, new Vote(n.leader, IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
if (termPredicate(outofelection, new Vote(n.leader, IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
        && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
{noformat}
IGNOREVALUE doesn't work here, because we are talking to un-patched cluster members. Sorry if I am completely misleading you :) That's as far as I got with my analysis today. ZooKeeper server unable to join established ensemble Key: ZOOKEEPER-1732 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.4.5 Environment: Windows 7, Java 1.7 Reporter: Germán Blanco Assignee: Germán Blanco Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (ZOOKEEPER-1804) Stat the realtime tps of zookeepr server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1804: -- Attachment: ZOOKEEPER-1804.patch We've been using this patch, so I guess something along these lines could work. Stat the realtime tps of zookeepr server Key: ZOOKEEPER-1804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1804 Project: ZooKeeper Issue Type: Improvement Components: server Reporter: Leader Ni Assignee: Leader Ni Attachments: ZOOKEEPER-1804.patch At present, when we assess whether ZooKeeper can support a business scenario, we always use the number of subscribers or the number of clients. As you know, sometimes many clients are connected to ZooKeeper but do nothing, while others run complex business logic. So we need to measure the real-time TPS of ZooKeeper. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808194#comment-13808194 ] Raul Gutierrez Segales commented on ZOOKEEPER-1732: --- Hmm, I still think this could confuse people rolling a cluster. Sounds like we should revert this for the next release unless we have a fix for it. Smooth upgrades through rolling restarts are an expectation that ZooKeeper has always maintained. ZooKeeper server unable to join established ensemble Key: ZOOKEEPER-1732 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732 Project: ZooKeeper Issue Type: Bug Components: leaderElection Affects Versions: 3.4.5 Environment: Windows 7, Java 1.7 Reporter: Germán Blanco Assignee: Germán Blanco Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1805) Don't care value in ZooKeeper election breaks rolling upgrades
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809386#comment-13809386 ] Raul Gutierrez Segales commented on ZOOKEEPER-1805: --- Testing this now - thanks for the patch [~fpj]. Don't care value in ZooKeeper election breaks rolling upgrades Key: ZOOKEEPER-1805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1805 Project: ZooKeeper Issue Type: Bug Reporter: Flavio Junqueira Priority: Blocker Attachments: ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch This is an issue that has been originally reported in ZOOKEEPER-1732. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1805) Don't care value in ZooKeeper election breaks rolling upgrades
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809419#comment-13809419 ] Raul Gutierrez Segales commented on ZOOKEEPER-1805: --- [~fpj]: did a quick test and followers joined nicely - thanks! Don't care value in ZooKeeper election breaks rolling upgrades Key: ZOOKEEPER-1805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1805 Project: ZooKeeper Issue Type: Bug Reporter: Flavio Junqueira Priority: Blocker Attachments: ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch This is an issue that has been originally reported in ZOOKEEPER-1732. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1805) Don't care value in ZooKeeper election breaks rolling upgrades
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809981#comment-13809981 ] Raul Gutierrez Segales commented on ZOOKEEPER-1805: --- Patch looks correct to me - thanks for the swift response [~fpj]. Don't care value in ZooKeeper election breaks rolling upgrades Key: ZOOKEEPER-1805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1805 Project: ZooKeeper Issue Type: Bug Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Attachments: ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch This is an issue that has been originally reported in ZOOKEEPER-1732. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1805) Don't care value in ZooKeeper election breaks rolling upgrades
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13810441#comment-13810441 ] Raul Gutierrez Segales commented on ZOOKEEPER-1805: --- 2nd test looks fine to me as well. I guess you could DRY it up a bit by having:
{noformat}
HashMap<Long, Vote> getVotes(long a, long b) {
    HashMap<Long, Vote> votes = new HashMap<Long, Vote>();
    votes.put(0L, new Vote(4L, Vote.DONTCARE, Vote.DONTCARE, a, ServerState.FOLLOWING));
    votes.put(1L, new Vote(4L, Vote.DONTCARE, Vote.DONTCARE, a, ServerState.FOLLOWING));
    votes.put(3L, new Vote(4L, 10L, 10L, b, ServerState.FOLLOWING));
    votes.put(4L, new Vote(4L, 10L, 10L, b, ServerState.LEADING));
    return votes;
}
{noformat}
but I guess copy/pasta is alright in test cases for readability (though I'd rather DRY it). Thanks [~fpj]. Don't care value in ZooKeeper election breaks rolling upgrades Key: ZOOKEEPER-1805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1805 Project: ZooKeeper Issue Type: Bug Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch This is an issue that has been originally reported in ZOOKEEPER-1732. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1805) Don't care value in ZooKeeper election breaks rolling upgrades
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811092#comment-13811092 ] Raul Gutierrez Segales commented on ZOOKEEPER-1805: --- +1. Don't care value in ZooKeeper election breaks rolling upgrades Key: ZOOKEEPER-1805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1805 Project: ZooKeeper Issue Type: Bug Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1805-b3.4.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch, ZOOKEEPER-1805.patch This is an issue that has been originally reported in ZOOKEEPER-1732. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
Raul Gutierrez Segales created ZOOKEEPER-1807: - Summary: Observers spam each other creating connections to the election addr Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Hey [~shralex], I noticed today that my Observers are spamming each other trying to open connections to the election port. I've got tons of these:
{noformat}
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14
{noformat}
and so on ad nauseam. Now, looking around I found this inside FastLeaderElection.java from when you committed ZOOKEEPER-107:
{noformat}
private void sendNotifications() {
-for (QuorumServer server : self.getVotingView().values()) {
-long sid = server.id;
-
+for (long sid : self.getAllKnownServerIds()) {
+QuorumVerifier qv = self.getQuorumVerifier();
{noformat}
Is that really desired? I suspect that is what's causing Observers to try to connect to each other (as opposed to just connecting to participants). I'll give it a try now and let you know. (Also, we use observer ids that are 0, and I saw some parts of the code that might not deal with that assumption - so it could be that too..). -- This message was sent by Atlassian JIRA (v6.1#6144)
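The difference between the two loops in that diff can be sketched with a toy ensemble. The class NotificationTargetsSketch, the Role enum and the server ids below are hypothetical; the real getVotingView() and getAllKnownServerIds() live on the quorum peer and are only mimicked here as plain functions over a map. The sketch just shows which servers end up as notification targets under each choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of notification fan-out; not ZooKeeper code.
class NotificationTargetsSketch {
    enum Role { PARTICIPANT, OBSERVER }

    // Mimics a voting view: only servers that take part in elections.
    static Set<Long> votingView(Map<Long, Role> servers) {
        Set<Long> ids = new TreeSet<>();
        for (Map.Entry<Long, Role> e : servers.entrySet()) {
            if (e.getValue() == Role.PARTICIPANT) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Mimics "all known server ids": participants and observers alike.
    static Set<Long> allKnownServerIds(Map<Long, Role> servers) {
        return new TreeSet<>(servers.keySet());
    }

    // Made-up ensemble: three participants and two observers.
    static Map<Long, Role> exampleEnsemble() {
        Map<Long, Role> servers = new LinkedHashMap<>();
        servers.put(1L, Role.PARTICIPANT);
        servers.put(2L, Role.PARTICIPANT);
        servers.put(3L, Role.PARTICIPANT);
        servers.put(9L, Role.OBSERVER);
        servers.put(13L, Role.OBSERVER);
        return servers;
    }

    public static void main(String[] args) {
        Map<Long, Role> servers = exampleEnsemble();
        System.out.println("voting view targets: " + votingView(servers));
        System.out.println("all known targets:   " + allKnownServerIds(servers));
    }
}
```

With the second loop every observer is a target of every notification round, so traffic on the election port grows with the number of observers rather than with the number of participants.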
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811782#comment-13811782 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: --- Oh - fair enough. So I suspect QuorumCnxManager isn't doing the right thing then. Will take a look. Thanks for the quick reply! Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811798#comment-13811798 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: --- Actually - my initial assessment was wrong (the spammy "There is a connection already..." message confused me). I am seeing excess traffic between Observers through the election port, but it's not due to connection attempts. I'll come back with the actual messages. Sorry if this isn't actually related to ZOOKEEPER-107, [~shralex]. Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811802#comment-13811802 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: --- Yes - absolutely [~fpj]. The amount of traffic that I am seeing between Observers through the election port is... scary. I am still trying to figure out what is going on. Will be back in a bit when I have a proper analysis. Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Raul Gutierrez Segales Fix For: 3.5.0 -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811849#comment-13811849 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

Okay - this seems to actually be related to ZOOKEEPER-107, [~shralex]. I added some debug logging and I've seen that the spam, sent to all Observers, consists of notifications:

{noformat}
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 9, peerEpoch = 130, configData = [B@5a0c0ce6
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 12, peerEpoch = 130, configData = [B@4d22fe39
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 6, peerEpoch = 130, configData = [B@346077bf
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 13, peerEpoch = 130, configData = [B@2955b776
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 11, peerEpoch = 130, configData = [B@3a7fb92d
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 14, peerEpoch = 130, configData = [B@1756575c
2013-11-02 02:33:21,341 - INFO [WorkerSender[myid=13]] - will send: leader = 3, zxid = 558362464215, electionEpoch = 5, state = OBSERVING, sid = 13, peerEpoch = 130, configData = [B@258164fc
{noformat}

As you can see, it's sending tons of notifications per second. Not good :)

With this diff in FastLeaderElection.java (i.e.: a revert of part of your change):

{noformat}
 private void sendNotifications() {
-for (long sid : self.getAllKnownServerIds()) {
+for (QuorumServer server : self.getVotingView().values()) {
+long sid = server.id;
{noformat}

Observers, of course, don't get spammed. I am guessing some condition is failing for Observers that assumes the notifications are fresh and sends them repeatedly?

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0

Hey [~shralex],

I noticed today that my Observers are spamming each other trying to open connections to the election port. I've got tons of these:

{noformat}
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12
2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14
{noformat}

and so on and so on, ad nauseam.

Now, looking around I found this inside FastLeaderElection.java from when you committed ZOOKEEPER-107:

{noformat}
 private void sendNotifications() {
-for (QuorumServer server : self.getVotingView().values()) {
-long sid = server.id;
-
+for (long sid : self.getAllKnownServerIds()) {
+QuorumVerifier qv = self.getQuorumVerifier();
{noformat}

Is that really desired? I suspect that is what's causing Observers to try to connect to each other (as opposed to just connecting to participants). I'll give it a try now and let you know.

(Also, we use observer ids that are 0, and I saw some parts of the code that might not deal with that assumption - so it could be that too..).

--
This message was sent by Atlassian JIRA (v6.1#6144)
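The two loops in the diff above differ only in who receives a notification. As a rough illustration, here is a minimal, self-contained sketch that does not use any ZooKeeper classes: `Peer`, `votingView`, `allKnownServerIds` and `sampleEnsemble` are hypothetical stand-ins for `QuorumServer`, `self.getVotingView()` and `self.getAllKnownServerIds()`, and the ensemble layout (participants 1 to 3, observers 6, 9 and 12) is an assumption made up for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class NotificationTargets {
    enum Role { PARTICIPANT, OBSERVER }

    /** Hypothetical stand-in for a configured server; not ZooKeeper's QuorumServer. */
    static final class Peer {
        final long sid;
        final Role role;

        Peer(long sid, Role role) {
            this.sid = sid;
            this.role = role;
        }
    }

    /** Ids of voting members only: the recipient set of the older loop. */
    static SortedSet<Long> votingView(Collection<Peer> peers) {
        SortedSet<Long> ids = new TreeSet<>();
        for (Peer p : peers) {
            if (p.role == Role.PARTICIPANT) {
                ids.add(p.sid);
            }
        }
        return ids;
    }

    /** Ids of every configured server, observers included: the newer loop. */
    static SortedSet<Long> allKnownServerIds(Collection<Peer> peers) {
        SortedSet<Long> ids = new TreeSet<>();
        for (Peer p : peers) {
            ids.add(p.sid);
        }
        return ids;
    }

    /** Made-up ensemble: participants 1-3, observers 6, 9 and 12. */
    static List<Peer> sampleEnsemble() {
        return Arrays.asList(
                new Peer(1, Role.PARTICIPANT),
                new Peer(2, Role.PARTICIPANT),
                new Peer(3, Role.PARTICIPANT),
                new Peer(6, Role.OBSERVER),
                new Peer(9, Role.OBSERVER),
                new Peer(12, Role.OBSERVER));
    }

    public static void main(String[] args) {
        List<Peer> ensemble = sampleEnsemble();
        // prints: voting view only: [1, 2, 3]
        System.out.println("voting view only: " + votingView(ensemble));
        // prints: all known ids:    [1, 2, 3, 6, 9, 12]
        System.out.println("all known ids:    " + allKnownServerIds(ensemble));
    }
}
```

With the wider recipient set, an observer's notifications also reach the other observers, which is consistent with the `sid = 9`, `sid = 12`, `sid = 6` targets seen in the log excerpt.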
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811850#comment-13811850 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

[~fpj]: I think this is 3.5.0 specific since it goes away when reverting those bits from ZOOKEEPER-107 (there is a chance I am overlooking something, of course, and it's some other thing). This is most likely a blocker for the 3.5.0 release, though.

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811851#comment-13811851 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

[~thawan]: should omitting the Observers from zoo.cfg actually make any difference? If so we should document it somewhere (unless it already is). In my case, where I do explicitly enumerate them, I don't get observer-to-observer connections on the election port once I remove the bits I mentioned above in FLE (so it seems to me it isn't).

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
[jira] [Commented] (ZOOKEEPER-1790) Deal with special ObserverId in QuorumCnxManager.receiveConnection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811852#comment-13811852 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1790:
---

[~fpj]: I don't think it's related - my initial assessment was wrong. It isn't connection attempts that generate the extra traffic I am seeing but the Notifications (as commented in ZOOKEEPER-1807).

Deal with special ObserverId in QuorumCnxManager.receiveConnection
--
Key: ZOOKEEPER-1790
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1790
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.4.6, 3.5.0
Reporter: Alexander Shraer
Assignee: Alexander Shraer
Fix For: 3.4.6, 3.5.0

QuorumCnxManager.receiveConnection assumes that a negative sid means that this is a 3.5.0 server, which has a different communication protocol. This doesn't account for the fact that ObserverId = -1 is a special id that may be used by observers and is also negative.

This requires a fix to trunk and a separate fix to 3.4 branch, where this function is different (see ZOOKEEPER-1633)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811858#comment-13811858 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

I think what's happening is that when we send the initial notifications to all members, as opposed to just voting members as it was before, we trigger a self-replicating cascade of notifications. Each Observer gets the notification and then, by virtue of:

{noformat}
/*
 * If it is from a non-voting server (such as an observer or
 * a non-voting follower), respond right away.
 */
if(!self.getVotingView().containsKey(response.sid)){
    .
}
{noformat}

it replies back to each Observer, and so on. So it sounds to me that this needs to match what we have in sendNotifications and actually check response.sid against self.getAllKnownServerIds() to avoid the endless echoing of notifications that I am seeing.

Thoughts [~shralex], [~fpj]?

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
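The echo described in this comment can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions, not FastLeaderElection code: every node is a non-voting observer, delivery is a simple FIFO queue, and `NotificationEcho`, `Msg` and `simulate` are hypothetical names. The `observersReply` flag models the "respond right away to any non-voting sender" branch; turning it off corresponds roughly to the idea of not replying when the receiver is itself an observer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class NotificationEcho {
    /** A notification in flight from one observer to another (hypothetical type). */
    static final class Msg {
        final int from;
        final int to;

        Msg(int from, int to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Delivers notifications among observers 0..observers-1 until the queue
     * drains or maxDeliveries is reached, and returns the number delivered.
     */
    static int simulate(int observers, boolean observersReply, int maxDeliveries) {
        Deque<Msg> queue = new ArrayDeque<>();
        // Observer 0 sends its initial notification to every other observer.
        for (int to = 1; to < observers; to++) {
            queue.add(new Msg(0, to));
        }
        int delivered = 0;
        while (!queue.isEmpty() && delivered < maxDeliveries) {
            Msg m = queue.poll();
            delivered++;
            // The sender is never a voting member in this model, so a
            // "reply to non-voting senders" rule fires on every delivery.
            if (observersReply) {
                queue.add(new Msg(m.to, m.from));
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Replies enabled: the queue never drains, so the cap is hit. Prints 1000.
        System.out.println(simulate(5, true, 1000));
        // Replies disabled: only the 4 initial notifications are delivered. Prints 4.
        System.out.println(simulate(5, false, 1000));
    }
}
```

In the first run every delivery enqueues exactly one reply, so the traffic is bounded only by the artificial cap; in the second run it stops after the initial fan-out.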
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Attachment: ZOOKEEPER-1807.patch

The attached patch prevents sending replies back when we are an Observer. Since ZOOKEEPER-107 we send notifications to Observers because they can be promoted to Participants. But to avoid replicating replies forever (i.e.: an observer sends a notification and the receiving observer then sends another one and so on) we don't have to send notifications when we are a LearnerType.OBSERVER.

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13811993#comment-13811993 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

Well, if we really need observer-to-observer responses, for reconfig purposes I presume, then should we be sending them to observers not in LOOKING state? See the conditions that apply when responding to participants in the lines below my patch.

But even with that being correct, it might be too much overhead for large Observer deployments. Should this be optional?

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812100#comment-13812100 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

Yeah - I think you are right. In this ZOOKEEPER-107 world in which observers can be promoted, etc., the initial if() doesn't make sense anymore. I'll submit a new patch so we can think about it a bit more.

With regard to the overhead and making all of this optional: well, if you have 100 observers restarted at once you'll have a large amount of notification traffic. But I guess it stays within tolerable limits.

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Attachment: (was: ZOOKEEPER-1807.patch)

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Attachment: ZOOKEEPER-1807.patch

As discussed with [~shralex], we now need to apply the same response logic for voting and non-voting members.

[~fpj], [~thawan] - thoughts?

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Issue Type: Bug (was: New Feature)

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Attachment: (was: ZOOKEEPER-1807.patch)

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13813077#comment-13813077 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1807:
---

Thanks for the quick comment, Alex. Yeah, it sounds to me like that might be acceptable. Again, for huge deployments it might be a bit of a concern since you'll be putting extra pressure on the cluster after, say, a big network partition.

Thoughts? Cc: [~thawan], [~fpj].

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Raul Gutierrez Segales
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch
[jira] [Updated] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1807:
--
Attachment: notifications-loop.png

Here's how notification traffic (on election port 3888 in my case) goes down with the patch (i.e.: without the notifications loop). It's pretty dramatic so I'd say this is definitely a blocker for 3.5.0.

Observers spam each other creating connections to the election addr
---
Key: ZOOKEEPER-1807
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
Project: ZooKeeper
Issue Type: Bug
Reporter: Raul Gutierrez Segales
Assignee: Germán Blanco
Fix For: 3.5.0
Attachments: ZOOKEEPER-1807.patch, notifications-loop.png
[jira] [Commented] (ZOOKEEPER-1804) Stat the realtime tps of zookeepr server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13816056#comment-13816056 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1804:
---

Small styling nits, in things like:

{noformat}
+ if ( zkServer == null ) {
+ pw.println( ZK_NOT_SERVING );
{noformat}

the spaces after ( and before ) aren't used in the rest of the code.

Also - at the cost of introducing another dependency though - you might want to check Twitter's stats package, which has convenience classes/methods for keeping stats (also useful for the case of write/read latency to keep p99, etc):

http://twitter.github.io/commons/apidocs/#com.twitter.common.stats.Stats

We use this in an internal branch atm.

Stat the realtime tps of zookeepr server
---
Key: ZOOKEEPER-1804
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1804
Project: ZooKeeper
Issue Type: Improvement
Components: server
Reporter: Leader Ni
Assignee: Leader Ni
Fix For: 3.5.0
Attachments: ZOOKEEPER-1804.patch, ZOOKEEPER-1804.patch

At present, when we assess whether ZooKeeper can support a given business scenario, we always go by the number of subscribers or the number of clients. But sometimes many clients connect to ZooKeeper and do nothing, while others run complex business logic. So we need to measure the real-time TPS of ZooKeeper.

[Solution]

Solution1: If you only want to know the real-time number of transactions processed, you can use the patch ZOOKEEPER-1804.patch.

Solution2: If you also want to know how clients use ZooKeeper, and the real-time reads/writes per second of each ZooKeeper client, you can use the patch ZOOKEEPER-1804-2.patch. Use the Java property -Dserver_process_stats=true to enable the function.

Sample:

$ echo rwps|nc localhost 2181
RealTime R/W Statistics:
getChildren2: 0.5994005994005994
createSession: 1.6983016983016983
closeSession: 0.999000999000999
setData: 110.18981018981019
setWatches: 129.17082917082917
getChildren: 68.83116883116884
delete: 19.980019980019982
create: 22.27772227772228
exists: 1806.2937062937062
getDate: 729.5704295704296
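The per-operation figures in the sample read as rates, i.e. a request count divided by the elapsed time. A minimal sketch of that arithmetic follows, assuming a simple count-then-divide approach; `OpRateTracker`, `record` and `ratePerSecond` are hypothetical names, and this is neither the code in the attached patches nor the Twitter commons `Stats` API mentioned in the comment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class OpRateTracker {
    // One lock-free counter per operation name (e.g. "setData", "exists").
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Counts one processed request of the given operation type. */
    public void record(String op) {
        counts.computeIfAbsent(op, k -> new LongAdder()).increment();
    }

    /**
     * Returns requests per second for an operation over a window whose length
     * the caller supplies; 0.0 if nothing was recorded or the window is empty.
     */
    public double ratePerSecond(String op, long elapsedMillis) {
        LongAdder counter = counts.get(op);
        if (counter == null || elapsedMillis <= 0) {
            return 0.0;
        }
        return counter.sum() * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        OpRateTracker tracker = new OpRateTracker();
        for (int i = 0; i < 110; i++) {
            tracker.record("setData");
        }
        // 110 requests over a 1000 ms window. Prints: setData: 110.0
        System.out.println("setData: " + tracker.ratePerSecond("setData", 1000));
    }
}
```

Passing the window length in explicitly keeps the sketch deterministic; a real server would presumably derive it from a clock and reset the counters per reporting interval.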
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13816152#comment-13816152 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: --- [~shralex]: do you think that, perhaps, adding a comment elaborating a bit more on the rationale of notifications and the state of the new/old config would be worthwhile? I am thinking the comment should be along sendNotifications(). Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Alexander Shraer Fix For: 3.5.0 Attachments: ZOOKEEPER-1807-alex.patch, ZOOKEEPER-1807-ver2.patch, ZOOKEEPER-1807-ver3.patch, ZOOKEEPER-1807.patch, notifications-loop.png Hey [~shralex], I noticed today that my Observers are spamming each other trying to open connections to the election port. I've got tons of these: {noformat} 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14 {noformat} and so and so on ad nauseam. Now, looking around I found this inside FastLeaderElection.java from when you committed ZOOKEEPER-107: {noformat} private void sendNotifications() { -for (QuorumServer server : self.getVotingView().values()) { -long sid = server.id; - +for (long sid : self.getAllKnownServerIds()) { +QuorumVerifier qv = self.getQuorumVerifier(); {noformat} Is that really desired? 
I suspect that is what's causing Observers to try to connect to each other (as opposed to just connecting to participants). I'll give it a try now and let you know. (Also, we use observer ids that are 0, and I saw some parts of the code that might not deal with that assumption - so it could be that too..). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13816157#comment-13816157 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: ---
(Could we get a reviewboard for this? Some inline comments below.)
For:
{noformat}
+// start server 3 with new config
+zk[2] = new ZooKeeper("127.0.0.1:" + ports[2][2], ClientBase.CONNECTION_TIMEOUT, this);
{noformat}
I think the zk[2] assignment goes before the comment.
For:
{noformat}
+for (int i=2; i<3; i++) {
+    Assert.assertTrue("waiting for server " + i + " being up",
+        ClientBase.waitForServerUp("127.0.0.1:" + ports[i][2],
+            CONNECTION_TIMEOUT * 2));
+    ReconfigTest.testServerHasConfig(zk[i], allServersNext, null);
+}
{noformat}
i <= 3? Or no loop if you only want it to loop one time, I guess.
Also, the ports assignment loop and the currentQuorumCfgSection creation are repeated in testObserverConvertedToParticipantDuringFLE and testCurrentObserverIsParticipantInNewConfig; mind DRY-ing this up a bit by putting those in private methods (e.g. generatePorts() and generateInitialConfig() or some such)?
Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Alexander Shraer Fix For: 3.5.0 Attachments: ZOOKEEPER-1807-alex.patch, ZOOKEEPER-1807-ver2.patch, ZOOKEEPER-1807-ver3.patch, ZOOKEEPER-1807.patch, notifications-loop.png Hey [~shralex], I noticed today that my Observers are spamming each other trying to open connections to the election port. 
I've got tons of these: {noformat} 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14 {noformat} and so and so on ad nauseam. Now, looking around I found this inside FastLeaderElection.java from when you committed ZOOKEEPER-107: {noformat} private void sendNotifications() { -for (QuorumServer server : self.getVotingView().values()) { -long sid = server.id; - +for (long sid : self.getAllKnownServerIds()) { +QuorumVerifier qv = self.getQuorumVerifier(); {noformat} Is that really desired? I suspect that is what's causing Observers to try to connect to each other (as opposed as just connecting to participants). I'll give it a try now and let you know. (Also, we use observer ids that are 0, and I saw some parts of the code that might not deal with that assumption - so it could be that too..). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1808) Add version to FLE notifications for 3.4 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820599#comment-13820599 ] Raul Gutierrez Segales commented on ZOOKEEPER-1808: ---
Some stylistic nits:
{noformat}
+requestBuffer.putLong(epoch);
+requestBuffer.putInt( Notification.CURRENTVERSION );
{noformat}
No spaces between parentheses and parameters.
{noformat}
+if(response.buffer.remaining() >= 4) {
+    n.version = response.buffer.getInt();
+} else {
+    n.version = 0x0;
+}
{noformat}
More succinct:
{noformat}
+ n.version ? response.buffer.remaining() >= 4 : 0x0;
{noformat}
Nit:
{noformat}
 private void printNotification(Notification n){
-LOG.info("Notification: " + n.leader + " (n.leader), 0x"
+LOG.info("Notification: " + Long.toHexString(n.version) + " (message format version), "
...
{noformat}
Maybe that belongs as toString inside Notification?
Super nit: there are two extra newlines in src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java.
Add version to FLE notifications for 3.4 branch --- Key: ZOOKEEPER-1808 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1808 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.6 Attachments: ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch Add version to notification messages so that we can differentiate messages during rolling upgrades. This task is for the 3.4 branch only. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1808) Add version to FLE notifications for 3.4 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13820600#comment-13820600 ] Raul Gutierrez Segales commented on ZOOKEEPER-1808: ---
Sorry, I meant:
{noformat}
n.version = response.buffer.remaining() >= 4 ? response.buffer.getInt() : 0x0;
{noformat}
Add version to FLE notifications for 3.4 branch --- Key: ZOOKEEPER-1808 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1808 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.6 Attachments: ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch Add version to notification messages so that we can differentiate messages during rolling upgrades. This task is for the 3.4 branch only. -- This message was sent by Atlassian JIRA (v6.1#6144)
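The corrected one-liner can be tried in isolation. What follows is only a minimal sketch, not the patch: VersionFieldSketch and readVersion are hypothetical names, and the one assumption carried over from the thread is that the version travels as a trailing 4-byte int that older senders simply omit.

```java
import java.nio.ByteBuffer;

final class VersionFieldSketch {
    /** Reads the trailing version int if 4 bytes remain, else falls back to 0x0. */
    static int readVersion(ByteBuffer buffer) {
        return buffer.remaining() >= 4 ? buffer.getInt() : 0x0;
    }

    public static void main(String[] args) {
        // A message from a newer sender: one long field followed by a version int.
        ByteBuffer withVersion = ByteBuffer.allocate(12);
        withVersion.putLong(7L).putInt(1);
        withVersion.flip();
        withVersion.getLong(); // consume the preceding field
        System.out.println(readVersion(withVersion));

        // A message from an older sender: nothing follows the long field.
        ByteBuffer legacy = ByteBuffer.allocate(8);
        legacy.putLong(7L);
        legacy.flip();
        legacy.getLong();
        System.out.println(readVersion(legacy));
    }
}
```

The ternary keeps the "absent means version 0" rule in one expression, which is the point of the nit.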
[jira] [Commented] (ZOOKEEPER-1810) Add version to FLE notifications for trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822797#comment-13822797 ] Raul Gutierrez Segales commented on ZOOKEEPER-1810: ---
I think that:
{noformat}
+if(LOG.isInfoEnabled()){
+    LOG.info("Backward compatibility mode (36 bits), server id: " + response.sid);
+}
{noformat}
can do without the LOG.isInfoEnabled() guard, since LOG.info already performs that check and response.sid isn't computed (it's just a value being accessed, so no savings).
Add version to FLE notifications for trunk -- Key: ZOOKEEPER-1810 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1810 Project: ZooKeeper Issue Type: Sub-task Affects Versions: 3.5.0 Reporter: Flavio Junqueira Assignee: Germán Blanco Fix For: 3.5.0 Attachments: ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch The same as ZOOKEEPER-1808 but for trunk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1810) Add version to FLE notifications for trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13823302#comment-13823302 ] Raul Gutierrez Segales commented on ZOOKEEPER-1810: --- Yeah - sorry, that was a bit confusing. I guess reviewboards, if they aren't too much of a hassle, would make things a bit easier. Add version to FLE notifications for trunk -- Key: ZOOKEEPER-1810 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1810 Project: ZooKeeper Issue Type: Sub-task Affects Versions: 3.5.0 Reporter: Flavio Junqueira Assignee: Germán Blanco Fix For: 3.5.0 Attachments: ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch The same as ZOOKEEPER-1808 but for trunk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1810) Add version to FLE notifications for trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13823303#comment-13823303 ] Raul Gutierrez Segales commented on ZOOKEEPER-1810: --- (I meant for future patches - we can keep on going with this one inside the ticket if it's easier.) Add version to FLE notifications for trunk -- Key: ZOOKEEPER-1810 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1810 Project: ZooKeeper Issue Type: Sub-task Affects Versions: 3.5.0 Reporter: Flavio Junqueira Assignee: Germán Blanco Fix For: 3.5.0 Attachments: ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch The same as ZOOKEEPER-1808 but for trunk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1808) Add version to FLE notifications for 3.4 branch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13823732#comment-13823732 ] Raul Gutierrez Segales commented on ZOOKEEPER-1808: --- (Meant FLETestUtils.createMsg() is called again and again with the same params). Add version to FLE notifications for 3.4 branch --- Key: ZOOKEEPER-1808 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1808 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.6 Attachments: ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch, ZOOKEEPER-1808.patch Add version to notification messages so that we can differentiate messages during rolling upgrades. This task is for the 3.4 branch only. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1807) Observers spam each other creating connections to the election addr
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13823924#comment-13823924 ] Raul Gutierrez Segales commented on ZOOKEEPER-1807: --- I am happy to give the RB a shipit but I would prefer to have more feedback/reviews from [~thawan] and [~fpj] since they are more familiar with the internals of FLE. Observers spam each other creating connections to the election addr --- Key: ZOOKEEPER-1807 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807 Project: ZooKeeper Issue Type: Bug Reporter: Raul Gutierrez Segales Assignee: Alexander Shraer Fix For: 3.5.0 Attachments: ZOOKEEPER-1807-alex.patch, ZOOKEEPER-1807-ver2.patch, ZOOKEEPER-1807-ver3.patch, ZOOKEEPER-1807-ver4.patch, ZOOKEEPER-1807-ver5.patch, ZOOKEEPER-1807.patch, notifications-loop.png Hey [~shralex], I noticed today that my Observers are spamming each other trying to open connections to the election port. I've got tons of these: {noformat} 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14 {noformat} and so and so on ad nauseam. Now, looking around I found this inside FastLeaderElection.java from when you committed ZOOKEEPER-107: {noformat} private void sendNotifications() { -for (QuorumServer server : self.getVotingView().values()) { -long sid = server.id; - +for (long sid : self.getAllKnownServerIds()) { +QuorumVerifier qv = self.getQuorumVerifier(); {noformat} Is that really desired? 
I suspect that is what's causing Observers to try to connect to each other (as opposed to just connecting to participants). I'll give it a try now and let you know. (Also, we use observer ids that are 0, and I saw some parts of the code that might not deal with that assumption - so it could be that too..). -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1810) Add version to FLE notifications for trunk
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824620#comment-13824620 ] Raul Gutierrez Segales commented on ZOOKEEPER-1810: --- Doesn't look like https://reviews.apache.org/r/15568/ was updated? Or should we continue the review (and give the +1s) here? Add version to FLE notifications for trunk -- Key: ZOOKEEPER-1810 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1810 Project: ZooKeeper Issue Type: Sub-task Affects Versions: 3.5.0 Reporter: Flavio Junqueira Assignee: Germán Blanco Fix For: 3.5.0 Attachments: ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch, ZOOKEEPER-1810.patch The same as ZOOKEEPER-1808 but for trunk. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824622#comment-13824622 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: --- With the mix of inline and reviewboard reviews I am not sure where we should review this one :) Is there a reviewboard for this one as well or just inline? If there is mind adding the link here for posterity - thanks [~fpj]. Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch See umbrella jira. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824625#comment-13824625 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: --- Ah - the rb is https://reviews.apache.org/r/15625/. Though it's having issues - maybe try reloading? I guess reviewboard applies against the git mirrors and there was a lag in Apache's git-svn sync yesterday (i think). Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch See umbrella jira. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1653) zookeeper fails to start because of inconsistent epoch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824632#comment-13824632 ] Raul Gutierrez Segales commented on ZOOKEEPER-1653: --- I take back the last comment, I carelessly overlooked the inheriting class. zookeeper fails to start because of inconsistent epoch -- Key: ZOOKEEPER-1653 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1653 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.5 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Fix For: 3.4.6 Attachments: ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.patch, ZOOKEEPER-1653.patch It looks like QuorumPeer.loadDataBase() could fail if the server was restarted after zk.takeSnapshot() but before finishing self.setCurrentEpoch(newEpoch) in Learner.java. {code:java} case Leader.NEWLEADER: // it will be NEWLEADER in v1.0 zk.takeSnapshot(); self.setCurrentEpoch(newEpoch); // got restarted here snapshotTaken = true; writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); break; {code} The server fails to start because currentEpoch is still 1 but the last processed zkid from the snapshot has been updated. {noformat} 2013-02-20 13:45:02,733 5543 [pool-1-thread-1] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk java.io.IOException: The current epoch, 1, is older than the last zxid, 8589934592 at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:439) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:413) ... 
{noformat} {noformat} $ find datadir datadir datadir/version-2 datadir/version-2/currentEpoch.tmp datadir/version-2/acceptedEpoch datadir/version-2/snapshot.0 datadir/version-2/currentEpoch datadir/version-2/snapshot.2 $ cat datadir/version-2/currentEpoch.tmp 2% $ cat datadir/version-2/acceptedEpoch 2% $ cat datadir/version-2/currentEpoch 1% {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1653) zookeeper fails to start because of inconsistent epoch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824630#comment-13824630 ] Raul Gutierrez Segales commented on ZOOKEEPER-1653: ---
Nit in:
{noformat}
+static void writeLongToFile(File file, long value) throws IOException {
+    AtomicFileOutputStream out = new AtomicFileOutputStream(file);
+    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(out));
+    boolean aborted = false;
+    try {
+        bw.write(Long.toString(value));
+        bw.flush();
+        out.flush();
+        out.close();
+    } catch (IOException e) {
+        LOG.error("Failed to write new file " + file, e);
+        out.abort();
+        throw e;
+    }
+}
{noformat}
aborted is not used.
Nit in:
{noformat}
+LOG.info("Validating current epoch: " + servers.mt[i].dataDir);
{noformat}
use {} instead of concatenating.
Nit:
{noformat}
+// Shut down the cluster
{noformat}
should be "Shutdown the cluster".
In src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerTestBase.java:
{noformat}
+CountDownLatch mainFailed;
{noformat}
is assigned and modified but never asserted or checked?
zookeeper fails to start because of inconsistent epoch -- Key: ZOOKEEPER-1653 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1653 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.5 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Fix For: 3.4.6 Attachments: ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.patch, ZOOKEEPER-1653.patch It looks like QuorumPeer.loadDataBase() could fail if the server was restarted after zk.takeSnapshot() but before finishing self.setCurrentEpoch(newEpoch) in Learner.java. 
{code:java} case Leader.NEWLEADER: // it will be NEWLEADER in v1.0 zk.takeSnapshot(); self.setCurrentEpoch(newEpoch); // got restarted here snapshotTaken = true; writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); break; {code} The server fails to start because currentEpoch is still 1 but the last processed zkid from the snapshot has been updated. {noformat} 2013-02-20 13:45:02,733 5543 [pool-1-thread-1] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk java.io.IOException: The current epoch, 1, is older than the last zxid, 8589934592 at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:439) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:413) ... {noformat} {noformat} $ find datadir datadir datadir/version-2 datadir/version-2/currentEpoch.tmp datadir/version-2/acceptedEpoch datadir/version-2/snapshot.0 datadir/version-2/currentEpoch datadir/version-2/snapshot.2 $ cat datadir/version-2/currentEpoch.tmp 2% $ cat datadir/version-2/acceptedEpoch 2% $ cat datadir/version-2/currentEpoch 1% {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
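For readers who want to play with the write-then-rename idea behind writeLongToFile outside of ZooKeeper, here is a minimal sketch using only the JDK. It is not the patch and does not use AtomicFileOutputStream: AtomicLongFileSketch, its method names and the ".tmp" suffix are assumptions made for illustration, and the sketch does not fsync, so it makes no durability claim.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AtomicLongFileSketch {
    /** Writes value to a sibling temp file, then renames it over the target. */
    static void writeLongToFile(Path file, long value) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try {
            Files.write(tmp, Long.toString(value).getBytes(StandardCharsets.UTF_8));
            // With ATOMIC_MOVE, replacing an existing target is implementation
            // specific; POSIX rename() replaces it.
            Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            // Abort: drop the partial temp file and leave the old value in place.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    /** Reads the value back, tolerating trailing whitespace. */
    static long readLongFromFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("epoch-sketch");
        Path epochFile = dir.resolve("currentEpoch");
        writeLongToFile(epochFile, 1L);
        writeLongToFile(epochFile, 2L);
        System.out.println(readLongFromFile(epochFile));
    }
}
```

Note that the abort path needs no flag: the catch block alone decides whether to clean up, which is why the unused aborted variable in the patch can simply go.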
[jira] [Commented] (ZOOKEEPER-1653) zookeeper fails to start because of inconsistent epoch
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824634#comment-13824634 ] Raul Gutierrez Segales commented on ZOOKEEPER-1653: ---
One more nit in src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java:
{noformat}
+currentEpochFile = new File(new File(follower.dataDir, "version-2"),
+    "currentEpoch");
+File updatingEpochFile = new File(
+    new File(follower.dataDir, "version-2"),
+    QuorumPeer.UPDATING_EPOCH_FILENAME);
{noformat}
could be abbreviated with:
{noformat}
+File followerDataDir = new File(follower.dataDir, "version-2");
+currentEpochFile = new File(followerDataDir, "currentEpoch");
+File updatingEpochFile = new File(followerDataDir, QuorumPeer.UPDATING_EPOCH_FILENAME);
{noformat}
Also - should there be a constant for "currentEpoch" too?
zookeeper fails to start because of inconsistent epoch -- Key: ZOOKEEPER-1653 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1653 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.5 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Fix For: 3.4.6 Attachments: ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.3.4.patch, ZOOKEEPER-1653.patch, ZOOKEEPER-1653.patch It looks like QuorumPeer.loadDataBase() could fail if the server was restarted after zk.takeSnapshot() but before finishing self.setCurrentEpoch(newEpoch) in Learner.java. {code:java} case Leader.NEWLEADER: // it will be NEWLEADER in v1.0 zk.takeSnapshot(); self.setCurrentEpoch(newEpoch); // got restarted here snapshotTaken = true; writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); break; {code} The server fails to start because currentEpoch is still 1 but the last processed zkid from the snapshot has been updated. 
{noformat} 2013-02-20 13:45:02,733 5543 [pool-1-thread-1] ERROR org.apache.zookeeper.server.quorum.QuorumPeer - Unable to load database on disk java.io.IOException: The current epoch, 1, is older than the last zxid, 8589934592 at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:439) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:413) ... {noformat} {noformat} $ find datadir datadir datadir/version-2 datadir/version-2/currentEpoch.tmp datadir/version-2/acceptedEpoch datadir/version-2/snapshot.0 datadir/version-2/currentEpoch datadir/version-2/snapshot.2 $ cat datadir/version-2/currentEpoch.tmp 2% $ cat datadir/version-2/acceptedEpoch 2% $ cat datadir/version-2/currentEpoch 1% {noformat} -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1573) Unable to load database due to missing parent node
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824644#comment-13824644 ] Raul Gutierrez Segales commented on ZOOKEEPER-1573: ---
Nit - maybe this:
{noformat}
+ * Snapshots are lazily created. So when the snapshot was in progress
+ * there is a chance that some of the later transactions can go into
+ * snapshot. While restoring same transactions NONODE/NODEEXISTS errors
+ * can come. Basically we can ignore all errors during the restore.
{noformat}
could be clearer like this:
{noformat}
+ * Snapshots are lazily created. So when a snapshot is in progress,
+ * there is a chance for later transactions to make to into the snapshot.
+ * Then when the snapshot is restored, NONODE/NODEEXISTS errors
+ * could occur. It should be safe to ignore these.
{noformat}
Nit:
{noformat}
+LOG.warn("Intrrupted");
{noformat}
typo.
Nit:
{noformat}
+LOG.debug("Ignoring processTxn failure hdr: " + hdr.getType() + " : error: " + rc.err + " path: " + rc.path);
{noformat}
use string interpolation with {} instead of string concatenation.
Nit:
{noformat}
+/**
+ * Test we can restore a snapshot that has delete txns ahead of the zxid of the snapshot file. ZOOKEEPER-1573
+ */
{noformat}
make it:
{noformat}
+/**
+ * ZOOKEEPER-1573: test restoring a snapshot with deleted txns ahead of the snapshot file's zxid.
+ */
{noformat}
Nit:
{noformat}
+LOG.info("Set lastProcessedZxid to " + zks.getZKDatabase().getDataTreeLastProcessedZxid());
{noformat}
ditto with regard to string interpolation via {}.
Unable to load database due to missing parent node -- Key: ZOOKEEPER-1573 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1573 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3, 3.5.0 Reporter: Thawan Kooburat Attachments: ZOOKEEPER-1573.patch While replaying txnlog on data tree, the server has a code to detect missing parent node. This code block was last modified as part of ZOOKEEPER-1333. 
In our production, we found a case where this check is return false positive. The sequence of txns is as follows: zxid 1: create /prefix/a zxid 2: create /prefix/a/b zxid 3: delete /prefix/a/b zxid 4: delete /prefix/a The server start capturing snapshot at zxid 1. However, by the time it traversing the data tree down to /prefix, txn 4 is already applied and /prefix have no children. When the server restore from snapshot, it process txnlog starting from zxid 2. This txn generate missing parent error and the server refuse to start up. The same check allow me to discover bug in ZOOKEEPER-1551, but I don't know if we have any option beside removing this check to solve this issue. -- This message was sent by Atlassian JIRA (v6.1#6144)
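The scenario in the issue description can be replayed on a toy model. This is a sketch under heavy assumptions, not ZooKeeper's DataTree: FuzzySnapshotReplaySketch, its Result enum and the string-encoded txns are all made up for illustration. It starts from a snapshot that was begun at zxid 1 but captured /prefix only after txn 4 had been applied, then replays txns 2 to 4 and tolerates the resulting NONODE errors.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

final class FuzzySnapshotReplaySketch {
    enum Result { OK, NONODE, NODEEXISTS }

    private final Set<String> nodes = new TreeSet<>();

    FuzzySnapshotReplaySketch(Set<String> snapshot) {
        nodes.addAll(snapshot);
    }

    Result create(String path) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        // An empty parent means the root, which always exists in this model.
        if (!parent.isEmpty() && !nodes.contains(parent)) return Result.NONODE;
        if (!nodes.add(path)) return Result.NODEEXISTS;
        return Result.OK;
    }

    Result delete(String path) {
        return nodes.remove(path) ? Result.OK : Result.NONODE;
    }

    /** Replays {"create"|"delete", path} txns; returns how many errors were ignored. */
    int replay(String[][] txns) {
        int ignored = 0;
        for (String[] txn : txns) {
            Result r = txn[0].equals("create") ? create(txn[1]) : delete(txn[1]);
            if (r != Result.OK) ignored++; // tolerated instead of aborting startup
        }
        return ignored;
    }

    Set<String> nodes() {
        return nodes;
    }

    public static void main(String[] args) {
        FuzzySnapshotReplaySketch tree =
                new FuzzySnapshotReplaySketch(Collections.singleton("/prefix"));
        int ignored = tree.replay(new String[][] {
            {"create", "/prefix/a/b"}, // zxid 2: parent /prefix/a is already gone
            {"delete", "/prefix/a/b"}, // zxid 3
            {"delete", "/prefix/a"},   // zxid 4
        });
        System.out.println(ignored + " errors ignored, tree = " + tree.nodes());
    }
}
```

In this model the replay converges to the same tree the txn log describes, which is the argument for ignoring these errors during restore rather than refusing to start.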
[jira] [Commented] (ZOOKEEPER-1573) Unable to load database due to missing parent node
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824934#comment-13824934 ] Raul Gutierrez Segales commented on ZOOKEEPER-1573: ---
Thanks for the quick update, Vinay. A few more nits:
In:
{noformat}
+ * Snapshots are lazily created. So when a snapshot is in progress,
+ * there is a chance for later transactions to make to into the snapshot.
+ * Then when the snapshot is restored, NONODE/NODEEXISTS errors
+ * could occur. It should be safe to ignore these.
{noformat}
I had a typo in my suggestion (sorry) - it should be: "...transactions to make into the snapshot."
In:
{noformat}
+LOG.debug("Ignoring processTxn failure hdr: {}, error: {}, path: {}", hdr.getType(), rc.err, rc.path);
{noformat}
the line is too long, maybe:
{noformat}
+LOG.debug("Ignoring processTxn failure hdr: {}, error: {}, path: {}",
+    hdr.getType(), rc.err, rc.path);
{noformat}
In:
{noformat}
+Assert.assertTrue("waiting for server being up ", ClientBase.waitForServerUp(HOSTPORT, CONNECTION_TIMEOUT));
{noformat}
the line is too long. And this line appears again later on.
Unable to load database due to missing parent node -- Key: ZOOKEEPER-1573 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1573 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3, 3.5.0 Reporter: Thawan Kooburat Attachments: ZOOKEEPER-1573.patch, ZOOKEEPER-1573.patch While replaying txnlog on data tree, the server has a code to detect missing parent node. This code block was last modified as part of ZOOKEEPER-1333. In our production, we found a case where this check is return false positive. The sequence of txns is as follows: zxid 1: create /prefix/a zxid 2: create /prefix/a/b zxid 3: delete /prefix/a/b zxid 4: delete /prefix/a The server start capturing snapshot at zxid 1. However, by the time it traversing the data tree down to /prefix, txn 4 is already applied and /prefix have no children. 
When the server restore from snapshot, it process txnlog starting from zxid 2. This txn generate missing parent error and the server refuse to start up. The same check allow me to discover bug in ZOOKEEPER-1551, but I don't know if we have any option beside removing this check to solve this issue. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824970#comment-13824970 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: ---
One more nit (sorry [~fpj]) in:
{noformat}
-return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")";
+return "(" + id + ", " +
+    Long.toHexString(zxid) +
+    ", " + Long.toHexString(peerEpoch) +
+    ")";
{noformat}
Should we encourage String.format instead of concatenation (as we do in LOG statements with {})? I think this is more readable:
{noformat}
-return "(" + id + ", " + Long.toHexString(zxid) + ", " + Long.toHexString(peerEpoch) + ")";
+return String.format("(%d, %s, %s)", id, Long.toHexString(zxid), Long.toHexString(peerEpoch));
{noformat}
What do you think?
Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch See umbrella jira. -- This message was sent by Atlassian JIRA (v6.1#6144)
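To compare the two styles side by side, here is a small self-contained sketch. VoteFormatSketch and its method names are hypothetical and stand in for the toString under review; the %x variant is an extra assumption of this sketch, not something proposed in the thread.

```java
final class VoteFormatSketch {
    /** Concatenation, as in the original code. */
    static String viaConcat(long id, long zxid, long peerEpoch) {
        return "(" + id + ", " + Long.toHexString(zxid) + ", "
                + Long.toHexString(peerEpoch) + ")";
    }

    /** String.format with explicit hex conversion, as suggested in the review. */
    static String viaFormat(long id, long zxid, long peerEpoch) {
        return String.format("(%d, %s, %s)", id,
                Long.toHexString(zxid), Long.toHexString(peerEpoch));
    }

    /** Variant letting the formatter do the hex conversion itself. */
    static String viaHexFormat(long id, long zxid, long peerEpoch) {
        return String.format("(%d, %x, %x)", id, zxid, peerEpoch);
    }

    public static void main(String[] args) {
        long zxid = 0x200000000L; // 8589934592, the zxid from the ZOOKEEPER-1653 log
        System.out.println(viaConcat(3, zxid, 2));
        System.out.println(viaFormat(3, zxid, 2));
        System.out.println(viaHexFormat(3, zxid, 2));
    }
}
```

All three produce the same text for non-negative values in common locales; the choice is purely about readability, with String.format being somewhat slower than plain concatenation.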
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13824972#comment-13824972 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: --- (of course, with the proper line wrap for 80 chars). Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch See umbrella jira. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1573) Unable to load database due to missing parent node
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13825014#comment-13825014 ] Raul Gutierrez Segales commented on ZOOKEEPER-1573: ---
Last nit (though feel free to ignore it since it refers to improving old code as well):
{noformat}
+
+long start = System.currentTimeMillis();
+while (!connected) {
+    long end = System.currentTimeMillis();
+    if (end - start > 5000) {
+        Assert.assertTrue("Could not connect with server in 5 seconds",
+            false);
+    }
+    try {
+        Thread.sleep(200);
+    } catch (Exception e) {
+        LOG.warn("Interrupted");
+    }
+}
{noformat}
this is copy/pasted for two other tests as well - can we move it to a method called waitConnected and call that instead? It'll make the tests shorter and more readable, I think.
Unable to load database due to missing parent node -- Key: ZOOKEEPER-1573 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1573 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.3, 3.5.0 Reporter: Thawan Kooburat Attachments: ZOOKEEPER-1573.patch, ZOOKEEPER-1573.patch, ZOOKEEPER-1573.patch While replaying txnlog on data tree, the server has a code to detect missing parent node. This code block was last modified as part of ZOOKEEPER-1333. In our production, we found a case where this check is return false positive. The sequence of txns is as follows: zxid 1: create /prefix/a zxid 2: create /prefix/a/b zxid 3: delete /prefix/a/b zxid 4: delete /prefix/a The server start capturing snapshot at zxid 1. However, by the time it traversing the data tree down to /prefix, txn 4 is already applied and /prefix have no children. When the server restore from snapshot, it process txnlog starting from zxid 2. This txn generate missing parent error and the server refuse to start up. The same check allow me to discover bug in ZOOKEEPER-1551, but I don't know if we have any option beside removing this check to solve this issue. 
-- This message was sent by Atlassian JIRA (v6.1#6144)
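One possible shape for the waitConnected helper suggested in the comment above, as a sketch only: the class name, the BooleanSupplier parameter and the boolean return value are assumptions of this example, not part of the patch or of ClientBase.

```java
import java.util.function.BooleanSupplier;

final class WaitConnectedSketch {
    /**
     * Polls the condition every pollMs milliseconds until it holds or
     * timeoutMs elapses. Returns true if the condition held in time.
     */
    static boolean waitConnected(BooleanSupplier connected, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!connected.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Becomes true roughly 300 ms after start.
        boolean ok = waitConnected(() -> System.currentTimeMillis() - start > 300, 5000, 200);
        System.out.println("connected in time: " + ok);
    }
}
```

A test could then replace the copy/pasted loop with a single assertion on the return value, for instance Assert.assertTrue("Could not connect with server in 5 seconds", waitConnected(() -> connected, 5000, 200)), where connected would be the test's own (ideally volatile) flag.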
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13826775#comment-13826775 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: --- Sorry guys I couldn't test this - don't have a 3.4 setup handy. Will do proper testing with trunk though (and of course, the nits in both cases ;-). And thanks for testing [~abranzyck]. Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, logs.tar.gz, logs2.tar.gz See umbrella jira. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827146#comment-13827146 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1817:
---

(As said before, I have only been testing the upstream version of this ticket.)

Some nits:

{noformat}
+        if ((state == ServerState.LOOKING) ||
+                (other.state == ServerState.LOOKING)) {
+            return (id == other.id && zxid == other.zxid && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch);
+        } else {
+            if (version == other.version) {
+                return (id == other.id
+                        && peerEpoch == other.peerEpoch);
+            } else {
+                return id == other.id;
+            }
+        }
+    }
{noformat}

could be simplified to:

{noformat}
+        if ((state == ServerState.LOOKING) ||
+                (other.state == ServerState.LOOKING)) {
+            return (id == other.id && zxid == other.zxid && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch);
+        } else if (version == other.version) {
+            return id == other.id && peerEpoch == other.peerEpoch;
+        }
+
+        return id == other.id;
+    }
{noformat}

In src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java:

{noformat}
+//import org.apache.zookeeper.server.quorum.QuorumCnxManager;
{noformat}

Just delete that line?

In src/java/test/org/apache/zookeeper/server/quorum/FLEDontCareTest.java, I guess testOutofElection is still a work in progress because of the commented-out code?

Fix don't care for b3.4
---
Key: ZOOKEEPER-1817
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817
Project: ZooKeeper
Issue Type: Sub-task
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
Priority: Blocker
Fix For: 3.4.6
Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, logs.tar.gz, logs2.tar.gz

See umbrella jira.
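As a sanity check on the suggested simplification, the two forms of the comparison can be placed side by side and compared exhaustively over a small grid of field values. The sketch below is only an illustration: VoteSketch is a hypothetical stand-in class, the field types (int version, long for the rest) and the three-value ServerState enum are assumptions, and only the field names come from the quoted snippet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the class under review; not the patched ZooKeeper class. */
final class VoteSketch {
    enum ServerState { LOOKING, FOLLOWING, LEADING }

    final int version;        // field types are assumptions
    final long id;
    final long zxid;
    final long electionEpoch;
    final long peerEpoch;
    final ServerState state;

    VoteSketch(int version, long id, long zxid, long electionEpoch,
               long peerEpoch, ServerState state) {
        this.version = version;
        this.id = id;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.peerEpoch = peerEpoch;
        this.state = state;
    }

    /** Nested if/else form, as quoted from the patch. */
    boolean equalsNested(VoteSketch other) {
        if ((state == ServerState.LOOKING) || (other.state == ServerState.LOOKING)) {
            return (id == other.id && zxid == other.zxid
                    && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch);
        } else {
            if (version == other.version) {
                return (id == other.id && peerEpoch == other.peerEpoch);
            } else {
                return id == other.id;
            }
        }
    }

    /** Flattened form, as suggested in the review comment. */
    boolean equalsFlat(VoteSketch other) {
        if ((state == ServerState.LOOKING) || (other.state == ServerState.LOOKING)) {
            return (id == other.id && zxid == other.zxid
                    && electionEpoch == other.electionEpoch && peerEpoch == other.peerEpoch);
        } else if (version == other.version) {
            return id == other.id && peerEpoch == other.peerEpoch;
        }
        return id == other.id;
    }

    /** Builds every vote whose numeric fields are 0 or 1, for each state. */
    static List<VoteSketch> allSmallVotes() {
        List<VoteSketch> votes = new ArrayList<>();
        for (int bits = 0; bits < 32; bits++) {
            for (ServerState state : ServerState.values()) {
                votes.add(new VoteSketch(bits & 1, (bits >> 1) & 1, (bits >> 2) & 1,
                        (bits >> 3) & 1, (bits >> 4) & 1, state));
            }
        }
        return votes;
    }

    /** Returns true if both forms give the same answer for every pair of small votes. */
    static boolean formsAgree() {
        List<VoteSketch> votes = allSmallVotes();
        for (VoteSketch a : votes) {
            for (VoteSketch b : votes) {
                if (a.equalsNested(b) != a.equalsFlat(b)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Since every comparison in both forms is a plain equality test, two values per field are enough to exercise each branch, so formsAgree() covers all the distinct paths.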
[jira] [Updated] (ZOOKEEPER-1787) Add support for enabling local session in rolling upgrade
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1787: -- Summary: Add support for enabling local session in rolling upgrade (was: Add support enabling local session in rolling upgrade) Add support for enabling local session in rolling upgrade - Key: ZOOKEEPER-1787 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1787 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Thawan Kooburat Priority: Minor Currently, local sessions need to be enabled by stopping the entire ensemble. If a rolling upgrade is used, all write requests from a local session will fail with a session moved error until local sessions are enabled on the leader.
[jira] [Updated] (ZOOKEEPER-1787) Add support for enabling local session in rolling upgrade
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raul Gutierrez Segales updated ZOOKEEPER-1787: -- Attachment: ZOOKEEPER-1787.patch With this patch you can use -Dzookeeper.skipSessionValidation=yes to enable local sessions with a rolling upgrade. We should probably add some documentation as well, but let's agree on this patch first. Add support for enabling local session in rolling upgrade - Key: ZOOKEEPER-1787 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1787 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Thawan Kooburat Priority: Minor Attachments: ZOOKEEPER-1787.patch Currently, local sessions need to be enabled by stopping the entire ensemble. If a rolling upgrade is used, all write requests from a local session will fail with a session moved error until local sessions are enabled on the leader.
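The message does not show how the patch reads the flag, so the following is only a guess at the general shape: a minimal sketch of checking a JVM system property set with -Dzookeeper.skipSessionValidation=yes. The class name SkipSessionValidationFlag, the default of "no" and the exact-match comparison against "yes" are assumptions; only the property name and the "yes" value come from the message above.

```java
/** Hypothetical helper; the attached patch may read the property differently. */
final class SkipSessionValidationFlag {
    /** Property name as given in the comment on ZOOKEEPER-1787. */
    static final String PROPERTY = "zookeeper.skipSessionValidation";

    private SkipSessionValidationFlag() {
    }

    /** Returns true only when the JVM was started with the property set to "yes". */
    static boolean isEnabled() {
        // Assumption: any value other than "yes", or no value at all, leaves the flag off.
        return "yes".equals(System.getProperty(PROPERTY, "no"));
    }
}
```

Reading the property once at startup and keeping the result in a field would avoid repeated lookups on the request path, but whether the patch does that is not stated here.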
[jira] [Commented] (ZOOKEEPER-602) log all exceptions not caught by ZK threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829232#comment-13829232 ]

Raul Gutierrez Segales commented on ZOOKEEPER-602:
--

Two nits:

{noformat}
+    class LearnerCnxAcceptor extends ZooKeeperThread{
{noformat}

A space is missing before the brace (it should read "ZooKeeperThread {").

Typo:

{noformat}
+            // When there is no worker thread pool, do the work directly
+            // and waiting for its completion
{noformat}

This should read "and wait for its completion".

log all exceptions not caught by ZK threads
---
Key: ZOOKEEPER-602
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-602
Project: ZooKeeper
Issue Type: Bug
Components: java client, server
Affects Versions: 3.2.1
Reporter: Patrick Hunt
Assignee: Rakesh R
Priority: Critical
Fix For: 3.4.6, 3.5.0
Attachments: ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch

The Java code should add a ThreadGroup exception handler that logs at ERROR level any uncaught exceptions thrown by Thread run methods.
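To illustrate the idea in the issue description, here is a minimal sketch of a thread base class that logs uncaught exceptions. It is not the ZooKeeperThread from the attached patches: the class name LoggingThread, the lastUncaught accessor and the use of java.util.logging (where SEVERE stands in for ERROR) are assumptions made to keep the example self-contained. It also installs a per-thread handler rather than the ThreadGroup handler the description mentions.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical base class; illustrative only, not the ZooKeeperThread in the patch. */
class LoggingThread extends Thread {
    private static final Logger LOG = Logger.getLogger(LoggingThread.class.getName());

    /** Last uncaught exception seen by this thread, kept so callers can inspect it. */
    private volatile Throwable lastUncaught;

    LoggingThread(Runnable target, String name) {
        super(target, name);
        // Per-thread handler: invoked by the JVM when run() exits with an exception.
        setUncaughtExceptionHandler(this::handleException);
    }

    /** Logs the exception at the most severe level and remembers it. */
    protected void handleException(Thread thread, Throwable e) {
        lastUncaught = e;
        LOG.log(Level.SEVERE, "Exception occurred from thread " + thread.getName(), e);
    }

    /** Returns the last uncaught exception, or null if the thread ended normally. */
    Throwable lastUncaught() {
        return lastUncaught;
    }
}
```

A server thread would extend LoggingThread instead of Thread, so an exception escaping run() ends up in the log instead of only on stderr. Subclasses that need a stronger reaction, such as shutting the server down, could override handleException.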
[jira] [Commented] (ZOOKEEPER-1817) Fix don't care for b3.4
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830155#comment-13830155 ] Raul Gutierrez Segales commented on ZOOKEEPER-1817: --- Even though I didn't actually test this, it looks correct to me (and nit-less ;-) ) - so +1. Will do proper testing for https://issues.apache.org/jira/browse/ZOOKEEPER-1818 once https://issues.apache.org/jira/browse/ZOOKEEPER-1810 lands. Fix don't care for b3.4 --- Key: ZOOKEEPER-1817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1817 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Priority: Blocker Fix For: 3.4.6 Attachments: ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, ZOOKEEPER-1817.patch, logs.tar.gz, logs2.tar.gz See umbrella jira.