[jira] [Commented] (ZOOKEEPER-3106) Zookeeper client supports IPv6 address and document the "IPV6 feature"
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605227#comment-16605227 ] Hadoop QA commented on ZOOKEEPER-3106: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2124//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2124//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2124//console This message is automatically generated. > Zookeeper client supports IPv6 address and document the "IPV6 feature" > -- > > Key: ZOOKEEPER-3106 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3106 > Project: ZooKeeper > Issue Type: Improvement > Components: documentation, java client >Reporter: maoling >Assignee: maoling >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > This issue is the follow-up work of > [ZOOKEEPER-3057|https://issues.apache.org/jira/browse/ZOOKEEPER-3057] > 1. The ZK server side supports an IPv6 style like this: > server.1=[2001:db8:1::242:ac11:2]:2888:3888, but the ZK client side supports IPv6 > like this: 2001:db8:1::242:ac11:2:2181. We need to unify them. > Look at the Kafka example > [KAFKA-1123|https://issues.apache.org/jira/browse/KAFKA-1123]. 
its producer > client also supports IPv6 like this: [2001:db8:1::242:ac11:2]. > 2. Document the "IPv6 feature" to let users know. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
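The bracketed form discussed above is unambiguous because the host/port separator is the colon after the closing bracket, whereas in the unbracketed form `2001:db8:1::242:ac11:2:2181` the split point cannot be determined. A minimal sketch of the idea (a hypothetical helper, not ZooKeeper's actual parser):

```java
// Hypothetical helper showing why the bracketed IPv6 form can be split
// unambiguously while a plain host:port split uses the last colon.
public class HostPort {
    public static String[] split(String addr) {
        if (addr.startsWith("[")) {
            int end = addr.indexOf(']');
            if (end < 0 || end + 1 >= addr.length() || addr.charAt(end + 1) != ':') {
                throw new IllegalArgumentException("Malformed bracketed address: " + addr);
            }
            // Host is everything inside the brackets; port follows "]:".
            return new String[] { addr.substring(1, end), addr.substring(end + 2) };
        }
        // IPv4 or hostname: split at the last colon.
        int last = addr.lastIndexOf(':');
        return new String[] { addr.substring(0, last), addr.substring(last + 1) };
    }
}
```

With this convention, client connect strings and `server.N` entries can share the same parsing rule.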
[jira] [Commented] (ZOOKEEPER-3090) change continue to break
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605072#comment-16605072 ] Hudson commented on ZOOKEEPER-3090: --- SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #178 (See [https://builds.apache.org/job/ZooKeeper-trunk/178/]) ZOOKEEPER-3090: continue can be replaced with break (hanm: rev 0b65e3d4cbb7647ea692bfea59c02e5b24bf821c) * (edit) src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java > change continue to break > > > Key: ZOOKEEPER-3090 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3090 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Reporter: zhangbo >Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > It is sufficient to change continue to break, especially in the call to > getLogFiles(logDir.listFiles(), 0) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
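The change is safe because the loop scans entries in sorted order: once one entry is past the threshold, no later entry can match, so `break` produces the same result as `continue` while skipping the pointless remaining iterations. A simplified illustration of the pattern (not the actual FileTxnLog code):

```java
import java.util.ArrayList;
import java.util.List;

public class SortedScan {
    // Collect values <= limit from an ascending-sorted array.
    // Once a value exceeds the limit, no later value can qualify,
    // so break is equivalent to continue but avoids useless iterations.
    public static List<Long> collectUpTo(long[] sorted, long limit) {
        List<Long> out = new ArrayList<>();
        for (long v : sorted) {
            if (v > limit) {
                break; // was "continue" -- identical result on sorted input
            }
            out.add(v);
        }
        return out;
    }
}
```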
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 And yes, as others commented, please rebase, resolve conflicts, and reply to previously unaddressed comments. My previous comments that still need addressing are: * Existing unit test failures and one findbugs issue. * Are more types of metrics going to be added, such as Gauge? * It would be good to have some tests around the system. ---
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 One question about gathering metrics using this system: let's say I have a use case where I want to gather the number of requests processed by FinalRequestProcessor for each request type (e.g. Ping, CreateSession, and other OpCode types). How would I do this with this metrics system? I think I have to add a new enum value in ServerMetrics for each OpCode; is this correct? ---
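The counter-per-enum design the comment describes can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual ServerMetrics code; the enum names and helper methods are invented for the example:

```java
import java.util.concurrent.atomic.AtomicLong;

public class RequestMetrics {
    // One enum constant per tracked request type, mirroring the
    // "one ServerMetrics entry per OpCode" approach under discussion.
    public enum Op { PING, CREATE_SESSION, CREATE, DELETE, SET_DATA }

    // One lock-free counter per enum constant.
    private static final AtomicLong[] counts = new AtomicLong[Op.values().length];
    static {
        for (int i = 0; i < counts.length; i++) {
            counts[i] = new AtomicLong();
        }
    }

    public static void record(Op op) {
        counts[op.ordinal()].incrementAndGet();
    }

    public static long get(Op op) {
        return counts[op.ordinal()].get();
    }
}
```

A request processor would then call `RequestMetrics.record(...)` once per handled request, keyed by the request's OpCode.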
[GitHub] zookeeper issue #580: ZOOKEEPER-3098: Add additional server metrics
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/580 We at Twitter manage large ZK clusters and we also have our own internal metrics system (you can't really operate something like ZK at large scale without metrics). Our design is similar to this one in terms of metrics collection and code instrumentation, but we coupled metrics collection with publication. We publish metrics through Finagle (Twitter's RPC library), and the tight coupling caused several issues, such as circular dependencies between Finagle and ZooKeeper: ZK depends on Finagle for metrics, and Finagle depends on ZK for ServerSets (name resolution). I think the current design proposed in the community would solve this problem. Basically we will have two systems: * The ZooKeeper metrics system for metrics collection. * The pluggable system @eolivelli is working on as the backend for the metrics. So IMO both efforts are orthogonal and can be done in parallel. And I agree with the design decisions @jtuple proposed and @eolivelli commented on previously. Regarding using an external metrics type system: I think it is more flexible for us to define and own the metrics definitions, so it's easy to add to or change the system without depending on third parties. ---
[GitHub] zookeeper issue #612: [ZOOKEEPER-3131] Remove watcher when session closed in...
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/612 Closing and reopening the pull request can also trigger a build, which might be more convenient than amending the commit. For committers or those who have admin access to Apache JIRA: just find the job associated with the PR and click the "Rebuild" button. ---
[GitHub] zookeeper pull request #573: ZOOKEEPER-3090 continue can be replaced with br...
Github user asfgit closed the pull request at: https://github.com/apache/zookeeper/pull/573 ---
[GitHub] zookeeper issue #615: ZOOKEEPER-3137: add a utility to truncate logs to a zx...
Github user enixon commented on the issue: https://github.com/apache/zookeeper/pull/615 @anmolnar , you can find a trial integration of the chop function with TxnLogToolkit at https://github.com/enixon/zookeeper/commit/b71ce04f70873f590c6cf5c24a691a1f1af82f48 . It's not the cleanest integration since there's not much direct overlap between the dump method and the chop method but I think we can make some useful decisions here that will help centralize work on the tool for future patches. Let me know what you think! ---
[GitHub] zookeeper issue #573: ZOOKEEPER-3090 continue can be replaced with break
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/573 I manually triggered a build and it's green now: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2122/ GitHub integration is still broken for this PR for some reason. This has been reported before, but I don't know why it's broken. I am committing this as it's pretty obvious and has already been reviewed by multiple folks. ---
[GitHub] zookeeper issue #615: ZOOKEEPER-3137: add a utility to truncate logs to a zx...
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/615 Integrating into TxnLogToolkit is a good idea. I did not mention it as I thought that might require a separate JIRA: LogFormatter can also be integrated (it's the least powerful tool of the three) and then deprecated, and the current location of TxnLogToolkit is a little odd; IMO server/util makes more sense than server/persistence. ---
[jira] [Commented] (ZOOKEEPER-3137) add a utility to truncate logs to a zxid
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605042#comment-16605042 ] Hadoop QA commented on ZOOKEEPER-3137: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 2 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2123//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2123//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2123//console This message is automatically generated. > add a utility to truncate logs to a zxid > > > Key: ZOOKEEPER-3137 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3137 > Project: ZooKeeper > Issue Type: New Feature >Affects Versions: 3.6.0 >Reporter: Brian Nixon >Priority: Trivial > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Add a utility that allows an admin to truncate a given transaction log to a > specified zxid. This can be similar to the existing LogFormatter. > Among the benefits, this allows an admin to put together a point-in-time view > of a data tree by manually mutating files from a saved backup. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] zookeeper issue #615: ZOOKEEPER-3137: add a utility to truncate logs to a zx...
Github user enixon commented on the issue: https://github.com/apache/zookeeper/pull/615 @hanm, I've made your recommended changes with the exception noted in my comment. @anmolnar , looking at TxnLogToolkit now - I think this should be easy to integrate :) ---
[GitHub] zookeeper pull request #615: ZOOKEEPER-3137: add a utility to truncate logs ...
Github user enixon commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/615#discussion_r215446231

--- Diff: src/java/main/org/apache/zookeeper/server/util/LogChopper.java ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.zookeeper.server.util;
+
+import org.apache.jute.BinaryInputArchive;
+import org.apache.jute.BinaryOutputArchive;
+import org.apache.jute.Record;
+import org.apache.zookeeper.server.persistence.FileHeader;
+import org.apache.zookeeper.server.persistence.FileTxnLog;
+import org.apache.zookeeper.txn.TxnHeader;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedOutputStream;
+import java.io.EOFException;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.util.zip.Adler32;
+import java.util.zip.Checksum;
+
+/**
+ * this class will chop the log at the specified zxid
+ */
+public class LogChopper {
+    public static void main(String args[]) {
+        if (args.length != 3) {
+            System.out.println("Usage: LogChopper zxid_to_chop_to txn_log_to_chop chopped_filename");
+            System.out.println("this program will read the txn_log_to_chop file and copy all the transactions");
+            System.out.println("from it up to (and including) the given zxid into chopped_filename.");
+            System.exit(1);
+        }
+        long zxid = Long.decode(args[0]);
+        String txnLog = args[1];
+        String choppedLog = args[2];
+
+        int rc = 2;
+        try (
+            InputStream is = new BufferedInputStream(new FileInputStream(txnLog));
+            OutputStream os = new BufferedOutputStream(new FileOutputStream(choppedLog))
+        ) {
+            if (chop(is, os, zxid)) {
+                rc = 0;
+            }
+        } catch (Exception e) {
+            System.out.println("Got exception: " + e.getMessage());
+        }
+        System.exit(rc);
+    }
+
+    public static boolean chop(InputStream is, OutputStream os, long zxid) throws IOException {
+        BinaryInputArchive logStream = BinaryInputArchive.getArchive(is);
+        BinaryOutputArchive choppedStream = BinaryOutputArchive.getArchive(os);
+        FileHeader fhdr = new FileHeader();
+        fhdr.deserialize(logStream, "fileheader");
+
+        if (fhdr.getMagic() != FileTxnLog.TXNLOG_MAGIC) {
+            System.err.println("Invalid magic number in txn log file");
+            return false;
+        }
+        System.out.println("ZooKeeper Transactional Log File with dbid "
+                + fhdr.getDbid() + " txnlog format version "
+                + fhdr.getVersion());
+
+        fhdr.serialize(choppedStream, "fileheader");
+        int count = 0;
+        boolean hasZxid = false;
+        long previousZxid = -1;
+        while (true) {
+            long crcValue;
+            byte[] bytes;
+            try {
+                crcValue = logStream.readLong("crcvalue");
+
+                bytes = logStream.readBuffer("txnEntry");
+            } catch (EOFException e) {
+                System.out.println("EOF reached after " + count + " txns.");
+                // returning false because nothing was chopped
+                return false;
+            }
+            if (bytes.length == 0) {
+                // Since we preallocate, we define EOF to be an
+                // empty transaction
+                System.out.println("EOF reached after " + count + " txns.");
+                // returning false because nothing was chopped
+                return false;
+            }
+
+            Checksum crc = new Adler32();
+            crc.update(bytes, 0, bytes.length);
+            if
[jira] [Commented] (ZOOKEEPER-3127) Fixing potential data inconsistency due to update last processed zxid with partial multi-op txn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605011#comment-16605011 ] Hudson commented on ZOOKEEPER-3127: --- SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #177 (See [https://builds.apache.org/job/ZooKeeper-trunk/177/]) ZOOKEEPER-3127: Fixing potential data inconsistency due to update last (hanm: rev e501d9cc67fbaa6e825292fd838711259b6c9789) * (edit) src/java/main/org/apache/zookeeper/server/ZKDatabase.java * (edit) src/java/main/org/apache/zookeeper/server/DataTree.java * (edit) src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java * (add) src/java/test/org/apache/zookeeper/server/quorum/FuzzySnapshotRelatedTest.java > Fixing potential data inconsistency due to update last processed zxid with > partial multi-op txn > --- > > Key: ZOOKEEPER-3127 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3127 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.4, 3.6.0, 3.4.13 >Reporter: Fangmin Lv >Assignee: Fangmin Lv >Priority: Critical > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Found this issue while checking the code for another issue; this is a > relatively rare case which we haven't seen on prod so far. > Currently, the lastProcessedZxid is updated when applying the first txn of a > multi-op; if there is a snapshot in progress, it's possible that the zxid > associated with the snapshot only includes part of the multi op. > When loading a snapshot, it will only load the txns after the zxid associated > with the snapshot file, which could cause data inconsistency due to missing sub txns. > To avoid this, we only update the lastProcessedZxid when the whole multi-op > txn is applied to the DataTree. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
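The fix described in the issue can be illustrated with a toy apply loop. If the marker is advanced on the first sub-txn, a concurrent snapshot can claim a zxid whose changes are only partially applied; advancing it once, after the last sub-txn, keeps the marker consistent. This is a simplified, hypothetical sketch, not the ZKDatabase/DataTree code:

```java
import java.util.List;

public class MultiOpApply {
    long lastProcessedZxid = -1;
    int applied = 0;

    // Apply every sub-txn of a multi-op sharing one zxid, and only then
    // advance lastProcessedZxid -- the ordering ZOOKEEPER-3127 moves to.
    // (The buggy ordering set lastProcessedZxid before the loop finished.)
    void applyMulti(long zxid, List<Runnable> subTxns) {
        for (Runnable txn : subTxns) {
            txn.run();
            applied++;
        }
        lastProcessedZxid = zxid; // updated once, after the whole multi-op
    }
}
```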
[jira] [Commented] (ZOOKEEPER-3129) Improve ZK Client resiliency by throwing a jute.maxbuffer size exception before sending a request to server
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16605005#comment-16605005 ] Karan Mehta commented on ZOOKEEPER-3129: Thoughts anyone? > Improve ZK Client resiliency by throwing a jute.maxbuffer size exception > before sending a request to server > --- > > Key: ZOOKEEPER-3129 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3129 > Project: ZooKeeper > Issue Type: Improvement >Reporter: Karan Mehta >Priority: Major > > Zookeeper is mostly operated in controlled environments and the client/server > properties are usually known. With this Jira, I would like to propose a new > property on the client side that represents the max jute buffer size the server is > going to accept. > On the ZKClient, in the case of a multi op, the request is serialized and hence we > know the size of the complete packet that will be sent. We can use this new > property to determine if we are exceeding the limit and throw some form > of KeeperException. This would be a fail-fast mechanism and the application can > potentially retry by chunking up the request or serializing. > Since the same property is now present in two locations, over time, two > possibilities can happen. > -- The server's jute buffer accepted value is more than what is specified on the client > side: the application might end up serializing it, or zkclient can be made > configurable to retry even when it gets this exception. > -- The server's jute buffer accepted value is lower than what is specified on the client > side: that would have failed previously as well, so there is no change in behavior. > This would help avoid silent failures like HBASE-18549. > Thoughts [~apurtell] [~xucang] [~anmolnar] [~hanm] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
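The proposed fail-fast check could look like the following sketch: serialize the request, compare its size against a client-side copy of the server's jute.maxbuffer, and throw before anything goes on the wire. All names here (the system property, class, and method) are hypothetical illustrations, not the actual ZooKeeper client API:

```java
public class ClientBufferCheck {
    // Hypothetical client-side mirror of the server's jute.maxbuffer
    // (4 MB is a common illustrative default, overridable via a property).
    static final int DEFAULT_MAX_BUFFER =
        Integer.getInteger("zookeeper.client.maxbuffer", 4 * 1024 * 1024);

    // Fail fast before sending: for a multi op the serialized request is
    // already in hand, so its exact size is known on the client.
    public static void checkSize(byte[] serialized, int maxBuffer) {
        if (serialized.length > maxBuffer) {
            throw new IllegalArgumentException("Request of " + serialized.length
                + " bytes exceeds client-side max of " + maxBuffer);
        }
    }
}
```

The application catching this exception could then split the multi op into smaller batches and retry, as the issue suggests.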
[GitHub] zookeeper issue #612: [ZOOKEEPER-3131] Remove watcher when session closed in...
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/612 Is there any easy way to trigger the build? It's a bit painful to trigger it by adding an empty commit. ---
[jira] [Resolved] (ZOOKEEPER-3111) Add socket buffer size option to tune the TCP throughput between leader and learner
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangmin Lv resolved ZOOKEEPER-3111. --- Resolution: Not A Problem > Add socket buffer size option to tune the TCP throughput between leader and > learner > --- > > Key: ZOOKEEPER-3111 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3111 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Reporter: Fangmin Lv >Assignee: Fangmin Lv >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Add a socket setting to let us tune the TCP receive and send window > size, to improve the throughput during network congestion or when transferring a > large snapshot during syncing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3111) Add socket buffer size option to tune the TCP throughput between leader and learner
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604962#comment-16604962 ] Fangmin Lv commented on ZOOKEEPER-3111: --- Commented in the PR; also mentioning it here for easier reference: the performance of different buffer sizes depends highly on the traffic pattern, and a custom size may actually introduce higher latency, so it's hard to give a suggested buffer size. Internally, we're not customizing the buffer size with this option anymore; instead we're using the system TCP auto-tuning options net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, which are more flexible and automatically adjust socket buffer sizes as needed to optimally balance TCP performance and memory usage. Since the option is hard to configure correctly, we decided to drop this patch. > Add socket buffer size option to tune the TCP throughput between leader and > learner > --- > > Key: ZOOKEEPER-3111 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3111 > Project: ZooKeeper > Issue Type: Improvement > Components: server >Reporter: Fangmin Lv >Assignee: Fangmin Lv >Priority: Minor > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Add a socket setting to let us tune the TCP receive and send window > size, to improve the throughput during network congestion or when transferring a > large snapshot during syncing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
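The auto-tuning alternative mentioned above is configured system-wide on Linux via sysctl. The values below are illustrative only, not a recommendation from this thread:

```shell
# Illustrative Linux TCP auto-tuning settings (min / default / max, in bytes).
# Tune these for your own memory budget and traffic; requires root.
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Note: explicitly setting SO_RCVBUF/SO_SNDBUF on a socket disables
# auto-tuning for that socket on Linux, which is one more reason to
# prefer system-wide settings over a per-socket ZooKeeper option.
```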
[GitHub] zookeeper pull request #593: [ZOOKEEPER-3111] Add socket buffer size option ...
Github user lvfangmin closed the pull request at: https://github.com/apache/zookeeper/pull/593 ---
[GitHub] zookeeper issue #606: [ZOOKEEPER-3127] Fixing potential data inconsistency d...
Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/606 committed to master @lvfangmin @breed ---
[jira] [Resolved] (ZOOKEEPER-3127) Fixing potential data inconsistency due to update last processed zxid with partial multi-op txn
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Han resolved ZOOKEEPER-3127. Resolution: Fixed Fix Version/s: (was: 3.4.14) (was: 3.5.5) Issue resolved by pull request 606 [https://github.com/apache/zookeeper/pull/606] > Fixing potential data inconsistency due to update last processed zxid with > partial multi-op txn > --- > > Key: ZOOKEEPER-3127 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3127 > Project: ZooKeeper > Issue Type: Bug > Components: server >Affects Versions: 3.5.4, 3.6.0, 3.4.13 >Reporter: Fangmin Lv >Assignee: Fangmin Lv >Priority: Critical > Labels: pull-request-available > Fix For: 3.6.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Found this issue while checking the code for another issue; this is a > relatively rare case which we haven't seen on prod so far. > Currently, the lastProcessedZxid is updated when applying the first txn of a > multi-op; if there is a snapshot in progress, it's possible that the zxid > associated with the snapshot only includes part of the multi op. > When loading a snapshot, it will only load the txns after the zxid associated > with the snapshot file, which could cause data inconsistency due to missing sub txns. > To avoid this, we only update the lastProcessedZxid when the whole multi-op > txn is applied to the DataTree. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] zookeeper pull request #606: [ZOOKEEPER-3127] Fixing potential data inconsis...
Github user asfgit closed the pull request at: https://github.com/apache/zookeeper/pull/606 ---
[GitHub] zookeeper issue #590: [ZOOKEEPER-1177] Add the memory optimized watch manage...
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/590 Thanks @nkalmar for the suggestion on the bench project's location; I was following the same directory layout as src/java/systest for now. Do you think we can move them together later? I'm not against moving it now. If we do move now, just to confirm: the directory is src/java/org/apache/zookeeper/bench/ without a main folder, right? ---
[GitHub] zookeeper issue #593: [ZOOKEEPER-3111] Add socket buffer size option to tune...
Github user lvfangmin commented on the issue: https://github.com/apache/zookeeper/pull/593 @maoling, sorry for the late reply, I forgot to reply to this PR. We did some benchmarking when we added this code a year ago; I was trying to find some benchmark data from back then, but I cannot find it yet. I've done some benchmarking internally today, and found that the performance of different buffer sizes depends highly on the traffic pattern, and a custom size may actually introduce higher latency, so it's hard to give a suggested buffer size. But I don't think the extra negotiation time during connecting is a problem here, since we only set the buffer size between quorum servers, which is very few connections and they are stable. Internally, we're not customizing the buffer size with this option anymore; instead we're using the system TCP auto-tuning options net.ipv4.tcp_wmem and net.ipv4.tcp_rmem, which are more flexible and will automatically adjust socket buffer sizes as needed to optimally balance TCP performance and memory usage. **As for the unit test**, the SO_SNDBUF and SO_RCVBUF options are used by the platform's networking code as a hint for the size of the underlying network I/O buffers, so the OS doesn't always guarantee it will use what you told it to. So I think this test is not that useful from this point of view. Overall, I think we probably don't need this custom option anymore because it's hard to configure right. ---
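The "hint" behavior described above is easy to demonstrate with the standard `java.net.Socket` API: the value read back from `getReceiveBufferSize()` need not equal what was requested, since the OS may round, double, or clamp it. A small demo sketch (the class name is invented for illustration):

```java
import java.io.IOException;
import java.net.Socket;

public class BufferHintDemo {
    // Request a receive buffer size and return what the OS actually granted.
    // SO_RCVBUF is only a hint: the kernel may round, double, or clamp the
    // value, which is why asserting an exact size in a unit test is fragile.
    public static int requestedVsActual(int requested) throws IOException {
        try (Socket s = new Socket()) { // unconnected socket is enough
            s.setReceiveBufferSize(requested);
            return s.getReceiveBufferSize();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("requested=65536 actual=" + requestedVsActual(64 * 1024));
    }
}
```

On Linux, for example, the granted value is typically about double the request and capped by net.core.rmem_max, so the two numbers rarely match exactly.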
[jira] [Commented] (ZOOKEEPER-2913) testEphemeralNodeDeletion is flaky
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604742#comment-16604742 ] Hudson commented on ZOOKEEPER-2913: --- SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #176 (See [https://builds.apache.org/job/ZooKeeper-trunk/176/]) ZOOKEEPER-2913: testEphemeralNodeDeletion is flaky (andor: rev e7ac12c952b47921bfecbad52112be51f5b9ede5) * (edit) src/java/test/org/apache/zookeeper/server/quorum/EphemeralNodeDeletionTest.java > testEphemeralNodeDeletion is flaky > -- > > Key: ZOOKEEPER-2913 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2913 > Project: ZooKeeper > Issue Type: Bug > Components: tests >Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.6.0 >Reporter: Patrick Hunt >Assignee: maoling >Priority: Major > Labels: flaky, pull-request-available > Fix For: 3.6.0, 3.5.5 > > Time Spent: 1.5h > Remaining Estimate: 0h > > testEphemeralNodeDeletion is showing up as flakey across a number of jobs. > 1.https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html > 2.https://builds.apache.org/job/ZooKeeper_branch34_java9/305/testReport/junit/org.apache.zookeeper.server.quorum/EphemeralNodeDeletionTest/testEphemeralNodeDeletion/ > After session close ephemeral node must be deleted -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-3138) Potential race condition with Quorum Peer mutual authentication via SASL
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Grzybek updated ZOOKEEPER-3138: Description: I'm in the process of reconfiguring the ensemble to use mutual quorum peer authentication using SASL (ZOOKEEPER-1045). In order to understand the impact on my code, I've checked _how it works_. Now I'm running & debugging {{org.apache.zookeeper.server.quorum.auth.QuorumDigestAuthTest#testValidCredentials()}} test case. I have now six threads (3 peers contacting each other): * "QuorumConnectionThread-[myid=0]-2@1483" prio=5 tid=0x2b nid=NA runnable * "QuorumConnectionThread-[myid=0]-3@1491" prio=5 tid=0x36 nid=NA runnable * "QuorumConnectionThread-[myid=1]-1@1481" prio=5 tid=0x2d nid=NA runnable * "QuorumConnectionThread-[myid=1]-4@1505" prio=5 tid=0x3c nid=NA runnable * "QuorumConnectionThread-[myid=2]-2@1495" prio=5 tid=0x37 nid=NA runnable * "QuorumConnectionThread-[myid=2]-4@1506" prio=5 tid=0x3d nid=NA runnable at this point of invocation: {noformat} java.lang.Thread.State: RUNNABLE at org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handleNameCallback(SaslQuorumServerCallbackHandler.java:101) at org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handle(SaslQuorumServerCallbackHandler.java:82) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthServer.authenticate(SaslQuorumAuthServer.java:100) at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:467) at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:386) at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReceiverThread.run(QuorumCnxManager.java:422) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {noformat} which is this line: {code:java} private void handleNameCallback(NameCallback nc) { // check to see if this user is in the user password database. if (credentials.get(nc.getDefaultName()) == null) { LOG.warn("User '{}' not found in list of DIGEST-MD5 authenticateable users.", nc.getDefaultName()); return; } nc.setName(nc.getDefaultName()); /* >>> */ userName = nc.getDefaultName(); /* <<< */ } {code} each pair of threads is operating on a single instance of {{org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler#userName}}. In the stack trace we have both shared and local variables/fields: * {{o.a.z.server.quorum.QuorumCnxManager.QuorumConnectionReceiverThread#sock}} is thread-specific (ok) * {{o.a.z.server.quorum.QuorumCnxManager#authServer}} is peer-specific (an instance of {{o.a.z.server.quorum.auth.SaslQuorumAuthServer}}) but has no state that changes * {{javax.security.sasl.SaslServer}} is thread-specific (ok) - this instance is created to handle SASL authentication, but it is created using a peer-specific JAAS subject (which is ok) and the peer-specific {{o.a.z.server.quorum.auth.SaslQuorumAuthServer#serverLogin.callbackHandler}} {color:red}which is potentially a problem{color} Each (of the six) threads handles a different connection, but each pair (for a given QuorumPeer) calls {{o.a.z.server.quorum.auth.SaslQuorumServerCallbackHandler#handleNameCallback()}}, which modifies the shared (peer-specific) field - {{userName}}. I understand that [according to the example from the Wiki|https://cwiki.apache.org/confluence/display/ZOOKEEPER/Server-Server+mutual+authentication] all peers may use the same credentials (in the simplest case). 
But the "userName" comes from data sent by each peer, like this: {noformat} charset=utf-8,\ username="test",\ realm="zk-quorum-sasl-md5",\ nonce="iBqYWtaCrEE013S6Dv6xiOsR9uX2l/qKZcEZ1pm2",\ nc=0001,\ cnonce="LVaL9XYFjNxVBPCjPewXjEBsj9GuwIfBN/RXsKt5",\ digest-uri="zookeeper-quorum/zk-quorum-sasl-md5",\ maxbuf=65536,\ response=dd4e9e2115ec2e304484d5191f3fc771,\ qop=auth,\ authzid="test" {noformat} *And I can imagine a JAAS configuration for the DIGEST-MD5 SASL algorithm where each peer uses its own credentials and is able to validate other peers' specific credentials*: {noformat} QuorumServer { org.apache.zookeeper.server.auth.DigestLoginModule required user_peer1="peer1"; user_peer2="peer2"; user_peer3="peer3"; }; QuorumLearner1 { org.apache.zookeeper.server.auth.DigestLoginModule required username="peer1" password="peer1"; }; ... QuorumLearner2 {
[jira] [Updated] (ZOOKEEPER-3138) Potential race condition with Quorum Peer mutual authentication via SASL
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Grzybek updated ZOOKEEPER-3138: Description: I'm in the process of reconfiguring the ensemble to use mutual quorum peer authentication via SASL (ZOOKEEPER-1045). In order to understand the impact on my code, I've checked _how it works_. I'm now running & debugging the {{org.apache.zookeeper.server.quorum.auth.QuorumDigestAuthTest#testValidCredentials()}} test case. I have six threads (3 peers contacting each other):
* "QuorumConnectionThread-[myid=0]-2@1483" prio=5 tid=0x2b nid=NA runnable
* "QuorumConnectionThread-[myid=0]-3@1491" prio=5 tid=0x36 nid=NA runnable
* "QuorumConnectionThread-[myid=1]-1@1481" prio=5 tid=0x2d nid=NA runnable
* "QuorumConnectionThread-[myid=1]-4@1505" prio=5 tid=0x3c nid=NA runnable
* "QuorumConnectionThread-[myid=2]-2@1495" prio=5 tid=0x37 nid=NA runnable
* "QuorumConnectionThread-[myid=2]-4@1506" prio=5 tid=0x3d nid=NA runnable
at this point of invocation:
{noformat}
java.lang.Thread.State: RUNNABLE
    at org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handleNameCallback(SaslQuorumServerCallbackHandler.java:101)
    at org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handle(SaslQuorumServerCallbackHandler.java:82)
    at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
    at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
    at org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthServer.authenticate(SaslQuorumAuthServer.java:100)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:467)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:386)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReceiverThread.run(QuorumCnxManager.java:422)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
which is this line:
{code:java}
private void handleNameCallback(NameCallback nc) {
    // check to see if this user is in the user password database.
    if (credentials.get(nc.getDefaultName()) == null) {
        LOG.warn("User '{}' not found in list of DIGEST-MD5 authenticateable users.", nc.getDefaultName());
        return;
    }
    nc.setName(nc.getDefaultName());
    /* >>> */ userName = nc.getDefaultName(); /* <<< */
}
{code}
Each pair of threads is operating on a single instance of {{org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler#userName}}. In the stack trace we have both shared and local variables/fields:
* {{o.a.z.server.quorum.QuorumCnxManager.QuorumConnectionReceiverThread#sock}} is thread-specific (ok)
* {{o.a.z.server.quorum.QuorumCnxManager#authServer}} is peer-specific (an instance of {{o.a.z.server.quorum.auth.SaslQuorumAuthServer}}) but has no state that changes
* {{javax.security.sasl.SaslServer}} is thread-specific (ok) - this instance is created to handle SASL authentication, but it is created using a peer-specific JAAS subject (which is ok) and the peer-specific {{o.a.z.server.quorum.auth.SaslQuorumAuthServer#serverLogin.callbackHandler}}, {color:red}which is potentially a problem{color}
Each of the six threads handles a different connection, but each pair (for a given QuorumPeer) calls {{o.a.z.server.quorum.auth.SaslQuorumServerCallbackHandler#handleNameCallback()}}, which modifies the shared (peer-specific) field - {{userName}}.
I understand that [according to the example from the Wiki|https://cwiki.apache.org/confluence/display/ZOOKEEPER/Server-Server+mutual+authentication]:
{noformat}
QuorumServer {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    user_peer1="peer1";
    user_peer2="peer2";
    user_peer3="peer3";
};
QuorumLearner1 {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="peer1"
    password="peer1";
};
...
{noformat}
All peers use the same credentials, but the "userName" comes from data sent by each peer, like this:
{noformat}
charset=utf-8,\
username="test",\
realm="zk-quorum-sasl-md5",\
nonce="iBqYWtaCrEE013S6Dv6xiOsR9uX2l/qKZcEZ1pm2",\
nc=0001,\
cnonce="LVaL9XYFjNxVBPCjPewXjEBsj9GuwIfBN/RXsKt5",\
digest-uri="zookeeper-quorum/zk-quorum-sasl-md5",\
maxbuf=65536,\
response=dd4e9e2115ec2e304484d5191f3fc771,\
qop=auth,\
authzid="test"
{noformat}
*And I can imagine a JAAS configuration for the DIGEST-MD5 SASL mechanism in which each peer uses its own credentials and is able to validate other peers' specific credentials.* Isn't it a race condition? Like this
[jira] [Created] (ZOOKEEPER-3138) Potential race condition with Quorum Peer mutual authentication via SASL
Grzegorz Grzybek created ZOOKEEPER-3138:
---
Summary: Potential race condition with Quorum Peer mutual authentication via SASL
Key: ZOOKEEPER-3138
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3138
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.4.13
Reporter: Grzegorz Grzybek

I'm in the process of reconfiguring the ensemble to use mutual quorum peer authentication via SASL (ZOOKEEPER-1045). In order to understand the impact on my code, I've checked _how it works_. I'm now running & debugging the {{org.apache.zookeeper.server.quorum.auth.QuorumDigestAuthTest#testValidCredentials()}} test case. I have six threads (3 peers contacting each other):
* "QuorumConnectionThread-[myid=0]-2@1483" prio=5 tid=0x2b nid=NA runnable
* "QuorumConnectionThread-[myid=0]-3@1491" prio=5 tid=0x36 nid=NA runnable
* "QuorumConnectionThread-[myid=1]-1@1481" prio=5 tid=0x2d nid=NA runnable
* "QuorumConnectionThread-[myid=1]-4@1505" prio=5 tid=0x3c nid=NA runnable
* "QuorumConnectionThread-[myid=2]-2@1495" prio=5 tid=0x37 nid=NA runnable
* "QuorumConnectionThread-[myid=2]-4@1506" prio=5 tid=0x3d nid=NA runnable
at this point of invocation:
{noformat}
org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handleNameCallback(SaslQuorumServerCallbackHandler.java:101)
org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler.handle(SaslQuorumServerCallbackHandler.java:82)
com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthServer.authenticate(SaslQuorumAuthServer.java:100)
org.apache.zookeeper.server.quorum.QuorumCnxManager.handleConnection(QuorumCnxManager.java:467)
org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:386)
org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReceiverThread.run(QuorumCnxManager.java:422)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
{noformat}
which is this line:
{code:java}
private void handleNameCallback(NameCallback nc) {
    // check to see if this user is in the user password database.
    if (credentials.get(nc.getDefaultName()) == null) {
        LOG.warn("User '{}' not found in list of DIGEST-MD5 authenticateable users.", nc.getDefaultName());
        return;
    }
    nc.setName(nc.getDefaultName());
    /* >>> */ userName = nc.getDefaultName(); /* <<< */
}
{code}
Each pair of threads is operating on a single instance of {{org.apache.zookeeper.server.quorum.auth.SaslQuorumServerCallbackHandler#userName}}. In the stack trace we have both shared and local variables/fields:
* {{o.a.z.server.quorum.QuorumCnxManager.QuorumConnectionReceiverThread#sock}} is thread-specific (ok)
* {{o.a.z.server.quorum.QuorumCnxManager#authServer}} is peer-specific (an instance of {{o.a.z.server.quorum.auth.SaslQuorumAuthServer}}) but has no state that changes
* {{javax.security.sasl.SaslServer}} is thread-specific (ok) - this instance is created to handle SASL authentication, but it is created using a peer-specific JAAS subject (which is ok) and the peer-specific {{o.a.z.server.quorum.auth.SaslQuorumAuthServer#serverLogin.callbackHandler}}, {color:red}which is potentially a problem{color}
Each of the six threads handles a different connection, but each pair (for a given QuorumPeer) calls {{o.a.z.server.quorum.auth.SaslQuorumServerCallbackHandler#handleNameCallback()}}, which modifies the shared (peer-specific) field - {{userName}}.
I understand that [according to the example from the Wiki|https://cwiki.apache.org/confluence/display/ZOOKEEPER/Server-Server+mutual+authentication]:
{noformat}
QuorumServer {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    user_peer1="peer1";
    user_peer2="peer2";
    user_peer3="peer3";
};
QuorumLearner1 {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    username="peer1"
    password="peer1";
};
...
{noformat}
All peers use the same credentials, but the "userName" comes from data sent by each peer, like this:
{noformat}
charset=utf-8,\
username="test",\
realm="zk-quorum-sasl-md5",\
nonce="iBqYWtaCrEE013S6Dv6xiOsR9uX2l/qKZcEZ1pm2",\
nc=0001,\
cnonce="LVaL9XYFjNxVBPCjPewXjEBsj9GuwIfBN/RXsKt5",\
digest-uri="zookeeper-quorum/zk-quorum-sasl-md5",\
maxbuf=65536,\
response=dd4e9e2115ec2e304484d5191f3fc771,\
qop=auth,\
authzid="test"
{noformat}
*And I can imagine a JAAS configuration for the DIGEST-MD5 SASL mechanism in which each peer uses its own credentials and is able to validate other peers' specific credentials.* Isn't it a
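The hazard the report describes can be reduced to a small sketch. The classes below are a hypothetical reconstruction for illustration, not the real {{SaslQuorumServerCallbackHandler}}: one handler per peer is shared by all connection threads, and the authenticated user lands in a single mutable field, so a second connection's authentication can overwrite the first one's result before it is read back. The interleaving is replayed deterministically here rather than raced:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reduction of the reported hazard (NOT the real ZooKeeper
// classes): one callback handler per peer, shared by all connection
// threads, stores the authenticated user in a mutable field.
public class SharedStateDemo {

    static class SharedCallbackHandler {
        private final Map<String, String> credentials;
        private String userName; // shared across connections: last writer wins

        SharedCallbackHandler(Map<String, String> credentials) {
            this.credentials = credentials;
        }

        void handleName(String defaultName) {
            // mirrors handleNameCallback(): look the user up, then record it
            if (credentials.get(defaultName) != null) {
                userName = defaultName;
            }
        }

        String getUserName() {
            return userName;
        }
    }

    // Deterministically replay the interleaving two QuorumConnectionThreads
    // could produce: connection 2 authenticates on the SAME handler after
    // connection 1 did, but before connection 1 reads userName back.
    public static String lostUpdateDemo() {
        Map<String, String> creds = new HashMap<>();
        creds.put("peer1", "pw1");
        creds.put("peer2", "pw2");
        SharedCallbackHandler handler = new SharedCallbackHandler(creds);
        handler.handleName("peer1"); // connection-1 thread authenticates
        handler.handleName("peer2"); // connection-2 thread overwrites the field
        return handler.getUserName(); // connection 1 would now observe "peer2"
    }

    public static void main(String[] args) {
        System.out.println(lostUpdateDemo()); // prints peer2, not peer1
    }
}
```

With identical credentials on every peer the overwrite is invisible, which is why the wiki example appears to work; with per-peer credentials, as the reporter imagines, the stale read becomes observable. A per-connection handler (or a return value instead of a field) would avoid the shared state.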
[GitHub] zookeeper issue #587: ZOOKEEPER-3106: Zookeeper client supports IPv6 address...
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/587 @maoling please rebase the patch ---
[GitHub] zookeeper issue #573: ZOOKEEPER-3090 continue can be replaced with break
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/573 @a470577391 Mr Jenkins has come back. Could you please close and reopen this PR to kick him off? ping @anmolnar ---
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/608 > Do you mind opening a PR to fix the other test? @anmolnar The other similar tests should be analysed case by case; I will follow up, and I welcome others' work on this issue. Sir, if you're free, could you please take a look at [PR-587](https://github.com/apache/zookeeper/pull/587)(grin)? ---
[jira] [Commented] (ZOOKEEPER-1011) fix Java Barrier Documentation example's race condition issue and polish up the Barrier Documentation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604492#comment-16604492 ] Hadoop QA commented on ZOOKEEPER-1011:
--
-1 overall. GitHub Pull Request Build
+1 @author. The patch does not contain any @author tags.
+0 tests included. The patch appears to be a documentation patch that doesn't require tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2118//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2118//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2118//console
This message is automatically generated.
> fix Java Barrier Documentation example's race condition issue and polish up the Barrier Documentation
> -
>
> Key: ZOOKEEPER-1011
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1011
> Project: ZooKeeper
> Issue Type: Bug
> Components: documentation
> Reporter: Semih Salihoglu
> Assignee: maoling
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There is a race condition in the Barrier example of the java doc: http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html. It's in the enter() method.
> Here's the original example:
> boolean enter() throws KeeperException, InterruptedException {
>     zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE,
>         CreateMode.EPHEMERAL_SEQUENTIAL);
>     while (true) {
>         synchronized (mutex) {
>             List<String> list = zk.getChildren(root, true);
>             if (list.size() < size) {
>                 mutex.wait();
>             } else {
>                 return true;
>             }
>         }
>     }
> }
> Here's the race condition scenario:
> Let's say there are two machines/nodes, node1 and node2, that will use this code to synchronize over ZK, and the following steps take place:
> 1. node1 calls the zk.create method, then reads the number of children, sees that it's 1, and starts waiting.
> 2. node2 calls the zk.create method (but doesn't call the zk.getChildren method yet; let's say it's very slow).
> 3. node1 is notified that the number of children on the znode changed; it checks that the size is 2, so it enters the barrier, does its work, and then leaves the barrier, deleting its node.
> 4. node2 calls zk.getChildren and, because node1 has already left, it sees that the number of children is equal to 1. Since node1 will never enter the barrier again, node2 will keep waiting.
> --- End of scenario ---
> Here are Flavio's fix suggestions (copying from the email thread):
> ...
> I see two possible action points out of this discussion:
>
> 1- State clearly in the beginning that the example discussed is not correct under the assumption that a process may finish the computation before another has started, and the example is there for illustration purposes;
> 2- Have another example following the current one that discusses the problem and shows how to fix it. This is an interesting option that illustrates how one could reason about a solution when developing with zookeeper.
> ...
> We'll go with the 2nd option.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
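One standard cure for the scenario above is a "ready" marker: the process that observes the count reaching `size` creates a ready node, and everyone waits for that marker instead of re-counting children, so a fast process deleting its own node on leave can no longer strand a slow one. The sketch below simulates this fix against an in-memory stand-in for the znode tree (no live ensemble; the ZooKeeper calls each line would correspond to are noted in comments), so it illustrates the reasoning rather than the final documented example:

```java
import java.util.Set;
import java.util.TreeSet;

// In-memory simulation of the "ready node" fix: the process that sees the
// count reach `size` sets a ready marker, and everyone waits on the marker.
public class ReadyBarrierSketch {
    private final Object mutex = new Object();
    private final Set<String> children = new TreeSet<>();
    private boolean ready = false; // stands in for a /ready znode
    private final int size;

    public ReadyBarrierSketch(int size) {
        this.size = size;
    }

    public void enter(String name) throws InterruptedException {
        synchronized (mutex) {
            children.add(name);              // zk.create(root + "/" + name, EPHEMERAL)
            if (children.size() >= size) {
                ready = true;                // zk.create(root + "/ready")
                mutex.notifyAll();
            }
            while (!ready) {                 // zk.exists(root + "/ready", watch=true)
                mutex.wait();
            }
        }
    }

    public void leave(String name) {
        synchronized (mutex) {
            children.remove(name);           // zk.delete(root + "/" + name)
        }
    }

    // Replays the problem scenario: "fast" enters and leaves immediately.
    // With the ready marker, "slow" still gets through instead of waiting
    // forever on a child count that will never reach `size` again.
    public static boolean demo() throws Exception {
        ReadyBarrierSketch barrier = new ReadyBarrierSketch(2);
        Thread fast = new Thread(() -> {
            try {
                barrier.enter("fast");
                barrier.leave("fast");       // deletes its node before slow re-checks
            } catch (InterruptedException ignored) {
            }
        });
        fast.start();
        barrier.enter("slow");               // hangs forever with the original code
        fast.join(5000);
        return !fast.isAlive();              // both threads passed the barrier
    }
}
```

The key change is that `ready`, unlike the child count, is monotonic: once set it never goes back, so it does not matter how late a slow process checks it.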
[GitHub] zookeeper issue #593: [ZOOKEEPER-3111] Add socket buffer size option to tune...
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/593 @lvfangmin It will be a very useful option for tuning the network performance between leader and follower, but you should give us some test evidence (smirk). Can we draw the conclusion: > Increasing the receive buffer size can increase the performance of network I/O for high-volume connection, while decreasing it can help reduce the backlog of incoming data. or something else? What is the reasonable value you would suggest? ---
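At the `java.net` level, a socket-buffer-size knob like the one proposed in this PR boils down to the calls sketched below. The property name here is hypothetical (NOT the option the patch actually introduces), and the 64 KB caveat comes from the `ServerSocket.setReceiveBufferSize` contract:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.SocketException;

// Illustrative only: what a socket-buffer-size option reduces to at the
// java.net level. The real option name and plumbing in the PR may differ.
public class SocketBufferSketch {
    // Hypothetical system property, NOT the real ZooKeeper property name.
    static final String PROP = "sketch.quorum.socketBufferBytes";

    public static int configuredBufferSize(int defaultBytes) {
        return Integer.getInteger(PROP, defaultBytes);
    }

    public static int applyReceiveBuffer(ServerSocket server, int bytes) throws SocketException {
        // Must be set on the ServerSocket BEFORE accept() for buffers larger
        // than 64 KB to be honored, because the TCP receive window scaling
        // is negotiated when the connection is established.
        server.setReceiveBufferSize(bytes);
        return server.getReceiveBufferSize();
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket()) {
            int requested = configuredBufferSize(1 << 20);
            int effective = applyReceiveBuffer(server, requested);
            // The OS may clamp the request; report what we actually got,
            // which is exactly the kind of test evidence asked for above.
            System.out.println("requested=" + requested + " effective=" + effective);
        }
    }
}
```

Logging the effective (post-clamp) size, as `main` does, is one cheap way to produce the evidence requested: the kernel silently caps requests above its `rmem_max`-style limits, so the requested and effective values can differ.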
[GitHub] zookeeper issue #593: [ZOOKEEPER-3111] Add socket buffer size option to tune...
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/593 > What do you mean by 'do regression' exactly? @anmolnar Like this (grin):
```java
int index = 0;
while (true) {
    try {
        testSetSocketBufferSize();
    } catch (Exception e) {
        e.printStackTrace();
    }
    try {
        Thread.sleep(200);
    } catch (InterruptedException e) {
        // ignore and keep looping
    }
    System.out.println("--- " + (index++) + " passed!!!");
}
```
---
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/608 Committed. 3.5 and master branches. Thanks @maoling . Do you mind opening a PR to fix the other test? ---
[jira] [Resolved] (ZOOKEEPER-2913) testEphemeralNodeDeletion is flaky
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andor Molnar resolved ZOOKEEPER-2913.
-
Resolution: Fixed
Fix Version/s: 3.5.5, 3.6.0
Issue resolved by pull request 608 [https://github.com/apache/zookeeper/pull/608]
> testEphemeralNodeDeletion is flaky
> --
>
> Key: ZOOKEEPER-2913
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2913
> Project: ZooKeeper
> Issue Type: Bug
> Components: tests
> Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.6.0
> Reporter: Patrick Hunt
> Assignee: maoling
> Priority: Major
> Labels: flaky, pull-request-available
> Fix For: 3.6.0, 3.5.5
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> testEphemeralNodeDeletion is showing up as flakey across a number of jobs.
> 1. https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html
> 2. https://builds.apache.org/job/ZooKeeper_branch34_java9/305/testReport/junit/org.apache.zookeeper.server.quorum/EphemeralNodeDeletionTest/testEphemeralNodeDeletion/
> After session close ephemeral node must be deleted
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] zookeeper pull request #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is fla...
Github user asfgit closed the pull request at: https://github.com/apache/zookeeper/pull/608 ---
[jira] [Updated] (ZOOKEEPER-1011) fix Java Barrier Documentation example's race condition issue and polish up the Barrier Documentation
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ZOOKEEPER-1011:
--
Labels: pull-request-available (was: )
> fix Java Barrier Documentation example's race condition issue and polish up the Barrier Documentation
> -
>
> Key: ZOOKEEPER-1011
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1011
> Project: ZooKeeper
> Issue Type: Bug
> Components: documentation
> Reporter: Semih Salihoglu
> Assignee: maoling
> Priority: Major
> Labels: pull-request-available
>
> There is a race condition in the Barrier example of the java doc: http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html. It's in the enter() method. Here's the original example:
> boolean enter() throws KeeperException, InterruptedException {
>     zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE,
>         CreateMode.EPHEMERAL_SEQUENTIAL);
>     while (true) {
>         synchronized (mutex) {
>             List<String> list = zk.getChildren(root, true);
>             if (list.size() < size) {
>                 mutex.wait();
>             } else {
>                 return true;
>             }
>         }
>     }
> }
> Here's the race condition scenario:
> Let's say there are two machines/nodes, node1 and node2, that will use this code to synchronize over ZK, and the following steps take place:
> 1. node1 calls the zk.create method, then reads the number of children, sees that it's 1, and starts waiting.
> 2. node2 calls the zk.create method (but doesn't call the zk.getChildren method yet; let's say it's very slow).
> 3. node1 is notified that the number of children on the znode changed; it checks that the size is 2, so it enters the barrier, does its work, and then leaves the barrier, deleting its node.
> 4. node2 calls zk.getChildren and, because node1 has already left, it sees that the number of children is equal to 1. Since node1 will never enter the barrier again, node2 will keep waiting.
> --- End of scenario ---
> Here are Flavio's fix suggestions (copying from the email thread):
> ...
> I see two possible action points out of this discussion:
>
> 1- State clearly in the beginning that the example discussed is not correct under the assumption that a process may finish the computation before another has started, and the example is there for illustration purposes;
> 2- Have another example following the current one that discusses the problem and shows how to fix it. This is an interesting option that illustrates how one could reason about a solution when developing with zookeeper.
> ...
> We'll go with the 2nd option.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/608 @anmolnar Take it easy (grin). > Do you think other flaky tests could be impacted by the same issue? I dug into another issue, [ZOOKEEPER-3023](https://issues.apache.org/jira/browse/ZOOKEEPER-3023), which may be the same problem, or the `50ms` timeout may simply be too short to see the updated data. When Jenkins runs the regression, the disk is heavy and busy, and some state/data can't sync from the leader to the learner in time, which may be a salient cause of flaky tests. ---
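The "50ms is too short on a loaded Jenkins box" failure mode described above is usually fixed by polling the condition up to a generous deadline instead of sleeping a fixed interval before asserting. A minimal helper (illustrative, not code from the patch):

```java
import java.util.function.BooleanSupplier;

// A common cure for this class of flakiness: instead of a fixed short
// sleep before asserting, poll the condition until a generous deadline.
// Fast machines pass quickly; a loaded CI box just polls a bit longer.
public class WaitFor {
    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(25); // short poll interval, long overall budget
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

A test would then write something like `assertTrue(WaitFor.waitFor(() -> learnerCaughtUp(), 30_000))` (the `learnerCaughtUp()` predicate is hypothetical), so the timeout bounds the worst case without dictating the common case.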
[GitHub] zookeeper issue #611: ZOOKEEPER-3131
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/611 @enixon I think it leaks the map anyway, because when a new Watcher is added, a new Watcher instance gets created which will be added to the Map anyway, and the empty one remains forever. @wangchaod Would you please consider adding a quick unit test for the patch? Otherwise, it looks good. ---
[GitHub] zookeeper issue #593: [ZOOKEEPER-3111] Add socket buffer size option to tune...
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/593 @maoling I've run the test a few times on Mac and CentOs, but it doesn't seem to be flaky for me. What do you mean by 'do regression' exactly? ---
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/608 @maoling +1 for the solution. I tend to commit this anyway. Do you think other flaky tests could be impacted by the same issue? ---
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user maoling commented on the issue: https://github.com/apache/zookeeper/pull/608 @anmolnar This flaky test has happened many times, as tracked by ZOOKEEPER-2913 and ZOOKEEPER-3040, both reported by Patrick Hunt. The build failure report URL is [here](https://builds.apache.org/job/ZooKeeper_branch34_java9/305/testReport/junit/org.apache.zookeeper.server.quorum/EphemeralNodeDeletionTest/testEphemeralNodeDeletion/), which is now broken. When Jenkins runs the regression, it will happen again. If you have more concerns, let this patch sit and wait for this flaky test to come back :D ---
[GitHub] zookeeper issue #608: ZOOKEEPER-2913:testEphemeralNodeDeletion is flaky
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/608 Good job @maoling. However, I can't see this test in the latest flaky test report: https://builds.apache.org/job/ZooKeeper-Find-Flaky-Tests/lastSuccessfulBuild/artifact/report.html Is this patch still needed? ---
[GitHub] zookeeper issue #607: WIP - ZOOKEEPER-2990: Implement probabilistic tracing
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/607 Basically my biggest concern is that this patch shouldn't go into 3.4, because it's not critical. 3.5 is just around the corner, and improvements like this will encourage people to upgrade once a stable version is available. ---
[GitHub] zookeeper issue #607: WIP - ZOOKEEPER-2990: Implement probabilistic tracing
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/607 Thanks for the contribution @lavacat . Please take a look at my comments in the Jira. ---
[jira] [Commented] (ZOOKEEPER-2990) Implement probabilistic tracing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604232#comment-16604232 ] Andor Molnar commented on ZOOKEEPER-2990:
-
[~bkanivets] What are the target versions of this patch? I suggest 3.5 and 3.6 only, because it's a minor improvement. (We should only fix critical and security issues in 3.4.)
> Implement probabilistic tracing
> ---
>
> Key: ZOOKEEPER-2990
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2990
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Reporter: Bogdan Kanivets
> Assignee: Bogdan Kanivets
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> It would be nice to have an ability to do probabilistic tracing similar to the Cassandra [nodetool|https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsSetTraceProbability.html].
> This will help debug issues in prod systems.
> I'd like to contribute if everyone is ok with the feature.
> My suggestion is to add an extra parameter to ZooTrace to handle it.
> Questions:
> * should it be one global param or per each ZooTrace mask? I'm thinking per mask
> * should it be a new 4lw or part of 'stmk'? Leaning towards a new word and refactoring param passing to words (stmk is a special case right now).
> * there are places in the code that use LOG.trace directly. That will have to change to ZooTrace
> I can make some initial implementation for demo/review.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
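The per-mask gate the ticket proposes reduces to a probability check in front of each trace emission. The sketch below is illustrative only; how it would actually hook into `ZooTrace`'s masks and a new four-letter word is exactly what the ticket's open questions are about:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a probabilistic trace gate (names are illustrative; the real
// ZooTrace integration, per mask, would look different).
public class ProbabilisticTrace {
    // probability in [0.0, 1.0]; volatile so a runtime knob (e.g. a new
    // four-letter word) can change it while request threads read it
    private volatile double probability;
    private final AtomicLong emitted = new AtomicLong();

    public ProbabilisticTrace(double probability) {
        this.probability = probability;
    }

    public void setProbability(double p) {
        this.probability = p;
    }

    public boolean shouldTrace() {
        double p = probability; // single read of the volatile
        if (p <= 0.0) return false; // fast path: tracing off, no RNG call
        if (p >= 1.0) return true;  // fast path: trace everything
        return ThreadLocalRandom.current().nextDouble() < p;
    }

    public void trace(String message) {
        if (shouldTrace()) {
            emitted.incrementAndGet();
            // in the real thing: ZooTrace/LOG.trace(message)
        }
    }

    public long emittedCount() {
        return emitted.get();
    }
}
```

With, say, `setProbability(0.01)`, roughly 1% of requests pay the tracing cost, which is what makes this usable on production systems.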
[GitHub] zookeeper issue #590: [ZOOKEEPER-1177] Add the memory optimized watch manage...
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/590 @lvfangmin Docs for the new caching parameter `zookeeper.bitHashCacheSize`? ---
[GitHub] zookeeper pull request #590: [ZOOKEEPER-1177] Add the memory optimized watch...
Github user anmolnar commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/590#discussion_r215183702
--- Diff: src/java/main/org/apache/zookeeper/server/watch/WatchManagerOptimized.java ---
@@ -0,0 +1,355 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.server.watch;
+
+import java.io.PrintWriter;
+import java.util.HashMap;
+import java.util.BitSet;
+import java.util.HashSet;
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Random;
+import java.util.Set;
+import java.util.Iterator;
+import java.lang.Iterable;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+import org.apache.zookeeper.WatchedEvent;
+import org.apache.zookeeper.Watcher;
+import org.apache.zookeeper.Watcher.Event.EventType;
+import org.apache.zookeeper.Watcher.Event.KeeperState;
+import org.apache.zookeeper.server.ServerCnxn;
+import org.apache.zookeeper.server.util.BitMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Optimized in memory and time complexity, compared to WatchManager, both the
+ * memory consumption and time complexity improved a lot, but it cannot
+ * efficiently remove the watcher when the session or socket is closed; for
+ * the majority of use cases this is not a problem.
+ *
+ * Changes made compared to WatchManager:
+ *
+ * - Use HashSet and BitSet to store the watchers to find a balance between
+ *   memory usage and time complexity
+ * - Use ReadWriteLock instead of synchronized to reduce lock retention
+ * - Lazily clean up the closed watchers
+ */
+public class WatchManagerOptimized
+        implements IWatchManager, DeadWatcherListener {
+
+    private static final Logger LOG =
+        LoggerFactory.getLogger(WatchManagerOptimized.class);
+
+    private final ConcurrentHashMap<String, BitHashSet> pathWatches =
+        new ConcurrentHashMap<String, BitHashSet>();
+
+    // watcher to bit id mapping
+    private final BitMap<Watcher> watcherBitIdMap = new BitMap<Watcher>();
+
+    // used to lazily remove the dead watchers
+    private final WatcherCleaner watcherCleaner;
+
+    private final ReentrantReadWriteLock addRemovePathRWLock = new ReentrantReadWriteLock();
+
+    public WatchManagerOptimized() {
+        watcherCleaner = new WatcherCleaner(this);
+        watcherCleaner.start();
+    }
+
+    @Override
+    public boolean addWatch(String path, Watcher watcher) {
+        boolean result = false;
+        addRemovePathRWLock.readLock().lock();
+        try {
+            // avoid race condition of adding an in-flight dead watcher
+            if (isDeadWatcher(watcher)) {
+                LOG.debug("Ignoring addWatch with closed cnxn");
+            } else {
+                Integer bit = watcherBitIdMap.add(watcher);
+                BitHashSet watchers = pathWatches.get(path);
+                if (watchers == null) {
+                    watchers = new BitHashSet();
+                    BitHashSet existingWatchers = pathWatches.putIfAbsent(path, watchers);
+                    if (existingWatchers != null) {
+                        watchers = existingWatchers;
+                    }
+                }
+                result = watchers.add(bit);
+            }
+        } finally {
+            addRemovePathRWLock.readLock().unlock();
+        }
+        return result;
+    }
+
+    @Override
+    public boolean containsWatcher(String path, Watcher watcher) {
+        BitHashSet watchers = pathWatches.get(path);
+        if (watchers == null || !watchers.contains(watcherBitIdMap.getBit(watcher))) {
+            return false;
+        }
+        return true;
+
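The "HashSet and BitSet" balance in the diff above hinges on one trick: assign every Watcher a small integer id once, then represent "watchers on this path" as a set of bits rather than a set of object references, so the same few connections watching many paths cost one bit per path instead of one reference per path. A self-contained sketch of that idea, with hypothetical names (not the patch's `BitMap`/`BitHashSet` classes):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch of the watcher-to-bit-id trick: each distinct watcher gets a
// small integer id once, and per-path membership is a BitSet of ids.
public class BitIdMapSketch<T> {
    private final Map<T, Integer> idOf = new HashMap<>();
    private int nextId = 0;

    // Returns a stable id for the watcher, allocating one on first sight.
    public synchronized int idFor(T value) {
        return idOf.computeIfAbsent(value, v -> nextId++);
    }

    public synchronized Integer lookup(T value) {
        return idOf.get(value); // null if the watcher was never registered
    }

    // Many paths watched by the same two connections: each path stores
    // two bits, not two object references.
    public static int demo() {
        BitIdMapSketch<String> watchers = new BitIdMapSketch<>();
        Map<String, BitSet> pathWatches = new HashMap<>();
        for (String path : new String[] {"/a", "/b", "/c"}) {
            BitSet set = pathWatches.computeIfAbsent(path, p -> new BitSet());
            set.set(watchers.idFor("conn-1")); // same watcher, same bit everywhere
            set.set(watchers.idFor("conn-2"));
        }
        return pathWatches.get("/c").cardinality();
    }
}
```

The patch's hybrid goes further (small sets stay hash-based, large ones switch to bits), but the id indirection above is what makes the per-path storage cheap and the trade-off noted in the javadoc real: removing a dead watcher means finding its bit in every path's set, hence the lazy cleanup.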
[GitHub] zookeeper pull request #590: [ZOOKEEPER-1177] Add the memory optimized watch...
Github user anmolnar commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/590#discussion_r215180602
--- Diff: src/java/main/org/apache/zookeeper/server/watch/WatchManager.java ---
@@ -46,15 +48,26 @@
     private final Map<Watcher, Set<String>> watch2Paths =
         new HashMap<Watcher, Set<String>>();

-    synchronized int size(){
+    @Override
+    public synchronized int size(){
         int result = 0;
         for (Set<String> watches : watchTable.values()) {
             result += watches.size();
         }
         return result;
     }

-    synchronized void addWatch(String path, Watcher watcher) {
+    boolean isDeadWatcher(Watcher watcher) {
--- End diff --
Taking that into account and the jira fix version, this patch will definitely go into 3.5 as well. ---
[GitHub] zookeeper issue #615: ZOOKEEPER-3137: add a utility to truncate logs to a zx...
Github user anmolnar commented on the issue: https://github.com/apache/zookeeper/pull/615 @enixon Your patch looks good, this is going to be a useful tool for ZooKeeper. Have you considered adding this functionality to `TxnLogToolkit` (a tool developed by me :), instead of creating a new one? ---