[jira] Commented: (ZOOKEEPER-366) Session timeout detection can go wrong if the leader system time changes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933075#action_12933075 ] Mahadev konar commented on ZOOKEEPER-366: - I am all for making it for 3.3.3. I'd be willing to fix it, though probably not this week. Session timeout detection can go wrong if the leader system time changes Key: ZOOKEEPER-366 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-366 Project: Zookeeper Issue Type: Bug Components: quorum, server Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.3.3, 3.4.0 Attachments: ZOOKEEPER-366.patch the leader tracks session expirations by calculating when a session will time out and then periodically checking to see what needs to be timed out based on the current time. this works great as long as the leader's clock progresses at a steady pace. the problem comes when there are big (session-sized) changes in clock, by ntp for example. if time gets adjusted forward, all the sessions could time out immediately. if time goes backward, sessions that should time out may take a lot longer to actually expire. this is really just a leader issue. the easiest way to deal with this is to have the leader relinquish leadership if it detects a big jump forward in time. when a new leader gets elected, it will recalculate timeouts of active sessions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
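To illustrate the failure mode described above, here is a minimal standalone sketch (the class and field names are illustrative, not ZooKeeper's actual SessionTracker code): a deadline derived from System.currentTimeMillis() moves with NTP adjustments, while one derived from the monotonic System.nanoTime() does not.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: shows why wall-clock based expiry is fragile under clock jumps.
public class ExpirySketch {
    static final class Session {
        final long id;
        volatile long expireAtMillis; // wall-clock deadline: shifts with NTP steps
        volatile long expireAtNanos;  // monotonic deadline: immune to NTP steps

        Session(long id, int timeoutMs) {
            this.id = id;
            touch(timeoutMs);
        }

        void touch(int timeoutMs) {
            expireAtMillis = System.currentTimeMillis() + timeoutMs;
            expireAtNanos = System.nanoTime() + timeoutMs * 1_000_000L;
        }
    }

    private final Map<Long, Session> sessions = new ConcurrentHashMap<>();

    // Periodic sweep, as a leader might run it. Using the monotonic deadline means a
    // forward clock step cannot expire every session at once, and a backward step
    // cannot keep dead sessions alive indefinitely.
    void sweep() {
        long nowNanos = System.nanoTime();
        for (Session s : sessions.values()) {
            if (nowNanos - s.expireAtNanos >= 0) { // overflow-safe comparison
                sessions.remove(s.id);
                System.out.println("expiring session 0x" + Long.toHexString(s.id));
            }
        }
    }
}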
[jira] Commented: (ZOOKEEPER-905) enhance zkServer.sh for easier zookeeper automation-izing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12932498#action_12932498 ] Mahadev konar commented on ZOOKEEPER-905: - The patch looks good and applies to trunk. +1 to the patch, I am marking this as PA. enhance zkServer.sh for easier zookeeper automation-izing - Key: ZOOKEEPER-905 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-905 Project: Zookeeper Issue Type: Improvement Components: scripts Reporter: Nicholas Harteau Assignee: Nicholas Harteau Priority: Minor Fix For: 3.4.0 Attachments: zkServer.sh.diff zkServer.sh is good at starting zookeeper and figuring out the right options to pass along. unfortunately if you want to wrap zookeeper startup/shutdown in any significant way, you have to reimplement a bunch of the logic there. the attached patch addresses a couple simple issues: 1. add a 'start-foreground' option to zkServer.sh - this allows things that expect to manage a foregrounded process (daemontools, launchd, etc) to use zkServer.sh instead of rolling their own to launch zookeeper 2. add a 'print-cmd' option to zkServer.sh - rather than launching zookeeper from the script, just give me the command you'd normally use to exec zookeeper. I found this useful when writing automation to start/stop zookeeper as part of smoke testing zookeeper-based applications 3. Deal more gracefully with supplying alternate configuration files to zookeeper - currently the script assumes all config files reside in $ZOOCFGDIR - also useful for smoke testing 4. communicate extra info (JMX enabled) about zookeeper on STDERR rather than STDOUT (necessary for #2) 5. fixes an issue on macos where readlink doesn't have the '-f' option. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-905) enhance zkServer.sh for easier zookeeper automation-izing
[ https://issues.apache.org/jira/browse/ZOOKEEPER-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-905: Status: Patch Available (was: Open) enhance zkServer.sh for easier zookeeper automation-izing - Key: ZOOKEEPER-905 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-905 Project: Zookeeper Issue Type: Improvement Components: scripts Reporter: Nicholas Harteau Assignee: Nicholas Harteau Priority: Minor Fix For: 3.4.0 Attachments: zkServer.sh.diff zkServer.sh is good at starting zookeeper and figuring out the right options to pass along. unfortunately if you want to wrap zookeeper startup/shutdown in any significant way, you have to reimplement a bunch of the logic there. the attached patch addresses a couple simple issues: 1. add a 'start-foreground' option to zkServer.sh - this allows things that expect to manage a foregrounded process (daemontools, launchd, etc) to use zkServer.sh instead of rolling their own to launch zookeeper 2. add a 'print-cmd' option to zkServer.sh - rather than launching zookeeper from the script, just give me the command you'd normally use to exec zookeeper. I found this useful when writing automation to start/stop zookeeper as part of smoke testing zookeeper-based applications 3. Deal more gracefully with supplying alternate configuration files to zookeeper - currently the script assumes all config files reside in $ZOOCFGDIR - also useful for smoke testing 4. communicate extra info (JMX enabled) about zookeeper on STDERR rather than STDOUT (necessary for #2) 5. fixes an issue on macos where readlink doesn't have the '-f' option. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-756) some cleanup and improvements for zooinspector
[ https://issues.apache.org/jira/browse/ZOOKEEPER-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932503#action_12932503 ] Mahadev konar commented on ZOOKEEPER-756: - @colin, Any updates on the patch? As patrick mentioned above, the patch doesn't apply to the trunk. some cleanup and improvements for zooinspector -- Key: ZOOKEEPER-756 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-756 Project: Zookeeper Issue Type: Improvement Components: contrib Affects Versions: 3.3.0 Reporter: Thomas Koch Assignee: Colin Goodheart-Smithe Fix For: 3.4.0 Attachments: zooInspectorChanges.patch Copied from the already closed ZOOKEEPER-678: * specify the exact URL where the icons are from. It's best to include the link also in the NOTICE.txt file. It seems that zooinspector finds its icons only if the icons folder is in the current path. But when I install zooinspector as part of the Zookeeper Debian package, I want to be able to call it regardless of the current path. Could you use getResources or something so that I can point to the icons location from the wrapper shell script? Can I place the zooinspector config files in /etc/zookeeper/zooinspector/ ? Could I give zooinspector a property to point to the config file location? There are several places where viewers is misspelled as Veiwers. Please do a case insensitive search for veiw to correct these. Even the config file defaultNodeVeiwers.cfg is misspelled like this. This has the potential to confuse the hell out of people when debugging something! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
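Regarding the getResources suggestion above: a minimal sketch of loading icons from the classpath instead of the current working directory (the icon path below is illustrative, not zooinspector's actual layout):

import java.net.URL;
import javax.swing.ImageIcon;

public class IconLoader {
    // Loads an icon bundled on the classpath (e.g. inside the jar), so the tool
    // works regardless of the directory it was launched from.
    public static ImageIcon loadIcon(String name) {
        // e.g. name = "icons/treenode.png" - path is illustrative
        URL url = IconLoader.class.getClassLoader().getResource(name);
        if (url == null) {
            throw new IllegalStateException("icon not found on classpath: " + name);
        }
        return new ImageIcon(url);
    }
}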
[jira] Updated: (ZOOKEEPER-931) Documentation for Hedwig
[ https://issues.apache.org/jira/browse/ZOOKEEPER-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-931: Fix Version/s: 3.4.0 Documentation for Hedwig Key: ZOOKEEPER-931 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-931 Project: Zookeeper Issue Type: Task Components: contrib-hedwig Reporter: Erwin Tam Fix For: 3.4.0 This is a tracking jira for providing documentation to Hedwig on the Apache site. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Watcher examples
Yup. This would be really nice to have. More examples and documentation for them would be really helpful for our users. Thanks mahadev On 11/15/10 12:54 PM, Patrick Hunt ph...@apache.org wrote: It would be great to have more examples as part of the release artifact. Would you mind creating a JIRA/patch for this? http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute I'm thinking that we could have a src/contrib/examples or src/examples ... what do you guys think? (mahadev?) Patrick On Thu, Nov 11, 2010 at 12:46 PM, Robert Crocombe rcroc...@gmail.com wrote: On Tue, Nov 9, 2010 at 12:34 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Anyone know of a good blog post or docs anywhere that gives a simple example of Watchers in action? I saw the one on: http://hadoop.apache.org/zookeeper/docs/current/javaExample.html#ch_Introduction but it seems kind of overly complicated for an intro to Watchers. I appreciate the example but wondered if there were other examples out there. Appended is a Java example of using a Watcher simply to wait for the client to actually be connected to a server. I used it when I was confirming to my satisfaction that there was a bug in the ZooKeeper recipe for WriteLock awhile ago. I think this use is slightly unusual in that it is more interested in KeeperState than the event type. A more conventional Watcher might be like the following sketch (uhm, this is Groovy), though really you'd have to look at both:

@Override
public void process(WatchedEvent event) {
    switch (event?.getType()) {
        case EventType.NodeDeleted:
            // TODO: what should we do if the node being watched is itself deleted?
            LOG.error("The node being watched '" + event.getPath() + "' has been deleted: that's not good")
            break
        case EventType.NodeChildrenChanged:
            childrenChanged(event)
            break
        default:
            LOG.debug("Ignoring event type '" + event?.getType() + "'")
            break
    }
}

-- Robert Crocombe

package derp;

import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.recipes.lock.WriteLock;

public class Test {
    private static final Log LOG = LogFactory.getLog(Test.class);
    private static final String ZOO_CONFIG = "10.2.1.54:2181/test";
    private static final String LOCK_DIR = "/locking-test";
    private static final int TIMEOUT_MILLIS = 1;

    private static class ConnectWatcher implements Watcher {
        private final Lock connectedLock = new ReentrantLock();
        private final Condition connectedCondition = connectedLock.newCondition();
        private final AtomicBoolean connected = new AtomicBoolean(false);

        @Override
        public void process(WatchedEvent event) {
            LOG.debug("Event: " + event);
            KeeperState keeperState = event.getState();
            switch (keeperState) {
            case SyncConnected:
                if (!connected.get()) {
                    connected.set(true);
                    signal();
                }
                break;
            case Expired:
            case Disconnected:
                if (connected.get()) {
                    connected.set(false);
                    signal();
                }
            }
        }

        public void waitForConnection() throws InterruptedException {
            connectedLock.lock();
            try {
                while (!connected.get()) {
                    LOG.debug("Waiting for condition to be signalled");
                    connectedCondition.await();
                    LOG.debug("Woken up on condition signalled");
                }
            } finally {
                connectedLock.unlock();
            }
            LOG.debug("After signalling, we are connected");
        }

        @Override
        public String toString() {
            StringBuilder b = new StringBuilder("[");
            b.append("connectedLock:").append(connectedLock);
            b.append(",connectedCondition:").append(connectedCondition);
            b.append(",connected:").append(connected);
            b.append("]");
            return b.toString();
        }

        private void signal() {
            LOG.debug("Signaling after event");
            connectedLock.lock();
            try {
                connectedCondition.signal();
            } finally {
                connectedLock.unlock();
            }
        }
    }

    private static final void fine(ZooKeeper lowerId, ZooKeeper higherId)
            throws KeeperException, InterruptedException {
        WriteLock lower = new WriteLock(lowerId, LOCK_DIR, null);
        WriteLock higher = new WriteLock(higherId, LOCK_DIR, null);
        boolean lowerAcquired = lower.lock();
        assert lowerAcquired;
        LOG.debug("Lower acquired lock successfully, so higher should fail");
        boolean higherAcquired = higher.lock();
        assert !higherAcquired;
        LOG.debug("Correct: higher session fails to acquire lock");
        lower.unlock();
        // Now that lower has unlocked, higher will acquire. Really should use
        // the version of WriteLock with the LockListener, but a short sleep
        // should do.
        Thread.sleep(2000);
        higher.unlock(); // make sure we let go.
        assert !higher.isOwner();
    }

    /*
     * Using recipes from
Re: [VOTE] Release ZooKeeper 3.3.2 (candidate 0)
+1 for the release. Ran ant test and a couple of smoke tests. Created znodes and shut down zookeeper servers to test durability. Deleted znodes to make sure they are deleted. Shot down servers one at a time to confirm correct behavior. Thanks mahadev On 11/4/10 11:17 PM, Patrick Hunt ph...@apache.org wrote: I've created a candidate build for ZooKeeper 3.3.2. This is a bug fix release addressing twenty-six issues (eight critical) -- see the release notes for details. *** Please download, test and VOTE before the *** vote closes 11pm pacific time, Tuesday, November 9.*** http://people.apache.org/~phunt/zookeeper-3.3.2-candidate-0/ Should we release this? Patrick
[jira] Commented: (ZOOKEEPER-896) Improve C client to support dynamic authentication schemes
[ https://issues.apache.org/jira/browse/ZOOKEEPER-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930720#action_12930720 ] Mahadev konar commented on ZOOKEEPER-896: - this is interesting. Botond, can you explain your kerberos setup? Who generates the kerberos tokens? I am very interested in plugging in kerberos with zookeeper. Improve C client to support dynamic authentication schemes -- Key: ZOOKEEPER-896 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-896 Project: Zookeeper Issue Type: Improvement Components: c client Affects Versions: 3.3.1 Reporter: Botond Hejj Assignee: Botond Hejj Fix For: 3.4.0 Attachments: ZOOKEEPER-896.patch When we started exploring zookeeper for our requirements we found the authentication mechanism is not flexible enough. We want to use kerberos for authentication but using the current API we ran into a few problems. The idea is that we get a kerberos token on the client side and then send that token to the server with a kerberos scheme. A server side authentication plugin can use that token to authenticate the client and also use the token for authorization. We ran into two problems with this approach: 1. A different kerberos token is needed for each different server that the client can connect to since kerberos uses mutual authentication. That means when the client acquires this kerberos token it has to know which server it connects to and generate the token according to that. The client currently can't generate a token for a specific server. The token stored in the auth_info is used for all the servers. 2. The kerberos token might have an expiry time so if the client loses the connection to the server and then it tries to reconnect it should acquire a new token. That is not possible currently since the token is stored in auth_info and reused for every connection. The problem can be solved if we allow the client to register a callback for authentication instead of a static token. This can be a callback with an argument which passes the current host string. The zookeeper client code could call this callback before it sends the authentication info to the server to get a fresh server specific token. This would solve our problem with the kerberos authentication and also could be used for other more dynamic authentication schemes. The solution could be generalized for the java client as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
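As a rough illustration of the callback idea in Java terms, something like the interface below could be registered instead of a static token. This is a hypothetical sketch, not an existing ZooKeeper API; the interface, method and scheme names are made up for illustration:

// Hypothetical interface - not part of the current ZooKeeper client API.
public interface AuthTokenProvider {
    /**
     * Called by the client just before it sends auth info, once per (re)connect,
     * so a fresh, host-specific token (e.g. a Kerberos service ticket) can be produced.
     *
     * @param scheme   the auth scheme, e.g. "kerberos"
     * @param hostPort the "host:port" the client is about to authenticate against
     * @return the opaque auth token bytes to send on this connection
     */
    byte[] getAuthToken(String scheme, String hostPort);
}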
Re: What happens to a follower if leader hangs?
Hi Vishal, There are periodic pings sent from the leader to the followers. Take a look at Leader.java:

syncedSet.add(self.getId());
synchronized (learners) {
    for (LearnerHandler f : learners) {
        if (f.synced()) {
            syncedCount++;
            syncedSet.add(f.getSid());
        }
        f.ping();
    }
}

This code sends periodic pings to the followers to make sure they are running fine. We should keep track of these pings and see if we haven't seen a ping packet from the leader for a long time, and give up following the leader in that case. This is definitely worth fixing since we pride ourselves on being a highly available and reliable service. Please feel free to open a jira and work on it. 3.4 would be a good target for this. Thanks mahadev On 11/10/10 12:26 PM, Vishal Kher vishalm...@gmail.com wrote: Hi, In Follower.followLeader() after syncing with the leader, the follower does: while (self.isRunning()) { readPacket(qp); processPacket(qp); } It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if a Leader has a software hang, but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the Leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader after detecting that the leader is unresponsive. Please correct me if I am wrong. If I am not mistaken, should we add code at the follower to monitor the heartbeat messages that it receives from the leader and take action if it misses heartbeats for (syncLimit * tickTime) time? This certainly is a hypothetical case, however, I think it is worth a fix. Thanks. -Vishal
[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930779#action_12930779 ] Mahadev konar commented on ZOOKEEPER-928: - good point Flavio! I totally forgot about that. That should prevent this failure case. Vishal, your thoughts? Follower should stop following and start FLE if it does not receive pings from the leader - Key: ZOOKEEPER-928 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.2 Reporter: Vishal K Priority: Critical Fix For: 3.3.3, 3.4.0 In Follower.followLeader() after syncing with the leader, the follower does: while (self.isRunning()) { readPacket(qp); processPacket(qp); } It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if a Leader has a software hang, but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the Leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader. We should keep track of pings received from the leader and see if we haven't seen a ping packet from the leader for (syncLimit * tickTime) time and give up following the leader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930785#action_12930785 ] Mahadev konar commented on ZOOKEEPER-928: - vishal, Here is the definition of setSoTimeout - {code} public void setSoTimeout(int timeout) throws SocketException Enable/disable SO_TIMEOUT with the specified timeout, in milliseconds. With this option set to a non-zero timeout, a read() call on the InputStream associated with this Socket will block for only this amount of time. If the timeout expires, a java.net.SocketTimeoutException is raised, though the Socket is still valid. The option must be enabled prior to entering the blocking operation to have effect. The timeout must be > 0. A timeout of zero is interpreted as an infinite timeout. {code} This means that the read would block until the timeout and throw an exception if it doesn't hear from the leader during that time. Wouldn't this suffice? Follower should stop following and start FLE if it does not receive pings from the leader - Key: ZOOKEEPER-928 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.2 Reporter: Vishal K Priority: Critical Fix For: 3.3.3, 3.4.0 In Follower.followLeader() after syncing with the leader, the follower does: while (self.isRunning()) { readPacket(qp); processPacket(qp); } It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if a Leader has a software hang, but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the Leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader. We should keep track of pings received from the leader and see if we haven't seen a ping packet from the leader for (syncLimit * tickTime) time and give up following the leader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
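To make the behavior being described concrete, here is a small standalone sketch of a read loop guarded by SO_TIMEOUT; the host, port and timeout values are illustrative and this is not the actual Follower code:

import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    public static void main(String[] args) throws IOException {
        Socket sock = new Socket("leader.example.com", 2888); // illustrative address
        // If nothing arrives within roughly syncLimit * tickTime ms, read() throws
        // SocketTimeoutException and the follower can give up on the leader.
        sock.setSoTimeout(10 * 2000);
        DataInputStream in = new DataInputStream(sock.getInputStream());
        try {
            while (true) {
                int packetLen = in.readInt();      // blocks at most SO_TIMEOUT ms
                byte[] packet = new byte[packetLen];
                in.readFully(packet);
                // ... process packet ...
            }
        } catch (SocketTimeoutException e) {
            System.err.println("no packet from leader within timeout, abandoning leader");
            sock.close();
        }
    }
}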
[jira] Resolved: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar resolved ZOOKEEPER-928. - Resolution: Won't Fix Fix Version/s: (was: 3.3.3) (was: 3.4.0) No worries Vishal. Resolving the issue as won't fix. Follower should stop following and start FLE if it does not receive pings from the leader - Key: ZOOKEEPER-928 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928 Project: Zookeeper Issue Type: Bug Components: quorum, server Affects Versions: 3.3.2 Reporter: Vishal K Priority: Critical In Follower.followLeader() after syncing with the leader, the follower does: while (self.isRunning()) { readPacket(qp); processPacket(qp); } It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down. So a follower *with no clients* may never notice a faulty leader if a Leader has a software hang, but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the Leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader. We should keep track of pings received from the leader and see if we haven't seen a ping packet from the leader for (syncLimit * tickTime) time and give up following the leader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-926) Fork Hadoop common's test-patch.sh and modify for Zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-926: Fix Version/s: 3.4.0 Fork Hadoop common's test-patch.sh and modify for Zookeeper --- Key: ZOOKEEPER-926 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-926 Project: Zookeeper Issue Type: Improvement Components: build Reporter: Nigel Daley Fix For: 3.4.0 Attachments: ZOOKEEPER-926.patch Zookeeper currently uses the test-patch.sh script from the Hadoop nightly dir. This is now out of date. I propose we just copy the updated one in Hadoop common and then modify for ZK. This will also help as ZK moves out of Hadoop to it's own TLP. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-900: Fix Version/s: 3.4.0 FLE implementation should be improved to use non-blocking sockets - Key: ZOOKEEPER-900 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900 Project: Zookeeper Issue Type: Bug Reporter: Vishal K Assignee: Flavio Junqueira Priority: Critical Fix For: 3.4.0 From earlier email exchanges: 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires, otherwise, when the timer expires it interrupts the AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective peer. b) The blocking IO problem is not just restricted to connectOne(), but also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized, otherwise we could run into deadlocks. This code is going to be difficult to maintain/modify. Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-900) FLE implementation should be improved to use non-blocking sockets
[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930225#action_12930225 ] Mahadev konar commented on ZOOKEEPER-900: - This is definitely a good step in cleaning up QuorumCnxnManager! I have updated the jira to mark it for 3.4 release. Vishal, is that ok with you? Would you be able to provide a patch for the 3.4 release? FLE implementation should be improved to use non-blocking sockets - Key: ZOOKEEPER-900 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900 Project: Zookeeper Issue Type: Bug Reporter: Vishal K Assignee: Vishal K Priority: Critical Fix For: 3.4.0 From earlier email exchanges: 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires, otherwise, when the timer expires it interrupts the AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective peer. b) The blocking IO problem is not just restricted to connectOne(), but also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). Also the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized, otherwise we could run into deadlocks. This code is going to be difficult to maintain/modify. Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
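For reference, here is a self-contained sketch of the non-blocking connect pattern the issue is asking for, using a java.nio Selector so the calling thread is never stuck longer than a bounded timeout. This illustrates the technique only; it is not QuorumCnxManager code:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NonBlockingConnect {
    // Attempts a connect without blocking the calling thread for more than
    // timeoutMs; returns the connected channel, or null on timeout.
    public static SocketChannel connect(InetSocketAddress addr, long timeoutMs)
            throws IOException {
        SocketChannel ch = SocketChannel.open();
        ch.configureBlocking(false);
        Selector selector = Selector.open();
        try {
            ch.register(selector, SelectionKey.OP_CONNECT);
            if (ch.connect(addr)) {
                return ch; // connected immediately (e.g. loopback)
            }
            // Wait for OP_CONNECT to become ready, but never longer than timeoutMs.
            if (selector.select(timeoutMs) > 0 && ch.finishConnect()) {
                return ch; // caller may switch back to blocking mode if desired
            }
            ch.close();
            return null;   // peer unreachable within the bound
        } finally {
            selector.close();
        }
    }
}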
[jira] Updated: (ZOOKEEPER-918) Review of BookKeeper Documentation (Sequence flow and failure scenarios)
[ https://issues.apache.org/jira/browse/ZOOKEEPER-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-918: Fix Version/s: 3.4.0 3.3.3 Review of BookKeeper Documentation (Sequence flow and failure scenarios) Key: ZOOKEEPER-918 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-918 Project: Zookeeper Issue Type: Task Components: documentation Reporter: Amit Jaiswal Priority: Trivial Fix For: 3.3.3, 3.4.0 Attachments: BookKeeperInternals.pdf Original Estimate: 2h Remaining Estimate: 2h I have prepared a document describing some of the internals of bookkeeper in terms of: 1. Sequence of operations 2. Files layout 3. Failure scenarios The document is prepared by mostly by reading the code. Can somebody who understands the design review the same. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Windows port of ZK C api
Hi Camille, I think definitely there is. I think a build script with a set of requirements and a nice set of docs on how to start using it would be great. BTW, there is a C# binding which someone wrote earlier http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings You can take a look at that and see if you want to extend that or write your own. Thanks mahadev On 11/3/10 7:18 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Hi everyone, We have a requirement for a native windows-compatible version of the ZK C api. We're currently working on various ways to do this port, but would very much like to submit this back to you all when we are finished so that we don't have to maintain the code ourselves through future releases. Is there interest in having this? What would you need with this patch (build scripts, etc) to accept it? Thanks, Camille
[jira] Updated: (ZOOKEEPER-917) Leader election selected incorrect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-917: Fix Version/s: 3.4.0 3.3.3 Leader election selected incorrect leader - Key: ZOOKEEPER-917 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917 Project: Zookeeper Issue Type: Bug Components: leaderElection, server Affects Versions: 3.2.2 Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries) Debian lenny Reporter: Alexandre Hardy Priority: Critical Fix For: 3.3.3, 3.4.0 Attachments: zklogs-20101102144159SAST.tar.gz We had three nodes running zookeeper: * 192.168.130.10 * 192.168.130.11 * 192.168.130.14 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup). The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11 was permanently removed from service and could not contribute to the quorum any further (powered off). DNS entries were updated for the new node to allow all the zookeeper servers to find the new node. The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had not seen the latest zxid. This particular problem has not been verified with later versions of zookeeper, and no attempt has been made to reproduce this problem as yet. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
FW: [Hadoop Wiki] Update of ZooKeeper/ZKClientBindings by yfinkelstein
Nice to see this! Thanks mahadev -- Forwarded Message From: Apache Wiki wikidi...@apache.org Reply-To: common-...@hadoop.apache.org Date: Tue, 2 Nov 2010 14:39:24 -0700 To: Apache Wiki wikidi...@apache.org Subject: [Hadoop Wiki] Update of ZooKeeper/ZKClientBindings by yfinkelstein Dear Wiki user, You have subscribed to a wiki page or wiki category on Hadoop Wiki for change notification. The ZooKeeper/ZKClientBindings page has been changed by yfinkelstein. http://wiki.apache.org/hadoop/ZooKeeper/ZKClientBindings?action=diffrev1=5; rev2=6 -- ||Binding||Author||URL|| ||Scala||Steve Jenson, John Corwin||http://github.com/twitter/scala-zookeeper-client|| ||C#||Eric Hauser||http://github.com/ewhauser/zookeeper|| - || || || || + ||Node.js||Yuri Finkelstein||http://github.com/yfinkelstein/node-zookeeper|| -- End of Forwarded Message
[jira] Updated: (ZOOKEEPER-915) Errors that happen during sync() processing at the leader do not get propagated back to the client.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-915: Fix Version/s: 3.4.0 3.3.3 Errors that happen during sync() processing at the leader do not get propagated back to the client. --- Key: ZOOKEEPER-915 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-915 Project: Zookeeper Issue Type: Bug Reporter: Benjamin Reed Fix For: 3.3.3, 3.4.0 If an error in sync() processing happens at the leader (SESSION_MOVED for example), they are not propagated back to the client. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927020#action_12927020 ] Mahadev konar commented on ZOOKEEPER-907: - that's ok with me ben. Can you review and +1 the patch if it's all good to go? Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2 The sync request does not set the session owner in Request. As a result, the leader keeps printing: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-872) Small fixes to PurgeTxnLog
[ https://issues.apache.org/jira/browse/ZOOKEEPER-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-872: Status: Patch Available (was: Open) Small fixes to PurgeTxnLog --- Key: ZOOKEEPER-872 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-872 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-872 PurgeTxnLog forces us to have at least 2 backups (by having count >= 3). Also, it prints to stdout instead of using Logger. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-850) Switch from log4j to slf4j
[ https://issues.apache.org/jira/browse/ZOOKEEPER-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-850: Fix Version/s: 3.4.0 Assignee: Olaf Krische Switch from log4j to slf4j -- Key: ZOOKEEPER-850 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-850 Project: Zookeeper Issue Type: Improvement Components: java client Affects Versions: 3.3.1 Reporter: Olaf Krische Assignee: Olaf Krische Fix For: 3.4.0 Attachments: ZOOKEEPER-3.3.1-log4j-slf4j-20101031.patch.bz2 Hello, I would like to see slf4j integrated into zookeeper instead of relying explicitly on log4j. slf4j is an abstract logging framework. There are adapters from slf4j to many logger implementations, one of them is log4j. I don't like to make the decision about which log engine to use so early. This would help me to embed zookeeper in my own applications (which use a different logger implementation, but slf4j is the basis). What do you think? (as I can see, those slf4j requests flood all other projects on apache as well :-) Maybe for 3.4 or 4.0? I can offer a patchset, I have experience in such a migration already. :-) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
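For context, the calling convention being proposed looks like this at call sites. This is just a minimal slf4j usage sketch; the actual migration patch is attached to the issue:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Slf4jExample {
    // The slf4j facade: code compiles against this API, and the binding
    // (log4j, logback, java.util.logging, ...) is chosen at deployment time.
    private static final Logger LOG = LoggerFactory.getLogger(Slf4jExample.class);

    public static void main(String[] args) {
        // Parameterized messages avoid string concatenation when the level is disabled.
        LOG.info("connected to {} in {} ms", "localhost:2181", 42);
    }
}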
[jira] Commented: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925858#action_12925858 ] Mahadev konar commented on ZOOKEEPER-897: - jared, pat, I am ok without a test case for this one, because it's quite hard to create one. I just wanted someone else to run the tests on their machines just to verify (since I rarely see any problems in c tests on my machine). I will go ahead and commit this patch for now. C Client seg faults during close Key: ZOOKEEPER-897 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-897 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-897.diff, ZOOKEEPER-897.patch We observed a crash while closing our c client. It was in the do_io() thread that was processing during the close() call. #0 queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969 #1 0x0046234e in check_events (zh=0x6bd480, events=<value optimized out>) at src/zookeeper.c:1687 #2 0x00462d74 in zookeeper_process (zh=0x6bd480, events=2) at src/zookeeper.c:1971 #3 0x00469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311 #4 0x77bc59ca in start_thread () from /lib/libpthread.so.0 #5 0x76f706fd in clone () from /lib/libc.so.6 #6 0x in ?? () We tracked down the sequence of events, and the cause is that input_buffer is being freed from a thread other than the do_io thread that relies on it: 1. do_io() calls check_events() 2. if (events & ZOOKEEPER_READ) branch executes 3. if (rc > 0) branch executes 4. if (zh->input_buffer != zh->primer_buffer) branch executes ...in the meantime... 5. zookeeper_close() called 6. if (inc_ref_counter(zh,0)!=0) branch executes 7. cleanup_bufs() is called 8. input_buffer is freed at the end ...back to check_events()... 9. queue_events() is called on a NULL buffer. I believe the patch is to only call free_completions() in zookeeper_close() and not cleanup_bufs(). The original reason cleanup_bufs() was added was to call any outstanding synchronous completions, so only free_completions (which is guarded) is needed. I will submit a patch for review with this change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-897: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I just committed this. thanks jared! C Client seg faults during close Key: ZOOKEEPER-897 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-897 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-897.diff, ZOOKEEPER-897.patch We observed a crash while closing our c client. It was in the do_io() thread that was processing during the close() call. #0 queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969 #1 0x0046234e in check_events (zh=0x6bd480, events=<value optimized out>) at src/zookeeper.c:1687 #2 0x00462d74 in zookeeper_process (zh=0x6bd480, events=2) at src/zookeeper.c:1971 #3 0x00469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311 #4 0x77bc59ca in start_thread () from /lib/libpthread.so.0 #5 0x76f706fd in clone () from /lib/libc.so.6 #6 0x in ?? () We tracked down the sequence of events, and the cause is that input_buffer is being freed from a thread other than the do_io thread that relies on it: 1. do_io() calls check_events() 2. if (events & ZOOKEEPER_READ) branch executes 3. if (rc > 0) branch executes 4. if (zh->input_buffer != zh->primer_buffer) branch executes ...in the meantime... 5. zookeeper_close() called 6. if (inc_ref_counter(zh,0)!=0) branch executes 7. cleanup_bufs() is called 8. input_buffer is freed at the end ...back to check_events()... 9. queue_events() is called on a NULL buffer. I believe the patch is to only call free_completions() in zookeeper_close() and not cleanup_bufs(). The original reason cleanup_bufs() was added was to call any outstanding synchronous completions, so only free_completions (which is guarded) is needed. I will submit a patch for review with this change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-898) C Client might not cleanup correctly during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12925909#action_12925909 ] Mahadev konar commented on ZOOKEEPER-898: - +1, good catch jared. I just committed this to 3.3 and trunk. thanks! C Client might not cleanup correctly during close - Key: ZOOKEEPER-898 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Priority: Trivial Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch I was looking through the c-client code and noticed a situation where a counter can be incorrectly incremented and a small memory leak can occur. In zookeeper.c : add_completion(), if close_requested is true, then the completion will not be queued. But at the end, outstanding_sync is still incremented and free() never called on the newly allocated completion_list_t. I will submit for review a diff that I believe corrects this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-898) C Client might not cleanup correctly during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-898: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) C Client might not cleanup correctly during close - Key: ZOOKEEPER-898 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-898 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Priority: Trivial Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-898.diff, ZOOKEEPER-898.patch I was looking through the c-client code and noticed a situation where a counter can be incorrectly incremented and a small memory leak can occur. In zookeeper.c : add_completion(), if close_requested is true, then the completion will not be queued. But at the end, outstanding_sync is still incremented and free() never called on the newly allocated completion_list_t. I will submit for review a diff that I believe corrects this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-517) NIO factory fails to close connections when the number of file handles run out.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-517: Fix Version/s: (was: 3.3.2) 3.3.3 moving this out to 3.3.3 and 3.4 for investigation. NIO factory fails to close connections when the number of file handles run out. --- Key: ZOOKEEPER-517 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-517 Project: Zookeeper Issue Type: Bug Components: server Reporter: Mahadev konar Assignee: Benjamin Reed Priority: Critical Fix For: 3.3.3, 3.4.0 The code in NIO factory is such that if we fail to accept a connection for some reason (too many file handles may be one of them) we do not close the connections that are in CLOSE_WAIT. We need to call an explicit close on these sockets. One of the solutions might be to move doIO before accept so that we can still close connections even if we cannot accept connections. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926004#action_12926004 ] Mahadev konar commented on ZOOKEEPER-907: - ben, can ZOOKEEPER-915 also be marked for 3.3.2 then? Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2 The sync request does not set the session owner in Request. As a result, the leader keeps printing: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-884) Remove LedgerSequence references from BookKeeper documentation and comments in tests
[ https://issues.apache.org/jira/browse/ZOOKEEPER-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-884: Fix Version/s: 3.4.0 Remove LedgerSequence references from BookKeeper documentation and comments in tests - Key: ZOOKEEPER-884 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-884 Project: Zookeeper Issue Type: Bug Components: contrib-bookkeeper Affects Versions: 3.3.1 Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.0 Attachments: ZOOKEEPER-884.patch We no longer use LedgerSequence, so we need to remove references in documentation and comments sprinkled throughout the code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-907) Spurious KeeperErrorCode = Session moved messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-907: Status: Patch Available (was: Open) Spurious KeeperErrorCode = Session moved messages --- Key: ZOOKEEPER-907 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-907 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-907.patch, ZOOKEEPER-907.patch_v2 The sync request does not set the session owner in Request. As a result, the leader keeps printing: 2010-07-01 10:55:36,733 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x298d3b1fa9 type:sync: cxid:0x6 zxid:0xfffe txntype:unknown reqpath:/ Error Path:null Error:KeeperErrorCode = Session moved -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
[ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-702: Fix Version/s: 3.4.0 GSoC 2010: Failure Detector Model - Key: ZOOKEEPER-702 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702 Project: Zookeeper Issue Type: Wish Reporter: Henry Robinson Assignee: Abmar Barros Fix For: 3.4.0 Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, ZOOKEEPER-702-code.patch, ZOOKEEPER-702-doc.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch Failure Detector Module Possible Mentor Henry Robinson (henry at apache dot org) Requirements Java, some distributed systems knowledge, comfort implementing distributed systems protocols Description ZooKeeper servers detects the failure of other servers and clients by counting the number of 'ticks' for which it doesn't get a heartbeat from other machines. This is the 'timeout' method of failure detection and works very well; however it is possible that it is too aggressive and not easily tuned for some more unusual ZooKeeper installations (such as in a wide-area network, or even in a mobile ad-hoc network). This project would abstract the notion of failure detection to a dedicated Java module, and implement several failure detectors to compare and contrast their appropriateness for ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which is much more tunable and has some very interesting properties. This is a great project if you are interested in distributed algorithms, or want to help re-factor some of ZooKeeper's internal code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
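As a rough sketch of the abstraction the project describes, the simplest detector (the fixed-timeout approach ZooKeeper effectively uses today) might sit behind a common interface like the one below. The names are illustrative only, not the attached GSoC patches:

// Illustrative abstraction; the real patches are attached to the issue.
public interface FailureDetector {
    /** Record that a heartbeat from the peer arrived at time 'now' (ms). */
    void heartbeatReceived(long now);

    /** Decide whether the peer should currently be suspected as failed. */
    boolean shouldSuspect(long now);
}

/** The fixed-timeout detector: suspect after a fixed silence period. */
class TimeoutFailureDetector implements FailureDetector {
    private final long timeoutMs;
    private volatile long lastHeartbeat;

    TimeoutFailureDetector(long timeoutMs, long now) {
        this.timeoutMs = timeoutMs;
        this.lastHeartbeat = now;
    }

    @Override
    public void heartbeatReceived(long now) {
        lastHeartbeat = now;
    }

    @Override
    public boolean shouldSuspect(long now) {
        return now - lastHeartbeat > timeoutMs;
    }
}

A phi-accrual detector would implement the same interface but compute a suspicion level from the observed inter-arrival distribution instead of a fixed cutoff.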
[jira] Commented: (ZOOKEEPER-904) super digest is not actually acting as a full superuser
[ https://issues.apache.org/jira/browse/ZOOKEEPER-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925087#action_12925087 ] Mahadev konar commented on ZOOKEEPER-904: - good catch. +1 for the patch. I'll run ant test and will commit to both 3.3.2 and 3.4. super digest is not actually acting as a full superuser --- Key: ZOOKEEPER-904 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-904 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-904-332.patch, ZOOKEEPER-904.patch The documentation states: New in 3.2: Enables a ZooKeeper ensemble administrator to access the znode hierarchy as a super user. In particular no ACL checking occurs for a user authenticated as super. However, if a super user does something like: zk.setACL("/", Ids.READ_ACL_UNSAFE, -1); the super user is now bound by read-only ACL. This is not what I would expect to see given the documentation. It can be fixed by moving the check for the super authId in PrepRequestProcessor.checkACL to before the for(ACL a : acl) loop. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
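In outline, the fix being described hoists the super check above the per-ACL loop. Below is a simplified standalone sketch of that ordering; it is not the committed PrepRequestProcessor code, which delegates scheme matching to pluggable authentication providers:

import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class AclCheckSketch {
    private final boolean skipACL = false; // stands in for the real skipACL switch

    void checkACL(List<ACL> acl, int perm, List<Id> authIds) throws KeeperException {
        if (skipACL || acl == null || acl.isEmpty()) {
            return;
        }
        // Super check hoisted *above* the per-ACL loop, so a session authenticated
        // as "super" is never constrained by whatever ACL happens to be on the node.
        for (Id authId : authIds) {
            if ("super".equals(authId.getScheme())) {
                return;
            }
        }
        for (ACL a : acl) {
            if ((a.getPerms() & perm) == 0) {
                continue; // this ACL entry does not grant the requested permission
            }
            Id aclId = a.getId();
            if ("world".equals(aclId.getScheme()) && "anyone".equals(aclId.getId())) {
                return;
            }
            for (Id authId : authIds) {
                if (authId.getScheme().equals(aclId.getScheme())
                        && authId.getId().equals(aclId.getId())) {
                    return;
                }
            }
        }
        throw new KeeperException.NoAuthException();
    }
}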
Re: Apache now has reviewboard
That's really nice. mahadev On 10/25/10 10:16 PM, Patrick Hunt ph...@apache.org wrote: FYI: https://blogs.apache.org/infra/entry/reviewboard_instance_running_at_the We should start using this, I've used it for other projects and it worked out quite well. Patrick
[jira] Commented: (ZOOKEEPER-87) Follower does not shut itself down if its too far behind the leader.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925165#action_12925165 ] Mahadev konar commented on ZOOKEEPER-87: exactly vishal. The problem is to avoid clients being connected to a server which can lag behind the leader. Follower does not shut itself down if its too far behind the leader. Key: ZOOKEEPER-87 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-87 Project: Zookeeper Issue Type: Bug Components: quorum Reporter: Mahadev konar Assignee: Mahadev konar Priority: Critical Fix For: 3.4.0 Currently, if the follower is lagging behind but keeps sending pings to the leader, it will stay alive and will keep getting further and further behind the leader. The follower should shut itself down if it is not able to keep up with the leader within some limit so that guarantees about updates can be made to the clients connected to different servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: New strategy for Netty (ZOOKEEPER-823) was: What's the QA strategy of ZooKeeper?
This sounds great. Thanks mahadev On 10/16/10 1:56 AM, Thomas Koch tho...@koch.ro wrote: Benjamin Reed: actually, the other way of doing the netty patch (since i'm scared of merges) would be to do a refactor cleanup patch with an eye toward netty, and then another patch to actually add netty. [...] Hi Benjamin, I've had exactly the same thought last evening. Instead of trying to find the bug(s) in the current patch, I'd like to start it over again and do small incremental changes from the current trunk towards the current ZOOKEEPER-823 patch. Maybe I could do this in ZOOKEEPER-823 patch, this would mean to revert the already applied ZOOKEEPER-823 patch. Then I want to test each incremental step at least 5 times to find the step(s) that breaks ZK. This approach should take me another two weeks, I believe, mostly because each Test run takes ~15-25 minutes. Cheers, Thomas Koch, http://www.koch.ro
[jira] Commented: (ZOOKEEPER-904) super digest is not actually acting as a full superuser
[ https://issues.apache.org/jira/browse/ZOOKEEPER-904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925177#action_12925177 ] Mahadev konar commented on ZOOKEEPER-904: - Ran the tests, they pass. I just committed this to 3.3 and trunk. thanks Camille. super digest is not actually acting as a full superuser --- Key: ZOOKEEPER-904 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-904 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-904-332.patch, ZOOKEEPER-904.patch The documentation states: New in 3.2: Enables a ZooKeeper ensemble administrator to access the znode hierarchy as a super user. In particular no ACL checking occurs for a user authenticated as super. However, if a super user does something like: zk.setACL("/", Ids.READ_ACL_UNSAFE, -1); the super user is now bound by read-only ACL. This is not what I would expect to see given the documentation. It can be fixed by moving the check for the super authId in PrepRequestProcessor.checkACL to before the for(ACL a : acl) loop. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-904) super digest is not actually acting as a full superuser
[ https://issues.apache.org/jira/browse/ZOOKEEPER-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-904: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) super digest is not actually acting as a full superuser --- Key: ZOOKEEPER-904 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-904 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-904-332.patch, ZOOKEEPER-904.patch The documentation states: New in 3.2: Enables a ZooKeeper ensemble administrator to access the znode hierarchy as a super user. In particular no ACL checking occurs for a user authenticated as super. However, if a super user does something like: zk.setACL("/", Ids.READ_ACL_UNSAFE, -1); the super user is now bound by read-only ACL. This is not what I would expect to see given the documentation. It can be fixed by moving the check for the super authId in PrepRequestProcessor.checkACL to before the for(ACL a : acl) loop. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Allowing a ZooKeeper server to be part of multiple clusters
Hi Vishal, This idea (2.) had been kicked around initially by Flavio. I think he'll probably chip in on the discussion. I am just curious: what is the idea behind your proposal? Is this to provide some kind of failure guarantees between a 2 node and 3 node cluster? Thanks mahadev On 10/25/10 1:05 PM, Vishal K vishalm...@gmail.com wrote: Hi All, I am thinking about the choices one would have to support multiple 2-node clusters. Assume that for some reason one needs to support multiple 2-node clusters. This would mean they will have to figure out a way to run a third instance of ZK server for each cluster somewhere to ensure that a ZK cluster is available after a failure. This works well if we have to run one or two 2-node clusters. However, what if we have to run many 2-node clusters? I have the following options: 1. Find m machines to run the third instance of each cluster. Run n/m instances of ZK on each machine. 2. Modify ZooKeeper server to participate in multiple clusters. This will allow us to run y instances of the third node where each instance will be part of n/y clusters. 3. Run the third instance of ZK server required for the ith cluster on one of the servers of the (i+1)%n cluster. Essentially, distribute the third instance across the other clusters. The pros and cons of each approach are fairly obvious. While I prefer the third approach, I would like to check what everyone thinks about the second approach. Thanks. -Vishal
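As a tiny worked example of the placement arithmetic in option 3 above (the cluster count is made up), the third instance for cluster i simply lands on a server of cluster (i + 1) % n:
{code}
public class ThirdInstancePlacement {
    public static void main(String[] args) {
        int n = 4; // number of 2-node clusters (illustrative)
        for (int i = 0; i < n; i++) {
            System.out.println("cluster " + i
                    + " -> third instance hosted on cluster " + ((i + 1) % n));
        }
    }
}
{code}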
[jira] Commented: (ZOOKEEPER-805) four letter words fail with latest ubuntu nc.openbsd
[ https://issues.apache.org/jira/browse/ZOOKEEPER-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12924732#action_12924732 ] Mahadev konar commented on ZOOKEEPER-805: - sure, do you want to add some documentation to zookeeper admin guide to make it clearer on using -q and the issue with openbsd? four letter words fail with latest ubuntu nc.openbsd Key: ZOOKEEPER-805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-805 Project: Zookeeper Issue Type: Bug Components: documentation, server Affects Versions: 3.3.1, 3.4.0 Reporter: Patrick Hunt Priority: Critical Fix For: 3.3.2, 3.4.0 In both 3.3 branch and trunk echo stat|nc localhost 2181 fails against the ZK server on Ubuntu Lucid Lynx. I noticed this after upgrading to lucid lynx - which is now shipping openbsd nc as the default: OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2) vs nc traditional [v1.10-38] which works fine. Not sure if this is a bug in us or nc.openbsd, but it's currently not working for me. Ugh. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-897) C Client seg faults during close
[ https://issues.apache.org/jira/browse/ZOOKEEPER-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-897: Status: Open (was: Patch Available) Jared, The patch that you provided leaks memory for the zookeeper client. We have to clean up the to_send and to_process buffers on close and free them. With which release did you observe the problem? I had tried to fix all the issues with zookeeper_close() in ZOOKEEPER-591. Also, Michi has fixed a couple of other issues in ZOOKEEPER-804. What version of the code are you running? Also, can you provide some test case which causes this issue? (I know it's hard to reproduce, but even a test that reproduces it once in 10-20 times is good enough). C Client seg faults during close Key: ZOOKEEPER-897 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-897 Project: Zookeeper Issue Type: Bug Components: c client Reporter: Jared Cantwell Assignee: Jared Cantwell Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEEPER-897.diff, ZOOKEEPER-897.patch We observed a crash while closing our c client. It was in the do_io() thread that was still processing during the close() call. #0 queue_buffer (list=0x6bd4f8, b=0x0, add_to_front=0) at src/zookeeper.c:969 #1 0x0046234e in check_events (zh=0x6bd480, events=<value optimized out>) at src/zookeeper.c:1687 #2 0x00462d74 in zookeeper_process (zh=0x6bd480, events=2) at src/zookeeper.c:1971 #3 0x00469c34 in do_io (v=0x6bd480) at src/mt_adaptor.c:311 #4 0x77bc59ca in start_thread () from /lib/libpthread.so.0 #5 0x76f706fd in clone () from /lib/libc.so.6 #6 0x in ?? () We tracked down the sequence of events, and the cause is that input_buffer is being freed from a thread other than the do_io thread that relies on it: 1. do_io() calls check_events() 2. the if (events & ZOOKEEPER_READ) branch executes 3. the if (rc > 0) branch executes 4. the if (zh->input_buffer != zh->primer_buffer) branch executes ...in the meantime... 5. zookeeper_close() is called 6. the if (inc_ref_counter(zh,0)!=0) branch executes 7. cleanup_bufs() is called 8. input_buffer is freed at the end ...back to check_events()... 9. queue_events() is called on a NULL buffer. I believe the patch is to only call free_completions() in zookeeper_close() and not cleanup_bufs(). The original reason cleanup_bufs() was added was to call any outstanding synchronous completions, so only free_completions (which is guarded) is needed. I will submit a patch for review with this change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-903) Create a testing jar with useful classes from ZK test source
[ https://issues.apache.org/jira/browse/ZOOKEEPER-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-903: Fix Version/s: 3.4.0 Create a testing jar with useful classes from ZK test source Key: ZOOKEEPER-903 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-903 Project: Zookeeper Issue Type: Improvement Components: tests Reporter: Camille Fournier Fix For: 3.4.0 From mailing list: -Original Message- From: Benjamin Reed Sent: Monday, October 18, 2010 11:12 AM To: zookeeper-u...@hadoop.apache.org Subject: Re: Testing zookeeper outside the source distribution? we should be exposing those classes and releasing them as a testing jar. do you want to open up a jira to track this issue? ben On 10/18/2010 05:17 AM, Anthony Urso wrote: Anyone have any pointers on how to test against ZK outside of the source distribution? All the fun classes (e.g. ClientBase) do not make it into the ZK release jar. Right now I am manually running a ZK node for the unit tests to connect to prior to running my test, but I would rather have something that ant could reliably automate starting and stopping for CI. Thanks, Anthony -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-904) super digest is not actually acting as a full superuser
[ https://issues.apache.org/jira/browse/ZOOKEEPER-904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-904: Fix Version/s: 3.4.0 super digest is not actually acting as a full superuser --- Key: ZOOKEEPER-904 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-904 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.3.1 Reporter: Camille Fournier Assignee: Camille Fournier Fix For: 3.4.0 The documentation states: New in 3.2: Enables a ZooKeeper ensemble administrator to access the znode hierarchy as a super user. In particular no ACL checking occurs for a user authenticated as super. However, if a super user does something like: zk.setACL("/", Ids.READ_ACL_UNSAFE, -1); the super user is now bound by the read-only ACL. This is not what I would expect to see given the documentation. It can be fixed by moving the check for the super authId in PrepRequestProcessor.checkACL to before the for(ACL a : acl) loop. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: implications of netty on client connections
Hi Camille, I am a little curious here. Does this mean you tried a single zookeeper server with 16K clients? Thanks mahadev On 10/20/10 1:07 PM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Thanks Patrick, I'll look and see if I can figure out a clean change for this. It was the kernel limit for max number of open fds for the process that was where the problem shows up (not zk limit). FWIW, we tested with a process fd limit of 16K, and ZK performed reasonably well until the fd limit was reached, at which point it choked. There was a throughput degradation, but mostly going from 0 to 4000 connections. 4000 to 16000 was mostly flat until the sharp drop. For our use case it is fine to have a bit of performance loss with huge numbers of connections, so long as we can handle the choke, which for initial rollout I'm planning on just monitoring for. C -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Wednesday, October 20, 2010 2:06 PM To: zookeeper-dev@hadoop.apache.org Subject: Re: implications of netty on client connections It may just be the case that we haven't tested sufficiently for this case (running out of fds) and we need to handle this better even in nio. Probably by cutting off op_connect in the selector. We should be able to do similar in netty. Btw, on unix one can access the open/max fd count using this: http://download.oracle.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html Secondly, are you running into a kernel limit or a zk limit? Take a look at this post describing 1 million concurrent connections to a box: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3 specifically: -- During various test with lots of connections, I ended up making some additional changes to my sysctl.conf. This was part trial-and-error, I don't really know enough about the internals to make especially informed decisions about which values to change. My policy was to wait for things to break, check /var/log/kern.log and see what mysterious error was reported, then increase stuff that sounded sensible after a spot of googling. Here are the settings in place during the above test: net.core.rmem_max = 33554432 net.core.wmem_max = 33554432 net.ipv4.tcp_rmem = 4096 16384 33554432 net.ipv4.tcp_wmem = 4096 16384 33554432 net.ipv4.tcp_mem = 786432 1048576 26777216 net.ipv4.tcp_max_tw_buckets = 36 net.core.netdev_max_backlog = 2500 vm.min_free_kbytes = 65536 vm.swappiness = 0 net.ipv4.ip_local_port_range = 1024 65535 -- I'm guessing that even with this, at some point you'll run into a limit in our server implementation. In particular I suspect that we may start to respond more slowly to pings, eventually getting so bad it would time out. We'd have to debug that and address (optimize). http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3 Patrick On Tue, Oct 19, 2010 at 7:16 AM, Fournier, Camille F. [Tech] camille.fourn...@gs.com wrote: Hi everyone, I'm curious what the implications of using netty are going to be for the case where a server gets close to its max available file descriptors. Right now our somewhat limited testing has shown that a ZK server performs fine up to the point when it runs out of available fds, at which point performance degrades sharply and new connections get into a somewhat bad state. Is netty going to enable the server to handle this situation more gracefully (or is there a way to do this already that I haven't found)? 
Limiting connections from the same client is not enough since we can potentially have far more clients wanting to connect than available fds for certain use cases we might consider. Thanks, Camille
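The com.sun.management MXBean Patrick links to in the thread above can be used to watch a process's fd usage directly. A minimal sketch, assuming a Sun/Oracle JVM on a Unix-like platform (the bean is not exposed elsewhere):
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // current number of open file descriptors vs. the process limit
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + ", max fds: " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("fd counts not exposed on this JVM/platform");
        }
    }
}
{code}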
Re: Heisenbugs, Bohrbugs, Mandelbugs?
Hi Thomas, Could you verify this by just testing the trunk without your patch? You might very well be right that those tests are a little flaky. As for the hudson builds, Nigel is working on getting the patch builds for zookeeper running. As soon as that gets fixed, these flaky tests will show up more often. Thanks mahadev On 10/20/10 11:48 PM, Thomas Koch tho...@koch.ro wrote: Hi, last night I let my hudson server do 42 (sic) builds of ZooKeeper trunk. One of these builds failed: junit.framework.AssertionFailedError: Leader hasn't joined: 5 at org.apache.zookeeper.test.FLETest.testLE(FLETest.java:312) I did this many builds of trunk, because in my quest to redo the client netty integration step by step I made one step which resulted in 2 failed builds out of 8. The two failures were both: junit.framework.AssertionFailedError: Threads didn't join at org.apache.zookeeper.test.FLERestartTest.testLERestart(FLERestartTest.java:198) I can't find any relationship between the above test and my changes. The test does not use the ZooKeeper client code at all. So I begin to believe that there are some Heisenbugs, Bohrbugs or Mandelbugs[1] in ZooKeeper that just happen to show up from time to time without any relationship to the current changes. I'll try to investigate the cause further, maybe there is some relationship I've not yet found. But if my assumption should apply, then these kinds of bugs would be a strong argument in favor of refactoring. These bugs are best found by cleaning the code, most importantly implementing strict separation of concerns. Wouldn't you like to set up Hudson to build ZooKeeper trunk every half an hour? [1] http://en.wikipedia.org/wiki/Unusual_software_bug Best regards, Thomas Koch, http://www.koch.ro
[jira] Commented: (ZOOKEEPER-805) four letter words fail with latest ubuntu nc.openbsd
[ https://issues.apache.org/jira/browse/ZOOKEEPER-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12924021#action_12924021 ] Mahadev konar commented on ZOOKEEPER-805: - Pat, You think this should go into 3.3.2? four letter words fail with latest ubuntu nc.openbsd Key: ZOOKEEPER-805 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-805 Project: Zookeeper Issue Type: Bug Components: documentation, server Affects Versions: 3.3.1, 3.4.0 Reporter: Patrick Hunt Priority: Critical Fix For: 3.3.2, 3.4.0 In both 3.3 branch and trunk echo stat|nc localhost 2181 fails against the ZK server on Ubuntu Lucid Lynx. I noticed this after upgrading to lucid lynx - which is now shipping openbsd nc as the default: OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2) vs nc traditional [v1.10-38] which works fine. Not sure if this is a bug in us or nc.openbsd, but it's currently not working for me. Ugh. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Restarting discussion on ZooKeeper as a TLP
NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matt Massie be appointed to the office of Vice President, Apache ZooKeeper, to I think you meant Patrick Hunt ? :) Other than that it looks good. Thanks mahadev On 10/21/10 1:28 PM, Patrick Hunt ph...@apache.org wrote: Ack, I missed Henry in the list, sorry! In my defense I copied this: http://hadoop.apache.org/zookeeper/credits.html one more try (same as before except for adding henry to the pmc): X. Establish the Apache ZooKeeper Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to data serialization for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache ZooKeeper Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache ZooKeeper Project be and hereby is responsible for the creation and maintenance of software related to data serialization; and be it further RESOLVED, that the office of Vice President, Apache ZooKeeper be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache ZooKeeper Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache ZooKeeper Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache ZooKeeper Project: * Patrick Hunt ph...@apache.org * Flavio Junqueira f...@apache.org * Mahadev Konarmaha...@apache.org * Benjamin Reedbr...@apache.org * Henry Robinson he...@apache.org NOW, THEREFORE, BE IT FURTHER RESOLVED, that Matt Massie be appointed to the office of Vice President, Apache ZooKeeper, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed; and be it further RESOLVED, that the initial Apache ZooKeeper PMC be and hereby is tasked with the creation of a set of bylaws intended to encourage open development and increased participation in the Apache ZooKeeper Project; and be it further RESOLVED, that the Apache ZooKeeper Project be and hereby is tasked with the migration and rationalization of the Apache Hadoop ZooKeeper sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Hadoop ZooKeeper sub-project encumbered upon the Apache Hadoop Project are hereafter discharged. On Thu, Oct 21, 2010 at 10:44 AM, Henry Robinson he...@cloudera.com wrote: Looks good, please do call a vote. On 21 October 2010 09:29, Patrick Hunt ph...@apache.org wrote: Here's a draft board resolution (not a vote, just discussion). It lists all current committers (except as noted in the next paragraph) as the initial members of the project management committee (PMC) and myself as the initial chair. Notice that I have left Andrew off the PMC as he has not been active with the project for over two years. I believe we should continue to include him on the committer roles subsequent to moving to tlp, however as he has not been an active member of the community for such a long period we would not include him on the PMC at this time. 
If others feel differently let me know, I'm willing to include him if the people feel differently. LMK if this looks good to you and I'll call for an official vote on this list (then we'll be ready to call a vote on the hadoop pmc). Regards, Patrick X. Establish the Apache ZooKeeper Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to data serialization for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache ZooKeeper Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache ZooKeeper Project be and hereby
[jira] Updated: (ZOOKEEPER-800) zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE
[ https://issues.apache.org/jira/browse/ZOOKEEPER-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-800: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) +1 I just committed this. thanks michi. zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE --- Key: ZOOKEEPER-800 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-800 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Priority: Minor Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-800.patch This happened when I called zoo_add_auth() immediately after zookeeper_init(). It took me a while to figure out that authentication actually failed since zoo_add_auth() returned ZOK. It should return ZINVALIDSTATE instead. --Michi -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: What's the QA strategy of ZooKeeper?
Well said Vishal. I really like the points you put forth!!! Agree on all the points, but again, all the point you mention require commitment from folks like you. Its a pretty hard task to test all the corner cases of ZooKeeper. I'd expect everyone to pitch in for testing a release. We should definitely work towards a plan. You should go ahead and create a jira for the QA plan. We should all pitch in with what all should be tested. Thanks mahadev On 10/15/10 7:32 AM, Vishal K vishalm...@gmail.com wrote: Hi, I would like to add my few cents here. I would suggest to stay away from code cleanup unless it is absolutely necessary. I would also like to extend this discussion to understand the amount of testing/QA to be performed before a release. How do we currently qualify a release? Recently, we have ran into issues in ZK that I believe should have caught by some basic testing before the release. I will be honest in saying that, unfortunately, these bugs have resulted in questions being raised by several people in our organization about our choice of using ZooKeeper. Nevertheless, our product group really thinks that ZK is a cool technology, but we need to focus on making it robust before adding major new features to it. I would suggest to: 1. Look at current bugs and see why existing test did not uncover these bugs and improve those tests. 2. Look at places that need more tests and broadcast it to the community. Follow-up with test development. 3. Have a crisp release QA strategy for each release. 4. Improve API documentation as well as code documentation so that the API usage is clear and debugging is made easier. Comments? Thanks. -Vishal On Fri, Oct 15, 2010 at 9:44 AM, Thomas Koch tho...@koch.ro wrote: Hi Benjamin, thank you for your response. Please find some comments inline. Benjamin Reed: code quality is important, and there are things we should keep in mind, but in general i really don't like the idea of risking code breakage because of a gratuitous code cleanup. we should be watching out for these things when patches get submitted or when new things go in. I didn't want to say it that clear, but especially the new Netty code, both on client and server side is IMHO an example of new code in very bad shape. The client code patch even changes the FindBugs configuration to exclude the new code from the FindBugs checks. i think this is inline with what pat was saying. just to expand a bit. in my opinion clean up refactorings have the following problems: 1) you risk breaking things in production for a potential future maintenance advantage. If your code is already in such a bad shape, that every change includes considerable risk to break something, then you already are in trouble. With every new feature (or bugfix!) you also risk to break something. If you don't have the attitude of permanent refactoring to improve the code quality, you will inevitably lower the maintainability of your code with every new feature. New features will build on the dirty concepts already in the code and therfor make it more expensive to ever clean things up. 2) there is always subjectivity: quality code for one code quality zealot is often seen as a bad design by another code quality zealot. unless there is an objective reason to do it, don't. I don't agree. IMHO, the area of subjectivism in code quality is actually very small compared to hard established standards of quality metrics and best practices. I believe that my list given in the first mail of this thread gives examples of rather objective guidelines. 
3) you may cleanup the wrong way. you may restructure to make the current code clean and then end up rewriting and refactoring again to change the logic. Yes. Refactoring isn't easy, but necessary. Only over time you better understand your domain and find better structures. Over time you introduce features that let code grow so that it should better be split up in smaller units that the human brain can still handle. i think we can mitigate 1) by only doing it when necessary. as a corollary we can mitigate 2) and 3) by only doing refactoring/cleanups when motivated by some new change: fix a bug, increased performance, new feature, etc. I agree that refactoring should be carefully planned and done in small steps. Therefor I collected each refactoring item for the java client in a small separate bug in https://issues.apache.org/jira/browse/ZOOKEEPER-835 that can individually be discussed, reviewed and tested. Have a nice weekend after Hadoop World! Thomas Koch, http://www.koch.ro
Re: Restarting discussion on ZooKeeper as a TLP
+1 for moving to TLP. Thanks for starting the vote Pat. mahadev On 10/13/10 2:10 PM, Patrick Hunt ph...@apache.org wrote: In March of this year we discussed a request from the Apache Board, and Hadoop PMC, that we become a TLP rather than a subproject of Hadoop: Original discussion http://markmail.org/thread/42cobkpzlgotcbin I originally voted against this move, my primary concern being that we were not ready to move to tlp status given our small contributor base and limited contributor diversity. However I'd now like to revisit that discussion/decision. Since that time the team has been working hard to attract new contributors, and we've seen significant new contributions come in. There has also been feedback from board/pmc addressing many of these concerns (both on the list and in private). I am now less concerned about this issue and don't see it as a blocker for us to move to TLP status. A second concern was that by becoming a TLP the project would lose it's connection with Hadoop, a big source of new users for us. I've been assured (and you can see with the other projects that have moved to tlp status; pig/hive/hbase/etc...) that this connection will be maintained. The Hadoop ZooKeeper tab for example will redirect to our new homepage. Other Apache members also pointed out to me that we are essentially operating as a TLP within the Hadoop PMC. Most of the other PMC members have little or no experience with ZooKeeper and this makes it difficult for them to monitor and advise us. By moving to TLP status we'll be able to govern ourselves and better set our direction. I believe we are ready to become a TLP. Please respond to this email with your thoughts and any issues. I will call a vote in a few days, once discussion settles. Regards, Patrick
[jira] Commented: (ZOOKEEPER-823) update ZooKeeper java client to optionally use Netty for connections
[ https://issues.apache.org/jira/browse/ZOOKEEPER-823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12921053#action_12921053 ] Mahadev konar commented on ZOOKEEPER-823: - I am +1 with creating a branch for this issue and keep working on it until we have all the issues figured out and then merge this back in. update ZooKeeper java client to optionally use Netty for connections Key: ZOOKEEPER-823 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-823 Project: Zookeeper Issue Type: New Feature Components: java client Reporter: Patrick Hunt Assignee: Patrick Hunt Fix For: 3.4.0 Attachments: NettyNettySuiteTest.rtf, TEST-org.apache.zookeeper.test.NettyNettySuiteTest.txt.gz, testDisconnectedAddAuth_FAILURE, testWatchAutoResetWithPending_FAILURE, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch, ZOOKEEPER-823.patch This jira will port the client side connection code to use netty rather than direct nio. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918112#action_12918112 ] Mahadev konar commented on ZOOKEEPER-804: - +1 for the patch. Pat can you please try it out and see it fixes the test cases? I will go ahead and commit if it passes for you. c unit tests failing due to assertion cptr failed --- Key: ZOOKEEPER-804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.4.0 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel) Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-804.patch I'm seeing this frequently: [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed. [exec] make: *** [run-check] Aborted [exec] Zookeeper_simpleSystem::testHangingClient Mahadev can you take a look? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-767) Submitting Demo/Recipe Shared / Exclusive Lock Code
[ https://issues.apache.org/jira/browse/ZOOKEEPER-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918113#action_12918113 ] Mahadev konar commented on ZOOKEEPER-767: - really good to see this. Ill try and review the code as soon as possible. Submitting Demo/Recipe Shared / Exclusive Lock Code --- Key: ZOOKEEPER-767 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-767 Project: Zookeeper Issue Type: Improvement Components: recipes Affects Versions: 3.3.0 Reporter: Sam Baskinger Assignee: Sam Baskinger Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch, ZOOKEEPER-767.patch Time Spent: 8h Networked Insights would like to share-back some code for shared/exclusive locking that we are using in our labs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure
[ https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918144#action_12918144 ] Mahadev konar commented on ZOOKEEPER-820: - this does fix the if problem in the script (which took me some time to realise was backwards :) ). Michi, can you also add the lsof check as follows:
{code}
if which lsof > /dev/null 2>&1; then
    : # run the command to kill the process using lsof
else
    : # lsof not available; we don't do anything
fi
{code}
Note that this is in addition to using pid checks. We can do the following on stop server: - kill using the pid file first - kill using lsof, if any such process is present. update c unit tests to ensure zombie java server processes don't cause failure Key: ZOOKEEPER-820 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch, ZOOKEEPER-820.patch When the c unit tests are run sometimes the server doesn't shutdown at the end of the test, this causes subsequent tests (hudson esp) to fail. 1) we should try harder to make the server shut down at the end of the test, I suspect this is related to test failing/cleanup 2) before the tests are run we should see if the old server is still running and try to shut it down -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-804) c unit tests failing due to assertion cptr failed
[ https://issues.apache.org/jira/browse/ZOOKEEPER-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-804: Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) I just committed this. thanks michi! c unit tests failing due to assertion cptr failed --- Key: ZOOKEEPER-804 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-804 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.4.0 Environment: gcc 4.4.3, ubuntu lucid lynx, dual core laptop (intel) Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-804.patch I'm seeing this frequently: [exec] Zookeeper_simpleSystem::testPing : elapsed 18006 : OK [exec] Zookeeper_simpleSystem::testAcl : elapsed 1022 : OK [exec] Zookeeper_simpleSystem::testChroot : elapsed 3145 : OK [exec] Zookeeper_simpleSystem::testAuth ZooKeeper server started : elapsed 25687 : OK [exec] zktest-mt: /home/phunt/dev/workspace/gitzk/src/c/src/zookeeper.c:1952: zookeeper_process: Assertion `cptr' failed. [exec] make: *** [run-check] Aborted [exec] Zookeeper_simpleSystem::testHangingClient Mahadev can you take a look? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-820) update c unit tests to ensure zombie java server processes don't cause failure
[ https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909306#action_12909306 ] Mahadev konar commented on ZOOKEEPER-820: - The only comment I have is that these scripts might not work on cygwin. Let me try and check how lsof works on cygwin windows. update c unit tests to ensure zombie java server processes don't cause failure Key: ZOOKEEPER-820 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Patrick Hunt Assignee: Michi Mutsuzaki Priority: Critical Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-820-1.patch, ZOOKEEPER-820.patch When the c unit tests are run sometimes the server doesn't shutdown at the end of the test, this causes subsequent tests (hudson esp) to fail. 1) we should try harder to make the server shut down at the end of the test, I suspect this is related to test failing/cleanup 2) before the tests are run we should see if the old server is still running and try to shut it down -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-870) Zookeeper trunk build broken.
Zookeeper trunk build broken. - Key: ZOOKEEPER-870 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-870 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 the zookeeper current trunk build is broken mostly due to some netty changes. This is causing a huge backlog of PA's and other impediments to the review process. For now I plan to disable the test and fix them as part of 3.4 later. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-871) ClientTest testClientCleanup is failing due to high fd count.
ClientTest testClientCleanup is failing due to high fd count. - Key: ZOOKEEPER-871 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-871 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Priority: Blocker Fix For: 3.4.0 The fd count has increased. The tests are repeatedly failing on hudson machines. I think this is probably related to the netty server changes. We have to fix this before we release 3.4. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-870) Zookeeper trunk build broken.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909506#action_12909506 ] Mahadev konar commented on ZOOKEEPER-870: - Testcase: testHammer took 81.115 sec FAILED node count not consistent expected:<1771> but was:<0> junit.framework.AssertionFailedError: node count not consistent expected:<1771> but was:<0> at org.apache.zookeeper.test.ClientBase.verifyRootOfAllServersMatch(ClientBase.java:581) at org.apache.zookeeper.test.AsyncHammerTest.testHammer(AsyncHammerTest.java:190) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:51) The test case [junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 107.528 sec [junit] Test org.apache.zookeeper.test.NioNettySuiteHammerTest FAILED also fails. Zookeeper trunk build broken. - Key: ZOOKEEPER-870 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-870 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 the zookeeper current trunk build is broken mostly due to some netty changes. This is causing a huge backlog of PA's and other impediments to the review process. For now I plan to disable the test and fix them as part of 3.4 later. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-871) ClientTest testClientCleanup is failing due to high fd count.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909519#action_12909519 ] Mahadev konar commented on ZOOKEEPER-871: - Also: {code} Testcase: testHammer took 81.115 sec FAILED node count not consistent expected:<1771> but was:<0> junit.framework.AssertionFailedError: node count not consistent expected:<1771> but was:<0> at org.apache.zookeeper.test.ClientBase.verifyRootOfAllServersMatch(ClientBase.java:581) at org.apache.zookeeper.test.AsyncHammerTest.testHammer(AsyncHammerTest.java:190) [junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 107.528 sec [junit] Test org.apache.zookeeper.test.NioNettySuiteHammerTest FAILED {code} ClientTest testClientCleanup is failing due to high fd count. - Key: ZOOKEEPER-871 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-871 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Priority: Blocker Fix For: 3.4.0 The fd count has increased. The tests are repeatedly failing on hudson machines. I think this is probably related to the netty server changes. We have to fix this before we release 3.4. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-870) Zookeeper trunk build broken.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-870: Attachment: ZOOKEEPER-870.patch This patch ignores the following assertions for now: - the count of fds in ClientTest - the count of nodes in ClientBase These changes will be reverted in ZOOKEEPER-871 before 3.4 is released. This is just a patch to get the patch process running so that review is done on time. Zookeeper trunk build broken. - Key: ZOOKEEPER-870 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-870 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 Attachments: ZOOKEEPER-870.patch the zookeeper current trunk build is broken mostly due to some netty changes. This is causing a huge backlog of PA's and other impediments to the review process. For now I plan to disable the test and fix them as part of 3.4 later. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-870) Zookeeper trunk build broken.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-870: Status: Patch Available (was: Open) Zookeeper trunk build broken. - Key: ZOOKEEPER-870 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-870 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 Attachments: ZOOKEEPER-870.patch the zookeeper current trunk build is broken mostly due to some netty changes. This is causing a huge backlog of PA's and other impediments to the review process. For now I plan to disable the test and fix them as part of 3.4 later. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-870) Zookeeper trunk build broken.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-870: Attachment: ZOOKEEPER-870.patch updated patch with comments incorporated. Zookeeper trunk build broken. - Key: ZOOKEEPER-870 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-870 Project: Zookeeper Issue Type: Bug Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 Attachments: ZOOKEEPER-870.patch, ZOOKEEPER-870.patch the zookeeper current trunk build is broken mostly due to some netty changes. This is causing a huge backlog of PA's and other impediments to the review process. For now I plan to disable the test and fix them as part of 3.4 later. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Master error
Hi Ngo, If the leader fails, someone else in the quorum is elected as the leader. You might want to read through the documentation at: http://hadoop.apache.org/zookeeper/docs/r3.2.0/ Thanks mahadev On 9/14/10 6:00 PM, Ngô Văn Vĩ ngovi.se@gmail.com wrote: Hi All, if the Master fails, which server becomes the leader? Thanks -- Ngô Văn Vĩ Công Nghệ Phần Mềm Phone: 01695893851
[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909558#action_12909558 ] Mahadev konar commented on ZOOKEEPER-822: - visha, flavio, If there is just one thread running at one point in time, then its ok. Also, I am really worried about the code structure in LeaderElection.java. Its ok to have a temporary fix, but it would be great to see some commitment from someone on doing it right in 3.4. Leader election taking a long time to complete --- Key: ZOOKEEPER-822 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.0 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, ZOOKEEPER-822.patch_v1 Created a 3 node cluster. 1 Fail the ZK leader 2. Let leader election finish. Restart the leader and let it join the 3. Repeat After a few rounds leader election takes anywhere 25- 60 seconds to finish. Note- we didn't have any ZK clients and no new znodes were created. zoo.cfg is shown below: #Mon Jul 19 12:15:10 UTC 2010 server.1=192.168.4.12\:2888\:3888 server.0=192.168.4.11\:2888\:3888 clientPort=2181 dataDir=/var/zookeeper syncLimit=2 server.2=192.168.4.13\:2888\:3888 initLimit=5 tickTime=2000 I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyways so logs from that node shouldn't matter. Look for START HERE. Logs after that point should be of our interest. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909592#action_12909592 ] Mahadev konar commented on ZOOKEEPER-822: - vishal, I was expecting some commitment from you for making it use a selector :). Leader election taking a long time to complete --- Key: ZOOKEEPER-822 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.0 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz, ZOOKEEPER-822.patch_v1 Created a 3 node cluster. 1 Fail the ZK leader 2. Let leader election finish. Restart the leader and let it join the 3. Repeat After a few rounds leader election takes anywhere 25- 60 seconds to finish. Note- we didn't have any ZK clients and no new znodes were created. zoo.cfg is shown below: #Mon Jul 19 12:15:10 UTC 2010 server.1=192.168.4.12\:2888\:3888 server.0=192.168.4.11\:2888\:3888 clientPort=2181 dataDir=/var/zookeeper syncLimit=2 server.2=192.168.4.13\:2888\:3888 initLimit=5 tickTime=2000 I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyways so logs from that node shouldn't matter. Look for START HERE. Logs after that point should be of our interest. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Review Request: ZOOKEEPER-823
Hi Thomas, I do have that on my list. I probably will be doing it by this weekend for sure. Thanks mahadev On 9/8/10 9:16 AM, Thomas Koch tho...@koch.ro wrote: Hi Ben, Mahadev, Patrick suggested I might ask you for review on ZOOKEEPER-823. It does some refactoring on ClientCnxn and blocks other issues which also want to edit ClientCnxn. Thanks, Thomas Koch, http://www.koch.ro
Re: Review Request: ZOOKEEPER-823
I just reviewed the patch. You can go ahead and commit it. I am going to run the ant test now. Thanks mahadev On 9/8/10 9:23 AM, Patrick Hunt ph...@apache.org wrote: Hudson trunk is currently failing due to some fd cleanup issue. Not sure if I introduced that recently with the netty server change, however it's showing up along with an intermittent failure in asynchammertest. This is keeping the patch builds from running. If you guys could help with that as well it would be great. I put up a patch but someone needs to commit it (or at least +1 it) https://issues.apache.org/jira/browse/ZOOKEEPER-867 I can't reproduce either of these issues myself. I tried multiple machine types and also used the same vm as is being used on hudson (jdk1.6.0_11) with no luck reproducing. Can you guys take a look? https://issues.apache.org/jira/browse/ZOOKEEPER-867Patrick On Wed, Sep 8, 2010 at 9:16 AM, Thomas Koch tho...@koch.ro wrote: Hi Ben, Mahadev, Patrick suggested I might ask you for review on ZOOKEEPER-823. It does some refactoring on ClientCnxn and blocks other issues which also want to edit ClientCnxn. Thanks, Thomas Koch, http://www.koch.ro
[jira] Updated: (ZOOKEEPER-861) Missing the test SSL certificate used for running junit tests.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-861: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed I just committed this. thanks erwin! Missing the test SSL certificate used for running junit tests. -- Key: ZOOKEEPER-861 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-861 Project: Zookeeper Issue Type: Bug Components: contrib-hedwig Reporter: Erwin Tam Assignee: Erwin Tam Priority: Minor Fix For: 3.4.0 Attachments: server.p12, ZOOKEEPER-861.patch The Hedwig code checked into Apache is missing a test SSL certificate file used for running the server junit tests. We need this file otherwise the tests that use this (e.g. TestHedwigHub) will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-866) Adding no disk persistence option in zookeeper.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906521#action_12906521 ] Mahadev konar commented on ZOOKEEPER-866: - Thomas, we have spent the last 3 years optimizing the throughput and latency of zookeeper. I think we have reached the point of minimal returns with this. I agree that on the usability front you do have a point. But making it usable is orthogonal to what I propose over here. Both can take different directions. I am just trying to open a new area of usage for zookeeper. Does that make sense? Adding no disk persistence option in zookeeper. --- Key: ZOOKEEPER-866 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-866 Project: Zookeeper Issue Type: New Feature Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 Attachments: ZOOKEEPER-nodisk.patch It's been seen that some folks would like to use zookeeper for very fine grained locking. Also, in their use case they are fine with losing all old zookeeper state if they reboot zookeeper or zookeeper goes down. The use case is more one of runtime locking, wherein forgetting the state of locks is acceptable in case of a zookeeper reboot. Not logging to disk allows high throughput and low latency on writes to zookeeper. This would be a configuration option to set (of course the default would be logging to disk). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: High WTF count in ZooKeeper client code
I was able to get hold of one of the hadoop developers. So the gist of the story is: they have interface tagging, saying something like @Audience.limitedPrivate(target=pig), wherein this interface is defined for pig and is only to be used by pig folks. Interfaces can be defined as public, stable, unstable, .. This is quite useful, but given our interfaces haven't changed in a long time this might not be that helpful for us. Thanks mahadev On 8/31/10 3:47 PM, Mahadev Konar maha...@yahoo-inc.com wrote: There isn't any documentation on the interface tagging other than the running comments. I will try to get hold of one of the hadoop folks to get me a dump of the info and will create a jira! Thanks mahadev On 8/11/10 9:56 AM, Patrick Hunt ph...@apache.org wrote: wrt defining interface stability we should adopt something like hadoop is now doing: https://issues.apache.org/jira/browse/HADOOP-5073 Mahadev, do you know if this is documented somewhere? final documentation, rather than the running commentary that's on this jira? We could adopt something similar/same. Can you create a jira for that? Patrick On 08/11/2010 08:23 AM, Thomas Koch wrote: Hello Mahadev, thank you for your nice answer. Yes, we'll of course preserve compatibility. Otherwise there is no chance to get accepted. I assume the following things must keep their interfaces: ZooKeeper (it'll call the new interface in the background), AsyncCallback, Watcher. We may want to change: ClientCnxn (factor out some things, remove the dep on ZooKeeper). I think other classes should not be involved at all in our issues. My colleague Patrick was so kind as to file the jira issues. Best regards, Thomas Mahadev Konar: Also, I am assuming you have backwards compatibility in mind when you suggest these changes, right? The interfaces of the zookeeper client should not be changing as part of this, though the recursive delete hasn't been introduced yet (it's only available in 3.4, so we can move it out into a helper class). Thanks mahadev On 8/11/10 7:40 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Thomas, I read through the list of issues you posted, most of them seem reasonable to fix. The ones you have mentioned below might take quite a bit of time to fix and again a lot of testing! (just a warning :)). It would be great if you'd want to clean this up for 3.4. Please go ahead and file a jira. These improvements would be good to have in the zookeeper java client. For deleteRecursive, I definitely agree that it should be a helper class. I don't believe it should be in the direct zookeeper api! Thanks mahadev On 8/11/10 2:45 AM, Thomas Koch tho...@koch.ro wrote: Hi, I started yesterday to work on my idea of an alternative ZooKeeper client interface.[1] Instead of methods on a ZooKeeper class, a user should instantiate an Operation (Create, Delete, ...) and forward it to an Executor which handles session loss errors and the like. By doing that, I got shocked by the sheer number of WTF issues I found. I'm sorry for ranting now, but it gets quicker to the point. - Hostlist as string The hostlist is parsed in the ctor of ClientCnxn. This violates the rule of not doing (too much) work in a ctor. Instead the ClientCnxn should receive an object of class HostSet. HostSet could then be instantiated e.g. with a comma separated string. - cyclic dependency ClientCnxn, ZooKeeper ZooKeeper instantiates ClientCnxn in its ctor with this and therefore builds a cyclic dependency graph between both objects. This means, you can't have the one without the other.
So why did you bother to make them two separate classes in the first place? ClientCnxn accesses ZooKeeper.state. State should rather be a property of ClientCnxn. And ClientCnxn accesses zooKeeper.get???Watches() in its method primeConnection(). I've not yet checked how this dependency could best be resolved. - Chroot is an attribute of ClientCnxn I'd like to have one process that uses ZooKeeper for different things (managing a list of work, locking some unrelated locks elsewhere). So I have components that do this work inside the same process. These components should get the same zookeeper-client reference chroot'ed for their needs. So it'd be much better if ClientCnxn did not care about the chroot. - deleteRecursive does not belong with the other methods DeleteRecursive has been committed to trunk already as a method on the ZooKeeper class. So in the API it sits at the same level as the atomic operations create, delete, getData, setData, etc. The user must get the false impression that deleteRecursive is also an atomic operation. It would be better to have deleteRecursive in some helper class, but not that deep in zookeeper's core code. Maybe I'd like to have another policy on how to react if deleteRecursive fails in the middle of its work? - massive code duplication
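For readers unfamiliar with the interface tagging Mahadev describes at the top of the thread above, Hadoop's annotations (from the org.apache.hadoop.classification package introduced under HADOOP-5073) are applied roughly like this. The class and the "Pig" audience below are purely illustrative, and whether ZooKeeper would reuse these annotations or define its own equivalents was left open in the discussion:
{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Illustration only: an API intended for one named downstream project,
// which may still change between minor releases.
@InterfaceAudience.LimitedPrivate({"Pig"})
@InterfaceStability.Evolving
public class ExampleLimitedPrivateApi {
    public void doSomething() {
        // callers outside the "Pig" audience should not depend on this method
    }
}
{code}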
[jira] Created: (ZOOKEEPER-866) Adding no disk persistence option in zookeeper.
Adding no disk persistence option in zookeeper. --- Key: ZOOKEEPER-866 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-866 Project: Zookeeper Issue Type: New Feature Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 It's been seen that some folks would like to use zookeeper for very fine grained locking. Also, in their use case they are fine with losing all old zookeeper state if they reboot zookeeper or zookeeper goes down. The use case is more one of runtime locking, wherein forgetting the state of locks is acceptable in case of a zookeeper reboot. Not logging to disk allows high throughput and low latency on writes to zookeeper. This would be a configuration option to set (of course the default would be logging to disk). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-866) Adding no disk persistence option in zookeeper.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-866: Attachment: ZOOKEEPER-nodisk.patch Here is a patch that I had worked on. This is not a complete patch since it actually changes the code and doesn't make it a configurable option. I will be creating another patch wherein there is a configuration option for this. Feedback is welcome. Adding no disk persistence option in zookeeper. --- Key: ZOOKEEPER-866 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-866 Project: Zookeeper Issue Type: New Feature Reporter: Mahadev konar Assignee: Mahadev konar Fix For: 3.4.0 Attachments: ZOOKEEPER-nodisk.patch It's been seen that some folks would like to use zookeeper for very fine grained locking. Also, in their use case they are fine with losing all old zookeeper state if they reboot zookeeper or zookeeper goes down. The use case is more one of runtime locking, wherein forgetting the state of locks is acceptable in case of a zookeeper reboot. Not logging to disk allows high throughput and low latency on writes to zookeeper. This would be a configuration option to set (of course the default would be logging to disk). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Problems in FLE implementation
Hi Vishal, Thanks for picking this up. My comments are inline: On 9/2/10 3:31 PM, Vishal K vishalm...@gmail.com wrote: Hi All, I had posted this message as a comment for ZOOKEEPER-822. I thought it might be a good idea to give a wider attention so that it will be easier to collect feedback. I found few problems in the FLE implementation while debugging for: https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email below might require some background. If necessary, please browse the JIRA. I have a patch for 1. a) and 2). I will send them out soon. 1. Blocking connects and accepts: a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires, otherwise, when the timer expires it interrupts AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective peer. Vishal, I am really concerned about starting up new threads in the server. We really need a total revamp of this code (using NIO and selector). Is the quick fix really required. Zookeeper servers have been running in production for a while, and this problem hasn't been noticed by anyone. Shouldn't we fix it with NIO then? b) The blocking IO problem is not just restricted to connectOne(), but also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connetOne(). Also the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify. 2. Buggy senderWorkerMap handling: The code that manages senderWorkerMap is very buggy. It is causing multiple election rounds. While debugging I found that sometimes after FLE a node will have its sendWorkerMap empty even if it has SenderWorker and RecvWorker threads for each peer. IT would be great to clean it up!! I'd be happy to see this class be cleaned up! :) a) The receiveConnection() method calls the finish() method, which removes an entry from the map. Additionally, the thread itself calls finish() which could remove the newly added entry from the map. In short, receiveConnection is causing the exact condition that you mentioned above. b) Apart from the bug in finish(), receiveConnection is making an entry in senderWorkerMap at the wrong place. 
Here's the buggy code:
SendWorker vsw = senderWorkerMap.get(sid); senderWorkerMap.put(sid, sw); if(vsw != null) vsw.finish();
It makes an entry for the new thread and then calls finish, which causes the new thread to be removed from the Map. The old thread will also get terminated since finish() will interrupt the thread.
3. Race condition in receiveConnection and initiateConnection: *In theory*, two peers can keep disconnecting each other's connection. Example:
T0: Peer 0 initiates a connection (request 1)
T1: Peer 1 receives connection from peer 0
T2: Peer 1 calls receiveConnection()
T2: Peer 0 closes connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates connection to Peer 1 from manager.toSend() (request 2)
T3: Peer 1 terminates older connection to peer 0
T4: Peer 1 calls connectOne() which starts new sendWorker threads for peer 0
T5: Peer 1 kills connection created in T3 because it receives another (request 2) connect request from 0
The problem here is that while Peer 0 is accepting a connection from Peer 1 it can also be initiating a connection to Peer 1. So if they hit the right frequencies they could sit in a connect/disconnect loop and cause multiple rounds of leader election. I think
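A minimal sketch of one way the senderWorkerMap handling above could be untangled, assuming the map is a ConcurrentHashMap<Long, SendWorker>; this is only an illustration of the idea, not the fix that was eventually committed. The old worker is retired before the new one is registered, and finish() removes the map entry only when it still points at the finishing instance, so a late finish() can no longer evict a freshly installed worker.
{code}
// Sketch only, assuming senderWorkerMap is a ConcurrentHashMap<Long, SendWorker>.
SendWorker vsw = senderWorkerMap.get(sid);
if (vsw != null) {
    vsw.finish();                 // shut the old worker down first...
}
senderWorkerMap.put(sid, sw);     // ...then register the new one

// and inside SendWorker.finish(), guard the removal so a stale finish()
// cannot evict a newer worker registered under the same sid:
senderWorkerMap.remove(sid, this);   // ConcurrentMap.remove(key, value)
{code}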
[jira] Commented: (ZOOKEEPER-864) Hedwig C++ client improvements
[ https://issues.apache.org/jira/browse/ZOOKEEPER-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906102#action_12906102 ] Mahadev konar commented on ZOOKEEPER-864: - michi to answer your question, all we need is a careful review. Hedwig C++ client improvements -- Key: ZOOKEEPER-864 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-864 Project: Zookeeper Issue Type: Improvement Reporter: Ivan Kelly Assignee: Ivan Kelly Fix For: 3.4.0 Attachments: ZOOKEEPER-864.diff I changed the socket code to use boost asio. Now the client only creates one thread, and all operations are non-blocking. Tests are now automated, just run make check. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-822: Assignee: Vishal K Fix Version/s: 3.3.2 3.4.0 Marking this for 3.3.2, to see if we want this included in 3.3.2. Leader election taking a long time to complete --- Key: ZOOKEEPER-822 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.0 Reporter: Vishal K Assignee: Vishal K Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log, test_zookeeper_2.log, zk_leader_election.tar.gz, zookeeper-3.4.0.tar.gz Created a 3 node cluster.
1. Fail the ZK leader
2. Let leader election finish. Restart the leader and let it join the cluster
3. Repeat
After a few rounds leader election takes anywhere from 25-60 seconds to finish. Note: we didn't have any ZK clients and no new znodes were created. zoo.cfg is shown below:
#Mon Jul 19 12:15:10 UTC 2010
server.1=192.168.4.12\:2888\:3888
server.0=192.168.4.11\:2888\:3888
clientPort=2181
dataDir=/var/zookeeper
syncLimit=2
server.2=192.168.4.13\:2888\:3888
initLimit=5
tickTime=2000
I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyway, so logs from that node shouldn't matter. Look for START HERE. Logs after that point should be of interest. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-860) Add alternative search-provider to ZK site
[ https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906223#action_12906223 ] Mahadev konar commented on ZOOKEEPER-860: - alex, the assignment just means that you are working on the patch currently. A committer will review and provide you feedback or commit if deemed fit for the project. Hope that helps. Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Alex Baranau Assignee: Alex Baranau Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-860.patch Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already added in site's skin (common for all Hadoop related projects) before (as a part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-860) Add alternative search-provider to ZK site
[ https://issues.apache.org/jira/browse/ZOOKEEPER-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-860: Fix Version/s: 3.4.0 marking it for 3.4 for keeping track. Add alternative search-provider to ZK site -- Key: ZOOKEEPER-860 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-860 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Alex Baranau Assignee: Alex Baranau Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-860.patch Use search-hadoop.com service to make available search in ZK sources, MLs, wiki, etc. This was initially proposed on user mailing list (http://search-hadoop.com/m/sTZ4Y1BVKWg1). The search service was already added in site's skin (common for all Hadoop related projects) before (as a part of [AVRO-626|https://issues.apache.org/jira/browse/AVRO-626]) so this issue is about enabling it for ZK. The ultimate goal is to use it at all Hadoop's sub-projects' sites. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: race condition in InvalidSnapShotTest on client close
Hi Thomas, Sorry for my late response. Please open a jira regarding this. Is this fixed in your netty patch for the client? Thanks mahadev On 9/1/10 9:09 AM, Thomas Koch tho...@koch.ro wrote: Hi, I believe that I've found a race condition in org.apache.zookeeper.server.InvalidSnapshotTest. In this test the server is closed before the client. The client, on close(), submits a last packet with type ZooDefs.OpCode.closeSession and waits for this packet to be finished. However, nobody is there to wake the thread from packet.wait(). The sendThread would, on cleanup, call packet.notifyAll() in finishPacket. The race condition is: if an exception occurs in the sendThread, closing is already true, so the sendThread breaks out of its loop, calls cleanup and finishes. If this happens before the main thread calls packet.wait(), then there's nobody left to wake the main thread. Regards, Thomas Koch, http://www.koch.ro
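A sketch of the guarded-wait pattern that closes this kind of lost-wakeup window, assuming the packet carries a boolean finished flag that is set under the packet's lock before notifyAll() is called. This only illustrates the pattern, not the patch that went into the client.
{code}
// Waiter side: the flag is re-checked before every wait, so a notification
// that has already happened can never be missed.
synchronized (packet) {
    while (!packet.finished) {
        packet.wait();
    }
}

// Send-thread / cleanup side: set the flag and notify under the same lock.
synchronized (packet) {
    packet.finished = true;
    packet.notifyAll();
}
{code}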
Re: High WTF count in ZooKeeper client code
There isnt any documentation on the interface tagging other than the running comments. I will try to get hold of one of the hadoop folks to get me a dump of the info and will create a jira! Thanks mahadev On 8/11/10 9:56 AM, Patrick Hunt ph...@apache.org wrote: wrt defining interface stability we should adopt something like hadoop is now doing: https://issues.apache.org/jira/browse/HADOOP-5073 Mahadev, do you know if this is documented somewhere? final documentation, rather than the running commentary thats on this jira? We could adopt something similar/same. Can you create a jira for that? Patrick On 08/11/2010 08:23 AM, Thomas Koch wrote: Hallo Mahadev, thank you for your nice answer. Yes, we'll of cause preserve compatibility. Otherwise there is no chance to get accepted. I assume the following things must keep their interfaces: ZooKeeper (It'll call the new interface in the background), ASyncCallback, Watcher We may want to change: ClientCnxn (faktor out some things, remove dep on ZooKeeper) I think other classes should not be involved at all in our issues. My collegue Patrick was so kind to fill the jira issues. Best regards, Thomas Mahadev Konar: Also, I am assuming you have backwards compatability in mind when you suggest these changes right? The interfaces of zookeeper client should not be changing as part of this, though the recursive delete hasn't been introduced yet (its only available in 3.4, so we can move it out into a helper class). Thanks mahadev On 8/11/10 7:40 AM, Mahadev Konarmaha...@yahoo-inc.com wrote: HI Thomas, I read through the list of issues you posted, most of them seem reasonable to fix. The one's you have mentioned below might take quite a bit of time to fix and again a lot of testing! (just a warning :)). It would be great if you'd want to clean this up for 3.4. Please go ahead and file a jira. These improvements would be good to have in the zookeeper java client. For deleteRecursive, I definitely agree that it should be a helper class. I don't believe it should be in the direct zookeeper api! Thanks mahadev On 8/11/10 2:45 AM, Thomas Kochtho...@koch.ro wrote: Hi, I started yesterday to work on my idea of an alternative ZooKeeper client interface.[1] Instead of methods on a ZooKeeper class, a user should instantiate an Operation (Create, Delete, ...) and forward it to an Executor which handles session loss errors and alikes. By doing that, I got shocked by the sheer number of WTF issues I found. I'm sorry for ranting now, but it gets quicker to the poing. - Hostlist as string The hostlist is parsed in the ctor of ClientCnxn. This violates the rule of not doing (too much) work in a ctor. Instead the ClientCnxn should receive an object of class HostSet. HostSet could then be instantiated e.g. with a comma separated string. - cyclic dependency ClientCnxn, ZooKeeper ZooKeeper instantiates ClientCnxn in its ctor with this and therefor builds a cyclic dependency graph between both objects. This means, you can't have the one without the other. So why did you bother do make them to separate classes in the first place? ClientCnxn accesses ZooKeeper.state. State should rather be a property of ClientCnxn. And ClientCnxn accesses zooKeeper.get???Watches() in its method primeConnection(). I've not yet checked, how this dependency should be resolved better. - Chroot is an attribute of ClientCnxn I'd like to have one process that uses ZooKeeper for different things (managing a list of work, locking some unrelated locks elsewhere). 
So I've components that do this work inside the same process. These components should get the same zookeeper-client reference chroot'ed for their needs. So it'd be much better, if the ClientCnxn would not care about the chroot. - deleteRecursive does not belong to the other methods DeleteRecursive has been committed to trunk already as a method to the zookeeper class. So in the API it has the same level as the atomic operations create, delete, getData, setData, etc. The user must get the false impression, that deleteRecursive is also an atomic operation. It would be better to have deleteRecursive in some helper class but not that deep in zookeeper's core code. Maybe I'd like to have another policy on how to react if deleteRecursive fails in the middle of its work? - massive code duplication in zookeeper class Each operation calls validatePath, handles the chroot, calls ClientCnxn and checks the return header for error. I'd like to address this with the operation classes: Each operation should receive a prechecked Path object. Calling ClientCnxn and error checking is not (or only partly) the concern of the operation but of an executor like class. - stat is returned by parameter Since one can return only one value in java it's the only choice to do so. Still it feels more like C then like Java. However with operator classes one could simply
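To make the Operation/Executor idea in this thread concrete, a tiny sketch follows. Every type here is hypothetical (nothing like it exists in the ZooKeeper client); the point is only that retry and session-loss handling move into one executor instead of being repeated around every call.
{code}
// Hypothetical sketch of the proposed shape; none of these types exist in the
// ZooKeeper client code base.
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

interface Operation<T> {
    T execute(ZooKeeper zk) throws KeeperException, InterruptedException;
}

class RetryingExecutor {
    private final ZooKeeper zk;

    RetryingExecutor(ZooKeeper zk) { this.zk = zk; }

    <T> T submit(Operation<T> op) throws KeeperException, InterruptedException {
        while (true) {
            try {
                return op.execute(zk);
            } catch (KeeperException.ConnectionLossException e) {
                // connection-loss handling lives in exactly one place;
                // individual operations and their callers never see it
            }
        }
    }
}
{code}
A caller would wrap, say, a getData call in an Operation object and hand it to submit(), so chroot handling, path validation and error checking could also migrate into that one place.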
[jira] Assigned: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar reassigned ZOOKEEPER-856: --- Assignee: Mahadev konar Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Assignee: Mahadev konar Fix For: 3.4.0 Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is connections are very imbalanced, leading to certain instances having many thousands of connections, while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemaral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-854) BookKeeper does not compile due to changes in the ZooKeeper code
[ https://issues.apache.org/jira/browse/ZOOKEEPER-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903838#action_12903838 ] Mahadev konar commented on ZOOKEEPER-854: - +1 the patch looks good! BookKeeper does not compile due to changes in the ZooKeeper code Key: ZOOKEEPER-854 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-854 Project: Zookeeper Issue Type: Bug Components: contrib-bookkeeper Affects Versions: 3.3.1 Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.0 Attachments: ZOOKEEPER-854.patch, ZOOKEEPER-854.patch BookKeeper does not compile due to changes in the NIOServerCnxn class of ZooKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-846) zookeeper client doesn't shut down cleanly on the close call
[ https://issues.apache.org/jira/browse/ZOOKEEPER-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-846: Fix Version/s: 3.3.2 We should fix this in 3.3.2 if this still exists. zookeeper client doesn't shut down cleanly on the close call Key: ZOOKEEPER-846 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-846 Project: Zookeeper Issue Type: Bug Components: java client Affects Versions: 3.2.2 Reporter: Ted Yu Fix For: 3.3.2 Attachments: rs-13.stack Using HBase 0.20.6 (with HBASE-2473) we encountered a situation where Regionserver process was shutting down and seemed to hang. Here is the bottom of region server log: http://pastebin.com/YYawJ4jA zookeeper-3.2.2 is used. Here is relevant portion from jstack - I attempted to attach jstack twice in my email to d...@hbase.apache.org but failed: DestroyJavaVM prio=10 tid=0x2aabb849c800 nid=0x6c60 waiting on condition [0x] java.lang.Thread.State: RUNNABLE regionserver/10.32.42.245:60020 prio=10 tid=0x2aabb84ce000 nid=0x6c81 in Object.wait() [0x43755000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x2aaab76633c0 (a org.apache.zookeeper.ClientCnxn$Packet) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1099) - locked 0x2aaab76633c0 (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ClientCnxn.close(ClientCnxn.java:1077) at org.apache.zookeeper.ZooKeeper.close(ZooKeeper.java:505) - locked 0x2aaabf5e0c30 (a org.apache.zookeeper.ZooKeeper) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.close(ZooKeeperWrapper.java:681) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:654) at java.lang.Thread.run(Thread.java:619) main-EventThread daemon prio=10 tid=0x43474000 nid=0x6c80 waiting on condition [0x413f3000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x2aaabf6e9150 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:414) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
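Until the underlying close() hang is resolved, applications sometimes bound the damage on their side; the sketch below is purely a hypothetical mitigation (not a fix, and not something ZooKeeper or HBase ships): run close() on a helper thread and cap how long shutdown will wait for it.
{code}
// Hypothetical application-side mitigation; it bounds shutdown time but does
// not address the underlying client bug. zk is the handle being shut down.
final ZooKeeper zkRef = zk;
Thread closer = new Thread(new Runnable() {
    public void run() {
        try {
            zkRef.close();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }
}, "zk-closer");
closer.setDaemon(true);   // a stuck closer cannot keep the JVM alive
closer.start();
closer.join(5000);        // give the clean close at most five seconds
{code}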
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903017#action_12903017 ] Mahadev konar commented on ZOOKEEPER-856: - Travis, we have had a lot of discussion on load balancing. I'd really want to try and see how the disconnect and reconnect works for load balancing. I am also with you that it might be a good enough solution for load balancing. I can upload a simple patch for this. Would you have some bandwidth to try it out and report how well it works? Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is connections are very imbalanced, leading to certain instances having many thousands of connections, while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
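Purely as an illustration of the periodic disconnect/reconnect idea, the sketch below shows a server shedding connections once it holds noticeably more than its fair share. Every name and threshold here is invented, and the eventual patch may look entirely different; the key property is that only TCP connections are dropped, so sessions, ephemeral nodes and watches survive while the affected clients re-attach to another ensemble member.
{code}
import java.io.IOException;
import java.net.Socket;
import java.util.List;

// Hypothetical illustration only; no class or threshold here exists in ZooKeeper.
public class ConnectionShedder {
    /** Close just enough client sockets to get back near the fair share. */
    public static void maybeShed(List<Socket> myClients,
                                 int totalConnections,
                                 int ensembleSize) throws IOException {
        int fairShare = totalConnections / ensembleSize;
        int excess = myClients.size() - (int) (fairShare * 1.2);  // 20% slack, arbitrary
        for (int i = 0; i < excess && !myClients.isEmpty(); i++) {
            myClients.remove(myClients.size() - 1).close();
        }
    }
}
{code}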
[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-856: Fix Version/s: 3.4.0 Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Fix For: 3.4.0 Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is connections are very imbalanced, leading to certain instances having many thousands of connections, while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemaral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Parent nodes multi-step transactions
Hi Gustavo, There was some talk of startTransaction(), addTransaction(), commit() kind of api on the list. Here is the link: http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg08317.html Mostly looked like this wasn¹t on our roadmap for short term, but definitely something to think about longer term. Thanks mahadev On 8/23/10 3:32 PM, Gustavo Niemeyer gust...@niemeyer.net wrote: So, we end up with something like this: A/B A/C/D-0 A/C/D-1 While people are thinking, let me ask this more explicitly: how hard would it be to add multi-step atomic actions to Zookeeper? The interest is specifically to: 1) Avoid intermediate states to be persisted when the client creating the state crashes 2) Avoid intermediate states to be seen while a coordination structure is being put in place I understand that there are tricks which may be used to avoid some of the related problems by dealing with locks, liveness nodes, and side services which monitor and clean up the state, but it'd be fantastic to have some internal support in Zookeeper to make these actions simpler and less error prone. It feels like, given Zookeeper guarantees, it shouldn't be too hard to extend the protocol to offer some basic-level operation grouping (e.g. multi-create and multi-delete, at least). Does that make sense? -- Gustavo Niemeyer http://niemeyer.net http://niemeyer.net/blog http://niemeyer.net/twitter
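For readers wondering what the requested grouping could look like in the Java API, here is a rough sketch in the spirit of the thread. No such call existed when this exchange took place; the shape below roughly matches the multi-operation support that arrived later, but the exact signatures should be treated as illustrative rather than authoritative.
{code}
// Illustrative snippet, given an open ZooKeeper handle zk.
// Uses org.apache.zookeeper.Op, OpResult, CreateMode and ZooDefs.Ids.
List<OpResult> results = zk.multi(Arrays.asList(
        Op.create("/A/C",     new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
        Op.create("/A/C/D-0", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
        Op.create("/A/C/D-1", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
// Either all three znodes exist afterwards or none do, so no client ever
// observes the half-built structure described above.
{code}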
[jira] Updated: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-775: Attachment: ZOOKEEPER-775.patch this patch make the following changes to the sumitted patch. - removes ltmain.sh - removes client/src/main/cpp/m4/ax_jni_include_dir.m4 - remove author names A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-775: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Release Note: A pub sub system using BooKkeeper and ZooKeeper with C++ and Java client bindings. Resolution: Fixed I just committed this. Thanks Ivan, Erwin, Ben. It would be great if you guys can focus on improved documentation for this since it will be critical for adoption of the project. A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899930#action_12899930 ] Mahadev konar commented on ZOOKEEPER-775: - also, should I remove this file: {code} client/src/main/cpp/ltmain.sh {code} what about licensing on these files {code} client/src/main/cpp/m4 client/src/main/cpp/m4/ax_doxygen.m4 client/src/main/cpp/m4/ax_jni_include_dir.m4 {code} is it ok to distribute them via apache? A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-775: Comment: was deleted (was: Mahadev, we've been using the internal hedwig/apache svn branch for making changes, at least until things are committed to Apache's SVN. You can grab the latest changes there and incorporate them perhaps with your modifications. http://svn.corp.yahoo.com/view/yahoo/yrl/hedwig/apache/ Ivan is correct, the p12.pass file is just a dummy password file used for testing SSL connections between client and server. We need that otherwise the unit tests would fail.) A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: High WTF count in ZooKeeper client code
HI Thomas, I read through the list of issues you posted, most of them seem reasonable to fix. The one's you have mentioned below might take quite a bit of time to fix and again a lot of testing! (just a warning :)). It would be great if you'd want to clean this up for 3.4. Please go ahead and file a jira. These improvements would be good to have in the zookeeper java client. For deleteRecursive, I definitely agree that it should be a helper class. I don't believe it should be in the direct zookeeper api! Thanks mahadev On 8/11/10 2:45 AM, Thomas Koch tho...@koch.ro wrote: Hi, I started yesterday to work on my idea of an alternative ZooKeeper client interface.[1] Instead of methods on a ZooKeeper class, a user should instantiate an Operation (Create, Delete, ...) and forward it to an Executor which handles session loss errors and alikes. By doing that, I got shocked by the sheer number of WTF issues I found. I'm sorry for ranting now, but it gets quicker to the poing. - Hostlist as string The hostlist is parsed in the ctor of ClientCnxn. This violates the rule of not doing (too much) work in a ctor. Instead the ClientCnxn should receive an object of class HostSet. HostSet could then be instantiated e.g. with a comma separated string. - cyclic dependency ClientCnxn, ZooKeeper ZooKeeper instantiates ClientCnxn in its ctor with this and therefor builds a cyclic dependency graph between both objects. This means, you can't have the one without the other. So why did you bother do make them to separate classes in the first place? ClientCnxn accesses ZooKeeper.state. State should rather be a property of ClientCnxn. And ClientCnxn accesses zooKeeper.get???Watches() in its method primeConnection(). I've not yet checked, how this dependency should be resolved better. - Chroot is an attribute of ClientCnxn I'd like to have one process that uses ZooKeeper for different things (managing a list of work, locking some unrelated locks elsewhere). So I've components that do this work inside the same process. These components should get the same zookeeper-client reference chroot'ed for their needs. So it'd be much better, if the ClientCnxn would not care about the chroot. - deleteRecursive does not belong to the other methods DeleteRecursive has been committed to trunk already as a method to the zookeeper class. So in the API it has the same level as the atomic operations create, delete, getData, setData, etc. The user must get the false impression, that deleteRecursive is also an atomic operation. It would be better to have deleteRecursive in some helper class but not that deep in zookeeper's core code. Maybe I'd like to have another policy on how to react if deleteRecursive fails in the middle of its work? - massive code duplication in zookeeper class Each operation calls validatePath, handles the chroot, calls ClientCnxn and checks the return header for error. I'd like to address this with the operation classes: Each operation should receive a prechecked Path object. Calling ClientCnxn and error checking is not (or only partly) the concern of the operation but of an executor like class. - stat is returned by parameter Since one can return only one value in java it's the only choice to do so. Still it feels more like C then like Java. However with operator classes one could simply get the result values with getter functions after the execution. - stat calls static method on org.apache.zookeeper.server.DataTree It's a huge jump from client code to the internal server class DataTree. 
Shouldn't there rather be some class related to the protobuffer stat class that knows how to copy a stat? - Session class? Maybe it'd make sense to combine hostlist, sessionId, sessionPassword and sessionTimeout in a Session class so that the ctor of ClientCnxn won't get too long? I may have missed some items. :-) Once again, please excuse my harsh tone. May I put the above issues in jira and would you accept (backwards compatible) patches for it for 3.4.0? Zookeeper is a fascinating project. Cudos to the devs. I've only looked in the client side code, which is what most users of zookeeper will ever see if they see any zookeeper internal code at all. So it may make sense to make this piece of the project as nice and clean as possible. [1] http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper- dev/201005.mbox/%3c201005261509.54236.tho...@koch.ro%3e Best regards, Thomas Koch, http://www.koch.ro
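As a concrete illustration of the "helper class, not core API" suggestion for deleteRecursive made earlier in this mail, a minimal sketch is below. The class and method names are invented and this is not the code that was committed to trunk; keeping it outside the ZooKeeper class makes it harder to mistake for an atomic operation, and callers who want a different failure policy can write their own variant.
{code}
// Sketch of a helper kept outside the core client API.
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public final class ZkPaths {
    private ZkPaths() {}

    /** Deletes path and everything below it, depth-first. Deliberately NOT
     *  atomic: a failure part-way through leaves some children deleted. */
    public static void deleteRecursive(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(path, false);
        for (String child : children) {
            deleteRecursive(zk, path + "/" + child);
        }
        zk.delete(path, -1);   // -1 ignores the node's version
    }
}
{code}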
Re: High WTF count in ZooKeeper client code
Also, I am assuming you have backwards compatability in mind when you suggest these changes right? The interfaces of zookeeper client should not be changing as part of this, though the recursive delete hasn't been introduced yet (its only available in 3.4, so we can move it out into a helper class). Thanks mahadev On 8/11/10 7:40 AM, Mahadev Konar maha...@yahoo-inc.com wrote: HI Thomas, I read through the list of issues you posted, most of them seem reasonable to fix. The one's you have mentioned below might take quite a bit of time to fix and again a lot of testing! (just a warning :)). It would be great if you'd want to clean this up for 3.4. Please go ahead and file a jira. These improvements would be good to have in the zookeeper java client. For deleteRecursive, I definitely agree that it should be a helper class. I don't believe it should be in the direct zookeeper api! Thanks mahadev On 8/11/10 2:45 AM, Thomas Koch tho...@koch.ro wrote: Hi, I started yesterday to work on my idea of an alternative ZooKeeper client interface.[1] Instead of methods on a ZooKeeper class, a user should instantiate an Operation (Create, Delete, ...) and forward it to an Executor which handles session loss errors and alikes. By doing that, I got shocked by the sheer number of WTF issues I found. I'm sorry for ranting now, but it gets quicker to the poing. - Hostlist as string The hostlist is parsed in the ctor of ClientCnxn. This violates the rule of not doing (too much) work in a ctor. Instead the ClientCnxn should receive an object of class HostSet. HostSet could then be instantiated e.g. with a comma separated string. - cyclic dependency ClientCnxn, ZooKeeper ZooKeeper instantiates ClientCnxn in its ctor with this and therefor builds a cyclic dependency graph between both objects. This means, you can't have the one without the other. So why did you bother do make them to separate classes in the first place? ClientCnxn accesses ZooKeeper.state. State should rather be a property of ClientCnxn. And ClientCnxn accesses zooKeeper.get???Watches() in its method primeConnection(). I've not yet checked, how this dependency should be resolved better. - Chroot is an attribute of ClientCnxn I'd like to have one process that uses ZooKeeper for different things (managing a list of work, locking some unrelated locks elsewhere). So I've components that do this work inside the same process. These components should get the same zookeeper-client reference chroot'ed for their needs. So it'd be much better, if the ClientCnxn would not care about the chroot. - deleteRecursive does not belong to the other methods DeleteRecursive has been committed to trunk already as a method to the zookeeper class. So in the API it has the same level as the atomic operations create, delete, getData, setData, etc. The user must get the false impression, that deleteRecursive is also an atomic operation. It would be better to have deleteRecursive in some helper class but not that deep in zookeeper's core code. Maybe I'd like to have another policy on how to react if deleteRecursive fails in the middle of its work? - massive code duplication in zookeeper class Each operation calls validatePath, handles the chroot, calls ClientCnxn and checks the return header for error. I'd like to address this with the operation classes: Each operation should receive a prechecked Path object. Calling ClientCnxn and error checking is not (or only partly) the concern of the operation but of an executor like class. 
- stat is returned by parameter Since one can return only one value in java it's the only choice to do so. Still it feels more like C then like Java. However with operator classes one could simply get the result values with getter functions after the execution. - stat calls static method on org.apache.zookeeper.server.DataTree It's a huge jump from client code to the internal server class DataTree. Shouldn't there rather be some class related to the protobuffer stat class that knows how to copy a stat? - Session class? Maybe it'd make sense to combine hostlist, sessionId, sessionPassword and sessionTimeout in a Session class so that the ctor of ClientCnxn won't get too long? I may have missed some items. :-) Once again, please excuse my harsh tone. May I put the above issues in jira and would you accept (backwards compatible) patches for it for 3.4.0? Zookeeper is a fascinating project. Cudos to the devs. I've only looked in the client side code, which is what most users of zookeeper will ever see if they see any zookeeper internal code at all. So it may make sense to make this piece of the project as nice and clean as possible. [1] http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper- dev/201005.mbox/%3c201005261509.54236.tho...@koch.ro%3e Best regards, Thomas Koch, http://www.koch.ro
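The Session class idea at the end of the quoted mail is easy to picture; a hypothetical sketch follows. Nothing like it exists in the client, and the field set is only a guess at what the ClientCnxn constructor would want bundled together.
{code}
// Hypothetical value object; nothing like this exists in the ZooKeeper client.
public final class Session {
    private final String hostList;        // could later become a HostSet type
    private final int sessionTimeoutMs;
    private final long sessionId;
    private final byte[] sessionPasswd;

    public Session(String hostList, int sessionTimeoutMs,
                   long sessionId, byte[] sessionPasswd) {
        this.hostList = hostList;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.sessionId = sessionId;
        this.sessionPasswd = sessionPasswd.clone();   // defensive copy
    }

    public String getHostList()      { return hostList; }
    public int getSessionTimeoutMs() { return sessionTimeoutMs; }
    public long getSessionId()       { return sessionId; }
    public byte[] getSessionPasswd() { return sessionPasswd.clone(); }
}
{code}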
[jira] Commented: (ZOOKEEPER-795) eventThread isn't shutdown after a connection session expired event coming
[ https://issues.apache.org/jira/browse/ZOOKEEPER-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897366#action_12897366 ] Mahadev konar commented on ZOOKEEPER-795: - ben, pinging again, can you provide a patch for 3.3.2? eventThread isn't shutdown after a connection session expired event coming Key: ZOOKEEPER-795 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-795 Project: Zookeeper Issue Type: Bug Components: java client Affects Versions: 3.3.1 Environment: ubuntu 10.04 Reporter: mathieu barcikowski Assignee: Sergey Doroshenko Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ExpiredSessionThreadLeak.java, ZOOKEEPER-795.patch, ZOOKEEPER-795.patch Hi, I noticed a problem with the eventThread located in the ClientCnxn.java file. The eventThread isn't shut down after a connection session expired event arrives (i.e. it never receives the EventOfDeath). When a session timeout occurs and the session is marked as expired, the connection is fully closed (socket, SendThread...) except for the eventThread. As a result, if I create a new zookeeper object and connect through it, I get a zombie thread which will never be killed (for the previous zookeeper object, the state is already closed, so calling close again doesn't do anything). So every time I create a new zookeeper connection after an expired session, I will have one more zombie EventThread. How to reproduce:
- Start a zookeeper client connection in debug mode
- Pause the jvm long enough for the expired event to occur
- Watch the list of threads, for example with jvisualvm: the sendThread is successfully killed, but the EventThread goes into a wait state for an infinite amount of time
- If you reopen a new zookeeper connection and repeat the previous steps, another EventThread will be present in an infinite wait state
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
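The direction of a fix is straightforward to sketch, though the names below are only approximations of the client internals and this is not the attached ZOOKEEPER-795.patch: when the connection is torn down for good on session expiry, the event thread needs to be handed its shutdown event as well, otherwise it blocks on its queue forever.
{code}
// Approximate sketch only; member names are guesses at the client internals.
void onSessionExpired() {
    state = ZooKeeper.States.CLOSED;     // no further reconnect attempts
    eventThread.queueEventOfDeath();     // hand the event thread its poison
                                         // pill so EventThread.run() returns
}
{code}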
[jira] Updated: (ZOOKEEPER-772) zkpython segfaults when watcher from async get children is invoked.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-772: Status: Resolved (was: Patch Available) Resolution: Fixed Luckily the 3.4 patch applies to 3.3 branch. I just committed this. thanks henry! zkpython segfaults when watcher from async get children is invoked. --- Key: ZOOKEEPER-772 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-772 Project: Zookeeper Issue Type: Bug Components: contrib-bindings Environment: ubuntu lucid (10.04) / zk trunk Reporter: Kapil Thangavelu Assignee: Henry Robinson Fix For: 3.3.2, 3.4.0 Attachments: asyncgetchildren.py, zkpython-testasyncgetchildren.diff, ZOOKEEPER-772.patch, ZOOKEEPER-772.patch When utilizing the zkpython async get children api with a watch, i consistently get segfaults when the watcher is invoked to process events. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-780) zkCli.sh generates a ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-780: Status: Open (was: Patch Available) zkCli.sh generates a ArrayIndexOutOfBoundsException - Key: ZOOKEEPER-780 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-780 Project: Zookeeper Issue Type: Bug Components: scripts Affects Versions: 3.3.1 Environment: Linux Ubuntu running in VMPlayer on top of Windows XP Reporter: Miguel Correia Assignee: Andrei Savu Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-780.patch, ZOOKEEPER-780.patch, ZOOKEEPER-780.patch I'm starting to play with Zookeeper so I'm still running it in standalone mode. This is not a big issue, but here it goes for the records. I've run zkCli.sh to run some commands in the server. I created a znode /groups. When I tried to create a znode client_1 inside /groups, I forgot to include the data: an exception was generated and zkCli-sh crashed, instead of just showing an error. I tried a few variations and it seems like the problem is not including the data. A copy of the screen: [zk: localhost:2181(CONNECTED) 3] create /groups firstgroup Created /groups [zk: localhost:2181(CONNECTED) 4] create -e /groups/client_1 Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 3 at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:678) at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:581) at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:353) at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:311) at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:270) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-780) zkCli.sh generates a ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-780: Status: Patch Available (was: Open) retriggereing the patch build. zkCli.sh generates a ArrayIndexOutOfBoundsException - Key: ZOOKEEPER-780 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-780 Project: Zookeeper Issue Type: Bug Components: scripts Affects Versions: 3.3.1 Environment: Linux Ubuntu running in VMPlayer on top of Windows XP Reporter: Miguel Correia Assignee: Andrei Savu Priority: Minor Fix For: 3.4.0 Attachments: ZOOKEEPER-780.patch, ZOOKEEPER-780.patch, ZOOKEEPER-780.patch I'm starting to play with Zookeeper so I'm still running it in standalone mode. This is not a big issue, but here it goes for the records. I've run zkCli.sh to run some commands in the server. I created a znode /groups. When I tried to create a znode client_1 inside /groups, I forgot to include the data: an exception was generated and zkCli-sh crashed, instead of just showing an error. I tried a few variations and it seems like the problem is not including the data. A copy of the screen: [zk: localhost:2181(CONNECTED) 3] create /groups firstgroup Created /groups [zk: localhost:2181(CONNECTED) 4] create -e /groups/client_1 Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 3 at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:678) at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:581) at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:353) at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:311) at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:270) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
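For context, the crash is simply a missing argument-count check before the command indexes into its argument array; a sketch of the kind of guard involved is below (hypothetical, not the attached ZOOKEEPER-780 patch).
{code}
// Hypothetical guard: make sure a data argument is actually present before
// indexing into the argument array.
int i = 1;
while (i < args.length && (args[i].equals("-e") || args[i].equals("-s"))) {
    i++;                                   // skip ephemeral/sequential flags
}
if (args.length <= i + 1) {                // need both a path and a data argument
    System.err.println("usage: create [-s] [-e] path data [acl]");
    return false;
}
String path = args[i];
byte[] data = args[i + 1].getBytes();
{code}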
[jira] Commented: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897121#action_12897121 ] Mahadev konar commented on ZOOKEEPER-775: - also, you might want to get rid of this: {code} +% Developer's Guide +% Yang Zhang {code} in src/contrib/hedwig/doc/dev.txt Other than that this looks good. One question about protobufs, is it apache license? What all are we committing in zookeeper svn thats related to protobufs? A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897122#action_12897122 ] Mahadev konar commented on ZOOKEEPER-775: - sorry I am reviewing and making sure I am uploading my comments, it would have been better for me to write it down in a file and then uploaded at the same time :). - make sure that you list the files which need to be executable in svn. While committing the svn executable flag is usually lost. So, you will have to provide with a list of files that need to have special svn flags. A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-775) A large scale pub/sub system
[ https://issues.apache.org/jira/browse/ZOOKEEPER-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897126#action_12897126 ] Mahadev konar commented on ZOOKEEPER-775: - one more comment :) - we are checking in some autogenerated files like ltmain.sh and others. Why do we need to commit those? A large scale pub/sub system Key: ZOOKEEPER-775 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-775 Project: Zookeeper Issue Type: New Feature Components: contrib Reporter: Benjamin Reed Assignee: Benjamin Reed Fix For: 3.4.0 Attachments: libs.zip, libs_2.zip, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775.patch, ZOOKEEPER-775_2.patch, ZOOKEEPER-775_3.patch we have developed a large scale pub/sub system based on ZooKeeper and BookKeeper. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (ZOOKEEPER-800) zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE
[ https://issues.apache.org/jira/browse/ZOOKEEPER-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar reassigned ZOOKEEPER-800: --- Assignee: Michi Mutsuzaki (was: Mahadev konar) zoo_add_auth returns ZOK if zookeeper handle is in ZOO_CLOSED_STATE --- Key: ZOOKEEPER-800 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-800 Project: Zookeeper Issue Type: Bug Components: c client Affects Versions: 3.3.1 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Priority: Minor Fix For: 3.3.2, 3.4.0 This happened when I called zoo_add_auth() immediately after zookeeper_init(). It took me a while to figure out that authentication actually failed since zoo_add_auth() returned ZOK. It should return ZINVALIDSTATE instead. --Michi -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.