[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229852#comment-15229852
 ] 

Vinayakumar B commented on ZOOKEEPER-2175:
------------------------------------------

bq. By seeing this I still have doubt in my mind why both NN has transition to 
standby, but I couldn't find the reason what has happened to the ZooKeeper 
client due to the sessionId corruption.
This happens because of the following sequence:
1. The ZK client establishes a session and the server returns the session ID 
to the client. Because of bit corruption, the ID the client receives differs 
from the server's session ID.
2. The client creates the ephemeral node. Node creation succeeds on the server 
side and the callback also reports success, so the transition to Active 
happens.
3. The client then sets the exists watcher and verifies the stat returned in 
the callback. That verification includes the session ID, which now differs, so 
the current Active changes back to Standby.
4. The client still has an active session and continues to heartbeat to the 
ZK server, so the server-side session also stays active and does not expire. 
The corrupted session therefore remains active on both the server and the 
client.
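
As a rough illustration of step 3 (this is not code from ZooKeeper or Hadoop), 
the sketch below takes the two session IDs from the logs in the issue 
description, confirms that they differ in exactly one bit, and shows why an 
ownership check fails if it compares the znode's ephemeral owner with the 
client's own session ID. The class and method names are made up for 
illustration, and the plain equality check is an assumption about what the 
verification amounts to.

```java
final class SessionIdMismatch {
    // Session IDs copied from the logs in the issue description.
    static final long SERVER_SESSION_ID = 0x164cb2b3e4b36ae4L;
    static final long CLIENT_SESSION_ID = 0x144cb2b3e4b36ae4L;

    /** Number of bit positions in which the two IDs differ. */
    static int differingBits(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Hypothetical stand-in for the check in step 3: the client believes it
     * owns the lock only if the znode's ephemeral owner equals its session ID.
     */
    static boolean ownsLock(long ephemeralOwner, long localSessionId) {
        return ephemeralOwner == localSessionId;
    }

    public static void main(String[] args) {
        long diff = SERVER_SESSION_ID ^ CLIENT_SESSION_ID;
        System.out.println("differing bits: " + differingBits(SERVER_SESSION_ID, CLIENT_SESSION_ID));
        System.out.println("flipped bit index (0 = least significant): " + Long.numberOfTrailingZeros(diff));
        // The server records its own ID as the ephemeral owner; the client compares
        // that against the corrupted ID it received, so the check fails.
        System.out.println("client owns lock: " + ownsLock(SERVER_SESSION_ID, CLIENT_SESSION_ID));
    }
}
```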

Either of the following could be done to make some progress, IMO:
1. To fix this specific session ID corruption case, include the session ID in 
the heartbeat messages sent to the server and validate it on the server side. 
Invalid or corrupted sessions can then be closed immediately.
2. Add generic checksum validation for all request/response packets exchanged 
between the ZK client and server. As Rakesh said, this might have a 
performance impact.
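
To make option 2 a bit more concrete, here is a minimal sketch of per-packet 
checksum validation using the JDK's java.util.zip.CRC32. The framing (payload 
followed by a 4-byte CRC) and the class name ChecksummedFrame are assumptions 
for illustration only; this is not ZooKeeper's wire format.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class ChecksummedFrame {
    /** Appends a 4-byte CRC32 of the payload (hypothetical framing). */
    static byte[] wrap(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    /** Returns true if the trailing CRC32 matches the payload bytes. */
    static boolean verify(byte[] frame) {
        if (frame.length < 4) {
            return false; // too short to even carry a checksum
        }
        int payloadLength = frame.length - 4;
        CRC32 crc = new CRC32();
        crc.update(frame, 0, payloadLength);
        int stored = ByteBuffer.wrap(frame, payloadLength, 4).getInt();
        return stored == (int) crc.getValue();
    }

    public static void main(String[] args) {
        // Payload: the server-side session ID from the issue description.
        byte[] frame = wrap(ByteBuffer.allocate(8).putLong(0x164cb2b3e4b36ae4L).array());
        System.out.println("clean frame valid: " + verify(frame));
        frame[0] ^= 0x02; // flip one bit, turning 0x16... into 0x14...
        System.out.println("corrupted frame valid: " + verify(frame));
    }
}
```

The extra CRC computation on every packet is where the performance concern 
comes from.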

So #1 should be relatively easy and feasible.
Please share your opinions on this.
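
For discussion, option 1 could look roughly like the sketch below: the server 
remembers which session ID it assigned to each connection and closes the 
session as soon as a heartbeat claims a different ID. Everything here 
(HeartbeatSessionValidator, the String connection key, a heartbeat that 
carries the session ID) is hypothetical and only illustrates the idea; it is 
not existing ZooKeeper server code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HeartbeatSessionValidator {
    // Hypothetical server-side table: connection key -> session ID assigned by the server.
    private final Map<String, Long> sessionByConnection = new ConcurrentHashMap<>();

    /** Records the session ID the server assigned when the session was established. */
    void register(String connectionId, long sessionId) {
        sessionByConnection.put(connectionId, sessionId);
    }

    /**
     * Validates the session ID carried in a heartbeat. Returns true if it matches
     * the ID the server assigned; otherwise drops the session and returns false
     * so the caller can close the connection immediately.
     */
    boolean onHeartbeat(String connectionId, long claimedSessionId) {
        Long expected = sessionByConnection.get(connectionId);
        if (expected != null && expected == claimedSessionId) {
            return true;
        }
        sessionByConnection.remove(connectionId);
        return false;
    }

    /** True while the server still tracks a session for this connection. */
    boolean isActive(String connectionId) {
        return sessionByConnection.containsKey(connectionId);
    }
}
```

With the IDs from this issue, the server would register 0x164cb2b3e4b36ae4 
and the first heartbeat carrying 0x144cb2b3e4b36ae4 would be rejected, so the 
corrupted session would not stay alive on both sides.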


> Checksum validation for malformed packets needs to handle.
> ----------------------------------------------------------
>
>                 Key: ZOOKEEPER-2175
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2175
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>
>  *Session Id from ZK :* 
> 2015-04-15 21:24:54,257 | INFO  | CommitProcessor:22 | Established session 
> 0x164cb2b3e4b36ae4 with negotiated timeout 45000 for client 
> /160.149.0.117:44586 | 
> org.apache.zookeeper.server.ZooKeeperServer.finishSessionInit(ZooKeeperServer.java:623)
> 2015-04-15 21:24:54,261 | INFO  | 
> NIOServerCxn.Factory:160-149-0-114/160.149.0.114:24002 | Successfully 
> authenticated client: authenticationID=hdfs/[email protected];  
> authorizationID=hdfs/[email protected]. | 
> org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:118)
> 2015-04-15 21:24:54,261 | INFO  | 
> NIOServerCxn.Factory:160-149-0-114/160.149.0.114:24002 | Setting 
> authorizedID: hdfs/[email protected] | 
> org.apache.zookeeper.server.auth.SaslServerCallbackHandler.handleAuthorizeCallback(SaslServerCallbackHandler.java:134)
> 2015-04-15 21:24:54,261 | INFO  | 
> NIOServerCxn.Factory:160-149-0-114/160.149.0.114:24002 | adding SASL 
> authorization for authorizationID: hdfs/[email protected] | 
> org.apache.zookeeper.server.ZooKeeperServer.processSasl(ZooKeeperServer.java:1009)
> 2015-04-15 21:24:54,262 | INFO  | ProcessThread(sid:22 cport:-1): | Got 
> user-level KeeperException when processing  
> *{color:red}sessionid:0x164cb2b3e4b36ae4{color}*  type:create cxid:0x3 
> zxid:0x20009fafc txntype:-1 reqpath:n/a Error 
> Path:/hadoop-ha/hacluster/ActiveStandbyElectorLock Error:KeeperErrorCode = 
> NodeExists for /hadoop-ha/hacluster/ActiveStandbyElectorLock | 
> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:648)
>  *ZKFC Received :*  ZK client
> 2015-04-15 21:24:54,237 | INFO  | main-SendThread(160-149-0-114:24002) | 
> Socket connection established to 160-149-0-114/160.149.0.114:24002, 
> initiating session | 
> org.apache.zookeeper.ClientCnxn$SendThread.primeConnection(ClientCnxn.java:854)
> 2015-04-15 21:24:54,257 | INFO  | main-SendThread(160-149-0-114:24002) | 
> Session establishment complete on server 160-149-0-114/160.149.0.114:24002,  
> *{color:blue}sessionid = 0x144cb2b3e4b36ae4 {color}* , negotiated timeout = 
> 45000 | 
> org.apache.zookeeper.ClientCnxn$SendThread.onConnected(ClientCnxn.java:1259)
> 2015-04-15 21:24:54,260 | INFO  | main-EventThread | EventThread shut down | 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:512)
> 2015-04-15 21:24:54,262 | INFO  | main-EventThread | Session connected. | 
> org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:547)
> 2015-04-15 21:24:54,264 | INFO  | main-EventThread | Successfully 
> authenticated to ZooKeeper using SASL. | 
> org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:573)
> One bit is corrupted. Compare the binary representations of the two IDs:
> 144cb2b3e4b36ae4=1010001001100101100101011001111100100101100110110101011100100
> 164cb2b3e4b36ae4=1011001001100101100101011001111100100101100110110101011100100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
