[jira] [Commented] (ZOOKEEPER-4306) CloseSessionTxn contains too many ephemal nodes cause cluster crash

Lin Changrui (Jira) Thu, 17 Jun 2021 23:57:06 -0700


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365277#comment-17365277
 ]


Lin Changrui commented on ZOOKEEPER-4306:
-----------------------------------------

Hi [~ztzg],
Thanks for your contribution; I have looked at your commit. Adding a new 
KeeperException and performing the check before an ephemeral node is created 
should resolve the problem. It looks correct to me, and I did not find any 
loopholes.
I think it would be easier for others to notice this limitation if we added 
some JavaDoc to ZooKeeper.create. Would you agree? :D
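
To make the idea of "check before creating an ephemeral node" concrete, here 
is a minimal sketch of a per-session byte budget. It is not the code from the 
commit: the class EphemeralBudget, the method tryReserve and the 4-byte length 
prefix per path are all assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-session budget for ephemeral paths; not ZooKeeper's actual code. */
public class EphemeralBudget {
    // Assumption: each path costs a 4-byte length prefix plus its UTF-8 bytes,
    // and the path list itself starts with a 4-byte element count.
    private static final int LENGTH_PREFIX = 4;

    private final long maxBytes;
    private final Map<Long, Long> bytesBySession = new HashMap<>();

    public EphemeralBudget(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Records the path against the session and returns true if it still fits;
     * returns false (recording nothing) if the caller should reject the create.
     */
    public synchronized boolean tryReserve(long sessionId, String path) {
        long cost = LENGTH_PREFIX + path.getBytes(StandardCharsets.UTF_8).length;
        long used = bytesBySession.getOrDefault(sessionId, (long) LENGTH_PREFIX);
        if (used + cost > maxBytes) {
            return false;
        }
        bytesBySession.put(sessionId, used + cost);
        return true;
    }
}
```

A server-side caller would map a false result to the new KeeperException. A 
real implementation would also have to release the budget when a node is 
deleted, which this sketch leaves out.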

> CloseSessionTxn contains too many ephemal nodes cause cluster crash
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4306
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4306
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.6.2
>            Reporter: Lin Changrui
>            Priority: Critical
>         Attachments: cs.jpg, f.jpg, l1.png, l2.jpg, r.jpg
>
>
> We ran a test to see how many ephemeral nodes a client can create under one 
> parent node with the default configuration. The test eventually crashed the 
> cluster, with exception stack traces like the following.
> follower:
> !f.jpg!
> leader:
> !l1.png!
> !l2.jpg!
> It seems that the leader sent an overly large txn packet to the followers. 
> When a follower tried to deserialize the txn, it found that the txn length 
> exceeded its buffer size (default 1MB+1MB, jute.maxbuffer + 
> jute.maxbuffer.extrasize). That crashed the followers. The leader then found 
> that too few followers were synced, so it shut down. During shutdown it 
> called zkDb.fastForwardDataBase(), found that the txn read from the txnlog 
> exceeded its buffer size, and crashed too.
> After the servers crashed, they tried to restart the quorum, but they could 
> not succeed because the last txn was too large. We lost the log from that 
> moment, but the stack trace was the same as this one.
> !r.jpg|width=1468,height=598!
>  
> *Root Cause*
> We used org.apache.zookeeper.server.LogFormatter (with 
> -Djute.maxbuffer=74827780) to visualize this log and found the following. 
> !cs.jpg|width=1400,height=581! So closeSessionTxn contains all ephemeral 
> nodes with their absolute paths. We know that creating too many children 
> under one parent node produces a large getChildren response, which is 
> limited by the client's jute.maxbuffer. If one session creates plenty of 
> ephemeral nodes under different parent nodes, the client's buffer may not 
> overflow, but when the session closes without deleting these nodes first, 
> it will probably crash the cluster.
> Is this a bug or just an unspecified feature? If the latter, how should we 
> determine the upper limit on the number of nodes to create?
>  
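
As a rough way to reason about the upper limit asked about in the description, 
the sketch below estimates how large the path list of a closeSessionTxn 
becomes. It assumes the paths are serialized as a 4-byte count followed by a 
4-byte length plus the UTF-8 bytes for each path. The class and method names 
are hypothetical, and the 2 MB limit is just the "1MB+1MB" default from the 
report, rounded.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Back-of-the-envelope estimate; the layout is an assumption, not a spec. */
public class CloseSessionSizeEstimate {

    /** Approximate serialized size: 4-byte count + per path (4-byte length + UTF-8 bytes). */
    public static long estimateBytes(List<String> paths) {
        long total = 4;
        for (String p : paths) {
            total += 4 + p.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    /** Rough upper bound on the number of paths when every path has the same UTF-8 length. */
    public static long maxPaths(long limitBytes, int pathBytes) {
        return (limitBytes - 4) / (4 + pathBytes);
    }

    public static void main(String[] args) {
        long limit = 2L * 1024 * 1024; // assumed: jute.maxbuffer + jute.maxbuffer.extrasize
        int pathBytes = 40;            // assumed average absolute path length in bytes
        System.out.println("paths that fit: " + maxPaths(limit, pathBytes));
    }
}
```

The estimate ignores the txn header and any other framing, so the real limit 
would be somewhat lower.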



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
