[ https://issues.apache.org/jira/browse/ZOOKEEPER-4885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898105#comment-17898105 ]
Xin Chen edited comment on ZOOKEEPER-4885 at 11/14/24 2:11 AM:
---------------------------------------------------------------

So {*}there is a real scenario in the production environment{*}: a *Curator client* is used to create an ephemeral node, as in this HiveServer2 code:
{code:java}
zooKeeperClient = CuratorFrameworkFactory.builder()
    .connectString(zooKeeperEnsemble)
    .sessionTimeoutMs(sessionTimeout)
    .aclProvider(zooKeeperAclProvider)
    .retryPolicy(new ExponentialBackoffRetry(baseSleepTime, maxRetries))
    .build();
zooKeeperClient.start();

PersistentEphemeralNode znode = new PersistentEphemeralNode(zooKeeperClient,
    PersistentEphemeralNode.Mode.EPHEMERAL_SEQUENTIAL, pathPrefix, znodeDataUTF8);
znode.start();
{code}
PersistentEphemeralNode registers a background callback:
{code:java}
backgroundCallback = new BackgroundCallback()
{
    @Override
    public void processResult(CuratorFramework client, CuratorEvent event) throws Exception
    {
        String path = null;
        if ( event.getResultCode() == KeeperException.Code.NODEEXISTS.intValue() )
        {
            path = event.getPath();
        }
        else if ( event.getResultCode() == KeeperException.Code.OK.intValue() )
        {
            path = event.getName();
        }

        if ( path != null )
        {
            nodePath.set(path);
            watchNode();

            CountDownLatch localLatch = initialCreateLatch.getAndSet(null);
            if ( localLatch != null )
            {
                localLatch.countDown();
            }
        }
        else
        {
            createNode();
        }
    }
};

// createNode() registers this backgroundCallback again:
createMethod.withMode(mode.getCreateMode(existingPath != null)).inBackground(backgroundCallback).forPath(createPath, data.get());
{code}
When HiveServer2 is disconnected from ZooKeeper and the network later reconnects, Curator automatically rebuilds the client. If the Kerberos login happens to fail at that moment, a non-SASL-authenticated client is created, and Curator uses it to rebuild the PersistentEphemeralNode under "/hive". Every create attempt then fails with:
{code:java}
KeeperErrorCode = InvalidACL for /hive/hiveserver2-0...
{code}
Because {*}backgroundCallback#processResult{*} only checks for NODEEXISTS and OK, any other result code falls through to createNode(), which registers the same callback again; the error callback fires, createNode() is called again, and so on. It enters an infinite loop. Since the retrying happens entirely inside Curator's callback processing, the HiveServer2 ZooKeeper client is completely unaware, while the server is flooded with InvalidACL logs from the rapid createNode() calls and its CPU load rises sharply.
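For illustration only, here is a minimal sketch (this is not Curator's actual code) of how such a background callback could treat INVALIDACL/NOAUTH/AUTHFAILED as non-retriable and break the loop, instead of unconditionally calling createNode() again. The onSuccess(), giveUp() and createNode() helpers are placeholders for the real handling, not Curator API:
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.curator.framework.api.CuratorEvent;
import org.apache.zookeeper.KeeperException;

// Illustrative only: a create callback that stops retrying on auth/ACL errors.
public class GuardedCreateCallback implements BackgroundCallback {
    @Override
    public void processResult(CuratorFramework client, CuratorEvent event) throws Exception {
        KeeperException.Code code = KeeperException.Code.get(event.getResultCode());
        if (code == KeeperException.Code.OK || code == KeeperException.Code.NODEEXISTS) {
            onSuccess(event);                  // set nodePath, watch the node, release the latch
        } else if (code == KeeperException.Code.INVALIDACL
                || code == KeeperException.Code.NOAUTH
                || code == KeeperException.Code.AUTHFAILED) {
            giveUp(code, event.getPath());     // not retriable with a non-SASL session: stop the loop
        } else {
            createNode();                      // other errors are treated as transient and retried
        }
    }

    private void onSuccess(CuratorEvent event) { /* placeholder */ }

    private void giveUp(KeeperException.Code code, String path) {
        System.err.println("Giving up on " + path + ": " + code);  // placeholder for real error reporting
    }

    private void createNode() { /* placeholder for the real re-create logic */ }
}
{code}
The point is only that a result code other than OK/NODEEXISTS is not necessarily transient; whatever the exact recovery policy, distinguishing retriable from non-retriable codes would keep the callback from hammering the server.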
> Can Non-SASL-Clients automatically recover with the recovery of kerberos communication?
> -----------------------------------------------------------------------------------------
>
>                  Key: ZOOKEEPER-4885
>                  URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4885
>              Project: ZooKeeper
>           Issue Type: Improvement
>     Affects Versions: 3.4.14, 3.6.4, 3.9.3
>             Reporter: Xin Chen
>             Priority: Major
>
> ZOOKEEPER-2139 and ZOOKEEPER-2323 only keep ZooKeeper clients out of an infinite AuthFailedException loop; NoAuth exceptions still occur.
> A LoginException is thrown on every login attempt, but at that point a zkclient without Kerberos SASL authentication has already been created, and it can still operate on znodes that require no SASL authentication. However, when Kerberos recovers from the network disconnection or other anomaly, that previously created non-SASL zkclient keeps being used: the login is never rebuilt and no new SASL client is created.
> If that client is then used to operate on ACL-protected znodes, an error is reported every time:
> {code:java}
> KeeperErrorCode = NoAuth for /zookeeper
> or
> KeeperErrorCode = AuthFailed for /zookeeper
> or
> KeeperErrorCode = InvalidACL for /zookeeper
> {code}
> Isn't this something that should be handled? I also hit this issue on ZooKeeper 3.6.4, so it appears the newer versions still do not address it.
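As a sketch of one possible client-side workaround (not an existing ZooKeeper or Curator feature, and assuming the client actually surfaces KeeperState.AuthFailed when SASL authentication fails, which depends on configuration and on the failure mode described above): close and rebuild the ZooKeeper handle on AuthFailed, so that a later attempt, made once the Kerberos login has recovered, produces a properly SASL-authenticated session. The connect string, timeout and retry delay below are placeholders:
{code:java}
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch of a wrapper that rebuilds its ZooKeeper handle after an AuthFailed event,
// so a later attempt (made once Kerberos has recovered) yields a SASL-authenticated session.
public class AuthAwareClient implements Watcher {
    private final String connectString;
    private final int sessionTimeoutMs;
    private final ScheduledExecutorService rebuilder = Executors.newSingleThreadScheduledExecutor();
    private volatile ZooKeeper zk;

    public AuthAwareClient(String connectString, int sessionTimeoutMs) throws IOException {
        this.connectString = connectString;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.zk = new ZooKeeper(connectString, sessionTimeoutMs, this);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.AuthFailed) {
            // This handle will never become SASL-authenticated; rebuild it off the event thread,
            // after a delay that gives the Kerberos login a chance to recover.
            rebuilder.schedule(this::rebuild, 30, TimeUnit.SECONDS);
        }
    }

    private void rebuild() {
        try {
            ZooKeeper old = zk;
            if (old != null) {
                old.close();
            }
            zk = new ZooKeeper(connectString, sessionTimeoutMs, this);
        } catch (IOException | InterruptedException e) {
            // Still failing (e.g. Kerberos not back yet): try again later.
            rebuilder.schedule(this::rebuild, 30, TimeUnit.SECONDS);
        }
    }
}
{code}
This only illustrates the shape of a workaround at the application level; whether ZooKeeper itself should re-attempt the SASL login once Kerberos recovers is exactly the question this issue raises.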