[ 
https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junegunn Choi resolved HBASE-30101.
-----------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.5.15
                   2.6.6
       Resolution: Fixed

Fix pushed to all active branches:

* master
* branch-3
* branch-2
* branch-2.6
* branch-2.5

Thanks to [~zhangduo] for the review!


> Stray TGT Renewer from RpcServer accessing UGI before Kerberos login
> --------------------------------------------------------------------
>
>                 Key: HBASE-30101
>                 URL: https://issues.apache.org/jira/browse/HBASE-30101
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.15, 2.6.6
>
>         Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Problem
> The {{RpcServer}} can access {{UserGroupInformation}} before Kerberos login 
> completes during startup. When this happens, UGI bootstraps from the ticket 
> cache and spawns a stray {{TGT Renewer}} thread for whichever principal 
> happens to be there, even when it does not match the principal the server is 
> configured to use.
> Two independent code paths trigger this:
> h3. 1. NettyRpcServer accepts connections before start()
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup 
> (SASL secret manager, authorization manager, scheduler) only runs in 
> {{start()}}. In that window, netty workers can accept connections and run 
> handler code that reaches into UGI before the main thread has finished 
> Kerberos login.
> Fixed in [PR #8110|https://github.com/apache/hbase/pull/8110]: disable 
> {{AUTO_READ}} on the server channel at bootstrap, re-enable at the end of 
> {{start()}}.
> h3. 2. RpcServer constructor reads UGI (HBase 2.6+)
> Since [HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR 
> [#5688|https://github.com/apache/hbase/pull/5688]), the {{RpcServer}} 
> constructor calls:
> {code:java}
> serverPrincipal = 
> Preconditions.checkNotNull(userProvider.getCurrentUserName(),
>   "can not get current user name when security is enabled");
> {code}
> {{userProvider.getCurrentUserName()}} calls 
> {{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal 
> is in the ticket cache if the keytab login has not yet happened. Because this 
> happens at construction time, before any inbound connection, the 
> connection-side fix in path 1 cannot prevent the stray {{TGT Renewer}} here.
> Fixed in [PR #8122|https://github.com/apache/hbase/pull/8122]: resolve the 
> hostname up front via {{DNS.getHostname}} and run ZK client and server logins 
> before {{createRpcServices()}}.
> h2. Observed symptoms
> h3. HBase 2.4 (path 1)
> Handler code error:
> {code}
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] 
> provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal 
> name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error 
> when trying to create instance of HBaseSaslRpcServer with sasl provider: 
> org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal 
> does NOT contain an instance (hostname): hbase
> {code}
> Stray {{TGT Renewer}} thread, continuously emitting:
> {code}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or 
> directory
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
>     at org.apache.hadoop.util.Shell.run(Shell.java:900)
>     at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
>     at 
> org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
>     at 
> org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> {code}
> Connections accepted before keytab login completes:
> {code}
> 2026-04-20 16:31:07,446 INFO  [main] ipc.NettyRpcServer: Bind to 
> /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: 
> Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: 
> Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: 
> Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: 
> Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: 
> Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,583 INFO  [main] security.UserGroupInformation: Login 
> successful for user ***/*** using keytab file hbase.keytab. Keytab auto 
> renewal enabled : false
> {code}
> h3. HBase 2.6+ (path 2)
> Stray {{TGT Renewer}} for a principal that does not match the server's 
> configured principal:
> {noformat}
> 2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]] 
> security.UserGroupInformation: Current time is 1776994027173, next refresh is 
> 1777028575000
> {noformat}
> Thread dump showing the renewer spawned from the {{RpcServer}} constructor 
> (before {{login()}}):
> {noformat}
> Thread 28 (TGT Renewer for [email protected]):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 1
>   Stack:
>     [email protected]/java.lang.Thread.sleep(Native Method)
>     
> app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
>     
> [email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     [email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     
> [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     
> [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     [email protected]/java.lang.Thread.run(Thread.java:829)
> {noformat}
> h2. Fixes are complementary
> PR [#8110|https://github.com/apache/hbase/pull/8110] and PR 
> [#8122|https://github.com/apache/hbase/pull/8122] should be merged together:
> * PR #8122 fixes the initialization order so that Kerberos login happens 
> before the {{RpcServer}} is constructed. This is the root-cause fix and 
> closes both paths on 2.6+.
> * PR #8110 enforces a separate invariant on {{NettyRpcServer}}: do not accept 
> connections before {{start()}} completes auth setup (SASL secret manager, 
> authorization manager, scheduler). It is defense-in-depth, covers path 1 on 
> releases without PR #8122 (e.g. 2.4), and ships with a regression test for 
> the invariant.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to