[ https://issues.apache.org/jira/browse/HBASE-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junegunn Choi resolved HBASE-30101.
-----------------------------------
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.5.15, 2.6.6
Resolution: Fixed
Fix pushed to all active branches:
* master
* branch-3
* branch-2
* branch-2.6
* branch-2.5
Thanks to [~zhangduo] for the review!
> Stray TGT Renewer from RpcServer accessing UGI before Kerberos login
> --------------------------------------------------------------------
>
> Key: HBASE-30101
> URL: https://issues.apache.org/jira/browse/HBASE-30101
> Project: HBase
> Issue Type: Bug
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Minor
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.5.15, 2.6.6
>
> Attachments: image-2026-04-22-13-47-59-850.png
>
>
> h2. Problem
> The {{RpcServer}} can access {{UserGroupInformation}} before Kerberos login
> completes during startup. When this happens, UGI bootstraps from the ticket
> cache and spawns a stray {{TGT Renewer}} thread for whichever principal
> happens to be there, even when it does not match the principal the server is
> configured to use.
> Two independent code paths trigger this:
> h3. 1. NettyRpcServer accepts connections before start()
> {{NettyRpcServer}} binds the server socket in its constructor, but auth setup
> (SASL secret manager, authorization manager, scheduler) only runs in
> {{start()}}. In that window, netty workers can accept connections and run
> handler code that reaches into UGI before the main thread has finished
> Kerberos login.
> Fixed in [PR #8110|https://github.com/apache/hbase/pull/8110]: disable
> {{AUTO_READ}} on the server channel at bootstrap, re-enable at the end of
> {{start()}}.
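The gating invariant behind this fix can be modeled in plain Java. The sketch below is not Netty or HBase code: {{GatedChannel}} and all of its members are hypothetical names that only illustrate "bind early, run no handler code until start() has finished auth setup". In Netty itself the switch is the {{AUTO_READ}} channel option; here it is a plain boolean.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a server channel; this is not Netty or HBase code.
final class GatedChannel {
    // Connections that arrived while reads were disabled (parked, as in a listen backlog).
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> handled = new ArrayList<>();
    private boolean autoRead = false;   // disabled at "bootstrap"
    private boolean authReady = false;  // set once auth setup has run

    // Called when a peer connects. While autoRead is off, the connection is
    // parked and no handler code runs.
    synchronized void onConnect(String remote) {
        if (autoRead) {
            handle(remote);
        } else {
            pending.add(remote);
        }
    }

    // Models start(): finish auth setup first, then re-enable reads and drain.
    synchronized void start() {
        authReady = true;
        autoRead = true;
        String remote;
        while ((remote = pending.poll()) != null) {
            handle(remote);
        }
    }

    private void handle(String remote) {
        if (!authReady) {
            throw new IllegalStateException("handler ran before auth setup: " + remote);
        }
        handled.add(remote);
    }

    synchronized int handledCount() {
        return handled.size();
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

In this model, connections that arrive between construction and start() are parked rather than dropped, and the handler only ever runs with auth setup complete.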
> h3. 2. RpcServer constructor reads UGI (HBase 2.6+)
> Since [HBASE-28321|https://issues.apache.org/jira/browse/HBASE-28321] (PR
> [#5688|https://github.com/apache/hbase/pull/5688]), the {{RpcServer}}
> constructor calls:
> {code:java}
> serverPrincipal =
>   Preconditions.checkNotNull(userProvider.getCurrentUserName(),
>     "can not get current user name when security is enabled");
> {code}
> {{userProvider.getCurrentUserName()}} calls
> {{UserGroupInformation.getCurrentUser()}}, which picks up whatever principal
> is in the ticket cache if the keytab login has not yet happened. Because this
> happens at construction time, before any inbound connection, the
> connection-side fix in path 1 cannot prevent the stray {{TGT Renewer}} here.
> Fixed in [PR #8122|https://github.com/apache/hbase/pull/8122]: resolve the
> hostname up front via {{DNS.getHostname}} and run ZK client and server logins
> before {{createRpcServices()}}.
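Why the ordering matters can be shown with a small plain-Java model. Nothing below is Hadoop or HBase code: {{FakeLoginState}} and {{FakeRpcServer}} are hypothetical stand-ins, the principals are made up, and the model assumes only what the description states, namely that reading the current user before an explicit login falls back to the ticket cache and starts a renewer for that principal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for UGI-like process-wide login state; not Hadoop code.
final class FakeLoginState {
    private final String ticketCachePrincipal;
    private String loginPrincipal;                    // null until a login happens
    private final List<String> renewers = new ArrayList<>();

    FakeLoginState(String ticketCachePrincipal) {
        this.ticketCachePrincipal = ticketCachePrincipal;
    }

    // Models the explicit keytab login for the configured server principal.
    void loginFromKeytab(String principal) {
        loginPrincipal = principal;
    }

    // Models the lazy bootstrap: with no prior login, fall back to the ticket
    // cache and start a renewer for whichever principal is found there.
    String currentUser() {
        if (loginPrincipal == null) {
            loginPrincipal = ticketCachePrincipal;
            renewers.add("TGT Renewer for " + ticketCachePrincipal);
        }
        return loginPrincipal;
    }

    List<String> renewers() {
        return renewers;
    }
}

// Hypothetical server whose constructor reads the current user, as in path 2.
final class FakeRpcServer {
    final String serverPrincipal;

    FakeRpcServer(FakeLoginState login) {
        this.serverPrincipal = login.currentUser();
    }
}
```

Constructing {{FakeRpcServer}} before {{loginFromKeytab}} leaves one renewer entry for the ticket-cache principal; logging in first leaves none, which is the ordering the fix establishes.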
> h2. Observed symptoms
> h3. HBase 2.4 (path 1)
> Error logged from handler code:
> {code}
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-3] provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name is hbase
> 2026-02-02 17:06:51,661 DEBUG [RS-EventLoopGroup-1-2] provider.GssSaslServerAuthenticationProvider: Server's Kerberos principal name is hbase
> 2026-02-02 17:06:51,662 ERROR [RS-EventLoopGroup-1-2] ipc.RpcServer: Error when trying to create instance of HBaseSaslRpcServer with sasl provider: org.apache.hadoop.hbase.security.provider.GssSaslServerAuthenticationProvider@4cf94801
> org.apache.hadoop.hbase.security.AccessDeniedException: Kerberos principal does NOT contain an instance (hostname): hbase
> {code}
> Stray {{TGT Renewer}} thread, continuously emitting:
> {code}
> java.io.IOException: Cannot run program "kinit": error=2, No such file or directory
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1170)
> at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1089)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> at org.apache.hadoop.util.Shell.run(Shell.java:900)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1212)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1306)
> at org.apache.hadoop.util.Shell.execCommand(Shell.java:1288)
> at org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupInformation.java:1061)
> at org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:972)
> {code}
> Connections accepted before keytab login completes:
> {code}
> 2026-04-20 16:31:07,446 INFO [main] ipc.NettyRpcServer: Bind to /10.228.198.67:11471
> 2026-04-20 16:31:07,458 TRACE [RS-EventLoopGroup-1-2] ipc.NettyRpcServer: Connection /10.227.143.27:31898; # active connections=0
> 2026-04-20 16:31:07,466 TRACE [RS-EventLoopGroup-1-3] ipc.NettyRpcServer: Connection /10.192.163.175:2080; # active connections=1
> 2026-04-20 16:31:07,472 TRACE [RS-EventLoopGroup-1-4] ipc.NettyRpcServer: Connection /10.227.225.174:25343; # active connections=2
> 2026-04-20 16:31:07,496 TRACE [RS-EventLoopGroup-1-5] ipc.NettyRpcServer: Connection /10.192.163.189:32775; # active connections=3
> 2026-04-20 16:31:07,497 TRACE [RS-EventLoopGroup-1-6] ipc.NettyRpcServer: Connection /10.192.163.203:43663; # active connections=4
> 2026-04-20 16:31:07,583 INFO [main] security.UserGroupInformation: Login successful for user ***/*** using keytab file hbase.keytab. Keytab auto renewal enabled : false
> {code}
> h3. HBase 2.6+ (path 2)
> Stray {{TGT Renewer}} for a principal that does not match the server's
> configured principal:
> {noformat}
> 2026-04-24T10:27:07,173 DEBUG [TGT Renewer for [email protected]]
> security.UserGroupInformation: Current time is 1776994027173, next refresh is
> 1777028575000
> {noformat}
> Thread dump showing the renewer spawned from the {{RpcServer}} constructor
> (before {{login()}}):
> {noformat}
> Thread 28 (TGT Renewer for [email protected]):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 1
> Stack:
> java.base/java.lang.Thread.sleep(Native Method)
> app//org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroupInformation.java:982)
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> java.base/java.lang.Thread.run(Thread.java:829)
> {noformat}
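For anyone checking a running JVM for the same symptom without taking a full thread dump, a minimal sketch using only the JDK follows. It relies on the thread-name prefix "TGT Renewer" seen in the dump above; the class and method names are hypothetical, and the code would have to run inside the server process (for example from a test or a debug hook) to see its threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of HBase or Hadoop.
final class StrayRenewerCheck {
    // Returns the names of all live threads in this JVM whose name starts
    // with "TGT Renewer", sorted for stable output.
    static List<String> findRenewerThreads() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith("TGT Renewer")) {
                names.add(t.getName());
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

Any returned name whose principal differs from the configured server principal would point to the stray renewer described here.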
> h2. Fixes are complementary
> PR [#8110|https://github.com/apache/hbase/pull/8110] and PR
> [#8122|https://github.com/apache/hbase/pull/8122] should be merged together:
> * PR #8122 fixes the initialization order so that Kerberos login happens
> before the {{RpcServer}} is constructed. This is the root-cause fix and
> closes both paths on 2.6+.
> * PR #8110 enforces a separate invariant on {{NettyRpcServer}}: do not accept
> connections before {{start()}} completes auth setup (SASL secret manager,
> authorization manager, scheduler). It is defense-in-depth, covers path 1 on
> releases without PR #8122 (e.g. 2.4), and ships with a regression test for
> the invariant.