ruiliang created FLINK-39274:
--------------------------------
Summary: The TaskManager (TM) cannot bypass the KDC login, and the delegation
token issued by the AM is never actually used.
Key: FLINK-39274
URL: https://issues.apache.org/jira/browse/FLINK-39274
Project: Flink
Issue Type: Bug
Affects Versions: 1.17.2
Environment: flink on yarn
Reporter: ruiliang
According to the documentation, the Kerberos configuration does not distinguish
between the AM and the TM.
flink-conf.yaml
{code:java}
security.kerberos.login.keytab: xx.keytab
security.kerberos.login.principal: xx_principal{code}
launch_container.sh
{code:java}
# The token file location is exported here, so the AM has clearly issued the delegation tokens.
export HADOOP_TOKEN_FILE_LOCATION="/data2/hadoop/yarn/local/usercache/hiidoagent/appcache/application_1773803886076_15646/container_e268_1773803886076_15646_01_000003/container_tokens"
..
# But the keytab file is still downloaded into the container.
export _REMOTE_KEYTAB_PATH="hdfs://xx/user/hiidoagent/.flink/application_1773803886076_15646/hiidoagent.keytab"
export HADOOP_USER_NAME="[email protected]"
export _LOCAL_KEYTAB_PATH="krb5.keytab"
export _KEYTAB_PRINCIPAL="hiidoagent"{code}
TM log
{code:java}
2026-03-18 17:49:23,394 INFO org.apache.flink.runtime.state.changelog.StateChangelogStorageLoader [] - StateChangelogStorageLoader initialized with shortcut names {memory,filesystem}.
2026-03-18 17:49:23,441 INFO org.apache.flink.runtime.security.token.hadoop.KerberosLoginProvider [] - Attempting to login to KDC using principal: hiidoagent keytab: /data2/hadoop/yarn/local/usercache/hiidoagent/appcache/application_1773803886076_15646/container_e268_1773803886076_15646_01_000003/krb5.keytab
2026-03-18 17:49:23,717 INFO org.apache.hadoop.security.UserGroupInformation [] - Login successful for user hiidoagent using keytab file /data2/hadoop/yarn/local/usercache/hiidoagent/appcache/application_1773803886076_15646/container_e268_1773803886076_15646_01_000003/krb5.keytab
2026-03-18 17:49:23,717 INFO org.apache.flink.runtime.security.token.hadoop.KerberosLoginProvider [] - Successfully logged into KDC
2026-03-18 17:49:23,719 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Starting TGT renewal task
2026-03-18 17:49:23,719 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - TGT renewal task started and reoccur in 60000 ms
2026-03-18 17:49:23,719 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Hadoop user set to [email protected] (auth:KERBEROS)
2026-03-18 17:49:23,720 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Kerberos security is enabled.
2026-03-18 17:49:23,720 INFO org.apache.flink.runtime.security.modules.HadoopModule [] - Kerberos credentials are valid.
2026-03-18 17:49:23,726 INFO org.apache.flink.runtime.security.modules.JaasModule [] - Jaas file will be created as /data1/hadoop/yarn/local/usercache/hiidoagent/appcache/application_1773803886076_15646/jaas-7581660068545285667.conf.
...
2026-03-18 17:49:25,228 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
2026-03-18 17:49:25,229 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Loading delegation token receivers
2026-03-18 17:49:25,232 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Delegation token receiver hadoopfs loaded and initialized
2026-03-18 17:49:25,233 INFO org.apache.flink.runtime.security.token.DelegationTokenReceiverRepository [] - Delegation token receiver hbase loaded and initialized {code}
Code:
[https://github.com/apache/flink/blob/6fc5c97ec3a89975ee44b1b084efc8fbc25c73ee/flink-yarn/src/main/java/org/apache/flink/yarn/YarnTaskExecutorRunner.java#L132]
Looking at the source code, there is no configuration option or conditional
logic at this point: the TM always logs in to the KDC with the keytab. This
behavior should be configurable instead of hard-coded.
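
To illustrate the requested behavior, here is a minimal sketch of a configurable decision. It is not Flink code: the class TmLoginDecision, the enum LoginMode, and the option name security.kerberos.taskmanager.skip-keytab-login are all hypothetical and do not exist in Flink 1.17.2. The only thing taken from this report is that the container environment exposes HADOOP_TOKEN_FILE_LOCATION.

```java
import java.util.Map;

/** Hypothetical sketch (not Flink code): decide whether a TM container needs a keytab login. */
public class TmLoginDecision {

    /** Hypothetical option name, used only for illustration. */
    public static final String SKIP_KEYTAB_LOGIN_KEY =
            "security.kerberos.taskmanager.skip-keytab-login";

    public enum LoginMode { DELEGATION_TOKENS_ONLY, KEYTAB_LOGIN }

    /**
     * Skips the keytab login only when the user opted in AND the container
     * environment points at a token file. Otherwise keeps today's behavior.
     */
    public static LoginMode decide(Map<String, String> env, Map<String, String> conf) {
        boolean skipRequested =
                Boolean.parseBoolean(conf.getOrDefault(SKIP_KEYTAB_LOGIN_KEY, "false"));
        String tokenFile = env.get("HADOOP_TOKEN_FILE_LOCATION");
        boolean hasTokenFile = tokenFile != null && !tokenFile.isEmpty();

        if (skipRequested && hasTokenFile) {
            return LoginMode.DELEGATION_TOKENS_ONLY;
        }
        return LoginMode.KEYTAB_LOGIN;
    }

    public static void main(String[] args) {
        // With an empty configuration the sketch falls back to the keytab login.
        LoginMode mode = decide(System.getenv(), Map.of());
        System.out.println("TM login mode: " + mode);
    }
}
```

Defaulting to the keytab login keeps existing deployments unchanged; only jobs that explicitly opt in would rely on the tokens issued by the AM alone.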
KDC
Number of concurrent KDC logins = number of Flink apps * number of containers
per app.
With a large number of short-lived Flink jobs, this puts heavy pressure on the
KDC. The KDC becomes severely sluggish, which affects the security and
stability of the whole cluster.
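
As a rough worked example of the formula above, the sketch below estimates the login load. The figures (500 apps, 20 containers per app, a 600 second start-up window) are made-up assumptions for illustration, not measurements from this cluster, and the class KdcLoadEstimate is hypothetical. It counts one keytab login per container and ignores any further ticket requests.

```java
/** Hypothetical back-of-the-envelope estimate of KDC logins caused by TM keytab logins. */
public class KdcLoadEstimate {

    /** Total keytab logins, assuming exactly one login per container. */
    public static long loginRequests(long apps, long containersPerApp) {
        return Math.multiplyExact(apps, containersPerApp);
    }

    /** Average login rate if all logins fall inside the given window. */
    public static double loginsPerSecond(long totalLogins, long windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        return (double) totalLogins / windowSeconds;
    }

    public static void main(String[] args) {
        long apps = 500;             // assumed number of short-lived Flink apps
        long containersPerApp = 20;  // assumed containers per app
        long windowSeconds = 600;    // assumed start-up window

        long total = loginRequests(apps, containersPerApp);
        System.out.println("Keytab logins: " + total);
        System.out.println("Average logins per second: " + loginsPerSecond(total, windowSeconds));
    }
}
```

Because the load scales with the product of apps and containers, skipping the login in the TMs would remove most of it under these assumptions, leaving roughly one login per app.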
--
This message was sent by Atlassian Jira
(v8.20.10#820010)