[
https://issues.apache.org/jira/browse/STORM-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076694#comment-17076694
]
Aaron Gresch commented on STORM-3606:
-------------------------------------
Users will see workers restart due to an NPE if they upload credentials before
the TGT renewal thread runs the kinit -R command.
{code:java}
2020-04-01 14:36:53.005 o.a.s.u.Utils TGT Renewer for XXX [ERROR] Received error in thread TGT Renewer for XXX.. terminating server...
java.lang.Error: java.lang.NullPointerException
    at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:694) ~[storm-client-2.2.0.y.jar:2.2.0.y]
    at org.apache.storm.utils.Utils.handleUncaughtException(Utils.java:673) ~[storm-client-2.2.0.y.jar:2.2.0.y]
    at org.apache.storm.utils.Utils.lambda$createDefaultUncaughtExceptionHandler$2(Utils.java:1055) ~[storm-client-2.2.0.y.jar:2.2.0.y]
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java)
    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java)
    at java.lang.Thread.dispatchUncaughtException(Thread.java)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:1031) ~[stormjar.jar:?]
    at java.lang.Thread.run(Thread.java)
2020-04-01 14:36:53.018 o.a.s.u.Utils Thread-23 [INFO] Halting after 3 seconds
2020-04-01 14:36:53.019 o.a.s.d.w.Worker Thread-24 [INFO] Shutting down worker XXX
{code}
Sequence:
1) Hadoop thread grabs the initial TGT:
https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L978
2) Hadoop then sleeps:
https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L992
3) kinit -R runs:
https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L994
The kinit fails and throws an IOException. Execution then reaches this line,
which accesses the original TGT:
https://github.com/apache/hadoop/blob/branch-2.9/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1014
But because the credentials upload cleared the credentials, that access throws
the NPE, which in turn restarts the worker.
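The sequence above can be modelled in a few lines of plain Java. This is only a
rough sketch under stated assumptions, not the Hadoop or Storm code: the class
TgtRenewalRaceSketch, the Ticket type, CURRENT_TGT, renew() and renewOnce() are
all hypothetical names made up for illustration. The "guarded" flag shows one
possible defensive shape and is not meant as the fix for this issue.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

public class TgtRenewalRaceSketch {

    /** Hypothetical stand-in for a Kerberos ticket; not a Hadoop or Storm class. */
    public static final class Ticket {
        private final long endTimeMillis;

        public Ticket(long endTimeMillis) {
            this.endTimeMillis = endTimeMillis;
        }

        public long getEndTime() {
            return endTimeMillis;
        }
    }

    /** Stand-in for the TGT held in the worker's in-memory credentials. */
    public static final AtomicReference<Ticket> CURRENT_TGT = new AtomicReference<>();

    /** Stand-in for running "kinit -R"; here it always fails with an IOException. */
    static void renew() throws IOException {
        throw new IOException("kinit: credentials cache file not found");
    }

    /** One pass of a simplified renewal loop. */
    public static String renewOnce(boolean guarded) {
        try {
            renew();
            return "renewed";
        } catch (IOException e) {
            Ticket tgt = CURRENT_TGT.get();
            if (guarded && tgt == null) {
                // Illustrative guard only: bail out instead of dereferencing null.
                return "skipped: no TGT";
            }
            // Unguarded path: throws NullPointerException if the TGT was cleared.
            return "retry before " + tgt.getEndTime();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CURRENT_TGT.set(new Ticket(1000L)); // step 1: the initial TGT is present
        CURRENT_TGT.set(null);              // a credentials upload clears it
        Thread renewer = new Thread(() -> renewOnce(false), "TGT Renewer (sketch)");
        // Storm's handler halts the worker; this one only reports the error.
        renewer.setUncaughtExceptionHandler((t, e) ->
            System.out.println(t.getName() + " died with " + e));
        renewer.start();
        renewer.join();
    }
}
```

Running main lets the NullPointerException escape the renewer thread and reach
the uncaught exception handler, which is the same path that ends in the worker
halting in the log above.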
> AutoTGT shouldn't invoke TGT renewal thread (from
> UserGroupInformation.loginUserFromSubject)
> --------------------------------------------------------------------------------------------
>
> Key: STORM-3606
> URL: https://issues.apache.org/jira/browse/STORM-3606
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.2.3, 2.1.0
> Reporter: Ethan Li
> Assignee: Aaron Gresch
> Priority: Minor
>
> When hadoop security is enabled,
> https://github.com/apache/storm/blob/master/storm-client/src/jvm/org/apache/storm/security/auth/kerberos/AutoTGT.java#L199-L209
> AutoTGT will invoke "loginUserFromSubject", and it will spawn a TGT renewal
> thread ("TGT Renewer for <username>").
> https://github.com/apache/hadoop/blob/branch-2.8.5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L928-L957
> which will eventually invoke system command "kinit -R", and then fail with
> the exception
> {code:java}
> org.apache.hadoop.util.Shell$ExitCodeException: kinit: Credentials cache file '/tmp/krb5cc_xxx' not found while renewing credentials
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:1004) ~[stormjar.jar:?]
>     at org.apache.hadoop.util.Shell.run(Shell.java:898) ~[stormjar.jar:?]
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213) ~[stormjar.jar:?]
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1307) ~[stormjar.jar:?]
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:1289) ~[stormjar.jar:?]
>     at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:1011) [stormjar.jar:?]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
> {code}
> "kinit" will never work from a worker process because Storm doesn't keep the
> TGT in a local cache. Instead, the TGT is saved in ZooKeeper and in the memory
> of the worker process.
> This exception is confusing but not harmful to topologies, and the TGT
> renewal thread will eventually abort.
> It would be better to find a real solution, but for now we can document what
> might happen in the AutoTGT code.
> To be clear, we still need loginUserFromSubject or something like it, but we
> don't want to spawn the TGT renewal thread. This was found with hadoop-2.8.5.
> Other versions behave similarly, but this may change in future releases.