cnauroth commented on code in PR #49814:
URL: https://github.com/apache/spark/pull/49814#discussion_r1946824851
##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##########
@@ -171,6 +172,11 @@ private[hive] class HiveClientImpl(
   private def newState(): SessionState = {
     val hiveConf = newHiveConf(sparkConf, hadoopConf, extraConfig,
       Some(initClassLoader))
     val state = new SessionState(hiveConf)
+    // When SessionState is initialized, the caller context is overridden by hive
+    // so we need to reset it back to the DRIVER
Review Comment:
The caller context is ultimately captured in thread-local storage and
propagated to child threads:
https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallerContext.java#L265
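The propagation works because Hadoop keeps the current `CallerContext` in an `InheritableThreadLocal`, which copies the parent thread's value into any thread it spawns. A minimal, self-contained sketch of that mechanism, using a stand-in thread-local rather than Hadoop's actual class (all names here are hypothetical):

```java
// Sketch of how a caller context held in an InheritableThreadLocal
// propagates to child threads; "current" stands in for Hadoop's holder.
public class CallerContextPropagation {
    private static final InheritableThreadLocal<String> current =
        new InheritableThreadLocal<>();

    // Set the context in the parent, spawn a child, and return what it sees.
    static String contextSeenByChild(String parentValue) throws InterruptedException {
        current.set(parentValue);
        final String[] seen = new String[1];
        // The child copies the parent's value at construction time.
        Thread child = new Thread(() -> seen[0] = current.get());
        child.start();
        child.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(contextSeenByChild("DRIVER")); // inherited by the child
    }
}
```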
Anything passing through Hive client initialization would override that with
a Hive session ID, hence the need for the Spark code to restore it to "DRIVER":
https://github.com/apache/hive/blob/rel/release-4.0.1/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java#L479
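The restore in the patch is essentially an override-then-reset shape: Hive's `SessionState` setup replaces the context as a side effect, and Spark forces it back to DRIVER afterwards. A hedged sketch of that shape with a stand-in thread-local instead of the real `CallerContext` (`initHiveClient` and the session ID string are made up for illustration):

```java
// Sketch of the override/restore pattern around Hive client initialization.
public class RestoreCallerContext {
    private static final InheritableThreadLocal<String> callerContext =
        new InheritableThreadLocal<>();

    // Stand-in for Hive client initialization, which overrides the caller
    // context with a Hive session ID as a side effect.
    static void initHiveClient() {
        callerContext.set("HIVE_SSN_ID:abc123");
    }

    // Run the initialization, then reset the context back to DRIVER,
    // mirroring what the Spark patch does after constructing SessionState.
    static void withDriverContextRestored(Runnable body) {
        try {
            body.run();
        } finally {
            callerContext.set("DRIVER");
        }
    }

    static String currentContext() {
        return callerContext.get();
    }

    public static void main(String[] args) {
        callerContext.set("DRIVER");
        withDriverContextRestored(RestoreCallerContext::initHiveClient);
        System.out.println(currentContext()); // back to DRIVER after the override
    }
}
```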
@sririshindra, have you tried running multiple queries on Iceberg tables in
the same Spark session? It might be interesting to see if Iceberg's Hive
client has overridden it at the end, but we wouldn't really see the effects
until a second or third query.
If we aren't seeing Iceberg's Hive client override it, then I think
potential explanations could be:
1. the restore here also happens after Iceberg's Hive client initialization,
2. Iceberg's client runs on a separate thread with separate thread-local
storage, or
3. Iceberg loads the client in an isolated classloader, so its
`CallerContext` is not really the same class as Spark's `CallerContext`.

I don't know the Iceberg code path well enough to say whether any of these
theories is true, though.
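The second theory is easy to check in isolation: an `InheritableThreadLocal` copies the value into a child thread at creation, but a later `set()` in the child never flows back to the parent, so an override on a separate thread would be invisible to the driver thread. A self-contained sketch (the session ID string is invented for illustration):

```java
// Demonstrates that a child thread's set() on an InheritableThreadLocal
// does not leak back to the parent thread's slot.
public class ThreadLocalIsolation {
    private static final InheritableThreadLocal<String> context =
        new InheritableThreadLocal<>();

    // The child overrides its inherited copy; the parent's value is untouched.
    static String parentValueAfterChildOverride() throws InterruptedException {
        context.set("DRIVER");
        Thread child = new Thread(() -> context.set("HIVE_SSN_ID:xyz"));
        child.start();
        child.join();
        return context.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parentValueAfterChildOverride()); // still DRIVER
    }
}
```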
From the perspective of HDFS/S3A auditing, what matters is the
`CallerContext` at the time of the `FileSystem` API calls. The testing so far
looks good for that.
If we find there are broader problems with Spark's Hive clients leaking
state, then an alternative solution might be to isolate
`org.apache.hadoop.ipc.CallerContext` via the Hive client
`IsolatedClientLoader`. Currently it treats all the Hadoop classes as shared.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]