cnauroth commented on code in PR #49814:
URL: https://github.com/apache/spark/pull/49814#discussion_r1946824851


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:
##########
@@ -171,6 +172,11 @@ private[hive] class HiveClientImpl(
   private def newState(): SessionState = {
     val hiveConf = newHiveConf(sparkConf, hadoopConf, extraConfig, Some(initClassLoader))
     val state = new SessionState(hiveConf)
+    // When SessionState is initialized, the caller context is overridden by hive
+    // so we need to reset it back to the DRIVER

Review Comment:
   The caller context is ultimately captured in thread-local storage and 
propagated to child threads:
   
   
https://github.com/apache/hadoop/blob/rel/release-3.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/CallerContext.java#L265
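   To illustrate the mechanism, here is a standalone sketch of that propagation behavior. A plain `String` stands in for the real `org.apache.hadoop.ipc.CallerContext` (which internally uses an inheritable thread-local, per the link above), so this runs without any Hadoop dependency; the class and method names are made up for the demo:

```java
// Toy model of Hadoop's caller-context propagation: the current context
// lives in an InheritableThreadLocal, so a child thread captures the
// parent's value at thread-creation time.
public class CallerContextPropagation {
    private static final InheritableThreadLocal<String> CURRENT =
            new InheritableThreadLocal<>();

    // Spawn a child thread and report what context it inherited.
    static String observeFromChildThread() throws InterruptedException {
        final String[] seen = new String[1];
        Thread child = new Thread(() -> seen[0] = CURRENT.get());
        child.start();
        child.join();
        return seen[0];
    }

    static String[] runDemo() throws InterruptedException {
        CURRENT.set("DRIVER");
        // A child created now inherits the value the parent holds right now.
        String first = observeFromChildThread();

        // If something (e.g. Hive session init) overwrites it first,
        // later children inherit the overwritten value instead.
        CURRENT.set("HIVE_SSN_ID:abc123");
        String second = observeFromChildThread();
        return new String[] { first, second };
    }

    public static void main(String[] args) throws InterruptedException {
        String[] seen = runDemo();
        System.out.println(seen[0]); // DRIVER
        System.out.println(seen[1]); // HIVE_SSN_ID:abc123
    }
}
```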
   
   Anything passing through Hive client initialization would override that with 
a Hive session ID, hence the need for the Spark code to restore it to "DRIVER":
   
   
https://github.com/apache/hive/blob/rel/release-4.0.1/shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java#L479
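   The shape of the fix in the diff above can be sketched like this. This is a hedged illustration, not the actual Spark patch: `initThatMayOverrideContext` is a hypothetical stand-in for `new SessionState(hiveConf)`, and the thread-local again stands in for `org.apache.hadoop.ipc.CallerContext`:

```java
// Sketch: set the driver context, run initialization code that clobbers it,
// then reset it back to DRIVER afterwards.
public class RestoreContext {
    private static final InheritableThreadLocal<String> CURRENT =
            new InheritableThreadLocal<>();

    // Simulates Hive's shim overwriting the context with a session id.
    static void initThatMayOverrideContext() {
        CURRENT.set("HIVE_SSN_ID:xyz");
    }

    static String newStateLikeThePatch() {
        CURRENT.set("DRIVER");          // driver-side caller context
        initThatMayOverrideContext();   // SessionState init clobbers it...
        CURRENT.set("DRIVER");          // ...so reset it back to DRIVER
        return CURRENT.get();
    }

    public static void main(String[] args) {
        System.out.println(newStateLikeThePatch()); // DRIVER
    }
}
```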
   
   @sririshindra , have you tried running multiple queries on Iceberg 
tables in the same Spark session? It might be interesting to see whether 
Iceberg's Hive client has overridden it by the end, since we wouldn't really 
see the effects until a second or third query.
   
   If we aren't seeing Iceberg's Hive client override it, then I think 
potential explanations could be:
   
   1. The restore here also runs after Iceberg's Hive client initialization.
   2. Iceberg's client runs on a separate thread with its own thread-local 
storage.
   3. Iceberg loads the client in an isolated classloader, so its 
`CallerContext` is not really the same class as Spark's `CallerContext`.
   
   I don't know the Iceberg code path well enough to say which, if any, of 
these theories is true, though.
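   Theory 2 in particular is easy to demonstrate in isolation: `InheritableThreadLocal` values are captured only at thread *creation* time, so a worker thread that already exists before the override never observes it. The `String` value is again a stand-in for the real `CallerContext`:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A pre-existing pool thread keeps the context it inherited at creation,
// regardless of later overrides on the submitting thread.
public class SeparateThreadStorage {
    private static final InheritableThreadLocal<String> CURRENT =
            new InheritableThreadLocal<>();

    static String run() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Force the single worker thread to be created *before* the
            // override: it inherits what this thread holds right now (null).
            pool.submit(() -> { }).get();
            CURRENT.set("HIVE_SSN_ID:abc"); // override happens afterwards
            Callable<String> probe = CURRENT::get;
            return pool.submit(probe).get(); // worker still sees the old value
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // null: the existing worker never inherited
    }
}
```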
   
   From the perspective of HDFS/S3A auditing, what matters is the 
`CallerContext` at the time of the `FileSystem` API calls. The testing so far 
looks good for that.
   
   If we find there are broader problems with Spark's Hive clients leaking 
state, then an alternative solution might be to isolate 
`org.apache.hadoop.ipc.CallerContext` via the Hive client 
`IsolatedClientLoader`, which currently treats all of the Hadoop classes as 
shared.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

