roczei opened a new pull request, #37679: URL: https://github.com/apache/spark/pull/37679
### What changes were proposed in this pull request?

This PR is a follow-up to SPARK-37731. The previous PR was closed by github-actions: https://github.com/apache/spark/pull/32364

My changes:

- Rebased and updated the previous PR to the latest master branch
- Deleted the `DEFAULT_DATABASE` static member from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala and refactored the code that depended on it

### Why are the changes needed?

If a user has no permissions on the Hive default database in Ranger, Spark fails with the following error:

```
22/08/26 18:36:21 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
```

The idea is to introduce a new configuration parameter that sets a different database name as the default database, one for which the user has sufficient permissions in Ranger. For example:

```
spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db
```

### Does this PR introduce _any_ user-facing change?

There is a new configuration parameter, as described above, but its default value is "default", as before.

### How was this patch tested?

1) With github action (all tests passed)

https://github.com/roczei/spark/actions/runs/2935626152

2) Manually tested with Ranger + Hive

Scenario a) hrt_10 does not have access to the default database in Hive:

```
[hrt_10@quasar-thbnqr-2 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:14:18 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:14:30 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered!
...
scala> spark.sql("use other")
22/08/26 18:18:47 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:18:48 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:18:48 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:48 INFO SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:18:50 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
  at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:179)
```

This is the expected behavior because it will use the "default" db name.

Scenario b) Use the "other" database where the hrt_10 user has proper permissions

```
[hrt_10@quasar-thbnqr-2 ~]$ spark3-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:27:03 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:27:14 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-15]: Attempted to request executors before the AM has registered!
...
scala> spark.sql("use other")
22/08/26 18:29:22 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:29:22 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:29:22 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:22 INFO SessionState: [main]: Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:29:24 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from employee").show()
+---+----+------+-----------+
|eid|name|salary|destination|
+---+----+------+-----------+
| 12| Ram|    10|     Szeged|
| 13| Joe|    20|   Debrecen|
+---+----+------+-----------+

scala>
```
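
As a rough illustration of the user-facing behavior described above (use the configured database name when it is set, otherwise fall back to "default"), here is a minimal Scala sketch. The `DefaultDatabaseSketch` object and the `resolveDefaultDatabase` helper are hypothetical names, and a plain `Map` stands in for the Spark configuration; only the configuration key and the "default" fallback come from this PR description. It is not the code from the patch.

```scala
object DefaultDatabaseSketch {
  // Configuration key as given in the PR description.
  val ConfKey: String = "spark.sql.catalog.spark_catalog.defaultDatabase"

  // Value used when the key is not set, per the PR description.
  val Fallback: String = "default"

  // Hypothetical helper: a plain Map stands in for the Spark configuration.
  // Returns the configured database name, or the fallback if the key is absent.
  def resolveDefaultDatabase(conf: Map[String, String]): String =
    conf.getOrElse(ConfKey, Fallback)

  def main(args: Array[String]): Unit = {
    // Key not set: behaves like scenario a) and uses "default".
    println(resolveDefaultDatabase(Map.empty))
    // Key set to "other": behaves like scenario b).
    println(resolveDefaultDatabase(Map(ConfKey -> "other")))
  }
}
```

Running `main` prints `default` and then `other`, mirroring the two manual test scenarios.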
