[GitHub] [spark] roczei opened a new pull request, #37679: [SPARK-35242][FOLLOWUP][Ranger][Hive][default db] Spark should not rely on the 'default' hive database.

GitBox Fri, 26 Aug 2022 12:41:55 -0700


roczei opened a new pull request, #37679:
URL: https://github.com/apache/spark/pull/37679


   ### What changes were proposed in this pull request?
   
   This PR is a follow-up to SPARK-35242. The previous PR was closed by 
github-actions: https://github.com/apache/spark/pull/32364
   
   My changes:
   
   - Rebased / updated the previous PR to the latest master branch version
   - Deleted the DEFAULT_DATABASE static member from 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 and refactored the code that referenced it
   
   ### Why are the changes needed?
   
   If a user has no permissions on the Hive default database in Ranger, 
Spark fails with the following error:
   
   ```
   22/08/26 18:36:21 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
     at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
     at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
     at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
   ```
   The idea is to introduce a new configuration parameter that sets a 
different name for the default database, one on which the user has 
sufficient permissions in Ranger.
   
   For example:
   
   ```
   spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   There will be a new configuration parameter, as mentioned above, but its 
default value is "default", so the behavior is unchanged unless it is set.
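   
   The fallback described here can be pictured with a minimal Scala sketch. 
   It is not the code from this PR: the object DefaultDatabaseSketch, the 
   helper resolveDefaultDatabase and the plain Map standing in for the Spark 
   configuration are hypothetical. Only the key 
   spark.sql.catalog.spark_catalog.defaultDatabase and the fallback value 
   "default" come from the description.
   
   ```scala
   object DefaultDatabaseSketch {
     // Configuration key named in this PR description.
     val DefaultDatabaseKey = "spark.sql.catalog.spark_catalog.defaultDatabase"

     // Hypothetical helper: returns the configured database name,
     // or "default" when the key is not set.
     def resolveDefaultDatabase(conf: Map[String, String]): String =
       conf.getOrElse(DefaultDatabaseKey, "default")
   }
   ```
   
   With an empty map the helper returns "default"; with the key set to 
   other_db it returns other_db, which mirrors the two cases above.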
   
   
   ### How was this patch tested?
   
   1) With GitHub Actions (all tests passed)
   
   https://github.com/roczei/spark/actions/runs/2935626152
   
   2) Manually tested with Ranger + Hive
   
   Scenario a) hrt_10 does not have access to the default database in Hive: 
   
   
   ```
   [hrt_10@quasar-thbnqr-2 ~]$ spark-shell
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   22/08/26 18:14:18 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
   22/08/26 18:14:30 WARN  cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered!
   
   
   ...
   
   scala> spark.sql("use other")
   22/08/26 18:18:47 INFO  conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
   22/08/26 18:18:48 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
   22/08/26 18:18:48 WARN  client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
   Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
   22/08/26 18:18:48 INFO  SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
   22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
   22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
   22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
   22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
   22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Connected to metastore.
   22/08/26 18:18:50 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
     at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
     at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
     at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
     at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:179)
   ```
   
   This is the expected behavior: without the new option, Spark still uses "default" as the default database name, and hrt_10 has no access to it.
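   
   Why a statement that never names "default" still fails can be sketched as 
   follows. The stack trace shows the lazily initialized external catalog in 
   SharedState calling databaseExists; the sketch imitates that ordering with 
   hypothetical names (FakeCatalog, SharedStateSketch, allowed) and an 
   in-memory permission set. It is an illustration under those assumptions, 
   not Spark's implementation.
   
   ```scala
   // Hypothetical stand-in for a metastore client guarded by Ranger policies.
   final class FakeCatalog(allowed: Set[String]) {
     def databaseExists(db: String): Boolean = {
       if (!allowed.contains(db))
         throw new SecurityException(s"Permission denied: no [USE] privilege on [$db]")
       true
     }
   }

   // Hypothetical stand-in for the shared session state.
   final class SharedStateSketch(catalog: FakeCatalog, defaultDb: String) {
     // The default-database check runs on first access,
     // before the user's own statement is evaluated.
     lazy val externalCatalog: FakeCatalog = {
       catalog.databaseExists(defaultDb)
       catalog
     }

     // Models a "use <db>" statement going through the catalog.
     def use(db: String): Boolean = externalCatalog.databaseExists(db)
   }
   ```
   
   With allowed = Set("other"), calling use("other") throws when defaultDb is 
   "default" and succeeds when defaultDb is "other", matching the two 
   scenarios in this section.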
   
   Scenario b) Use the "other" database, on which the hrt_10 user has proper 
permissions:
   
   ```
   [hrt_10@quasar-thbnqr-2 ~]$ spark3-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   22/08/26 18:27:03 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
   22/08/26 18:27:14 WARN  cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-15]: Attempted to request executors before the AM has registered!
   
   ...
   
   scala> spark.sql("use other")
   22/08/26 18:29:22 INFO  conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
   22/08/26 18:29:22 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
   22/08/26 18:29:22 WARN  client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
   Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
   22/08/26 18:29:22 INFO  SessionState: [main]: Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
   22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
   22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
   22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
   22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
   22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Connected to metastore.
   22/08/26 18:29:24 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient [email protected] (auth:KERBEROS) retries=1 delay=1 lifetime=0
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("select * from employee").show()
   +---+----+------+-----------+
   |eid|name|salary|destination|
   +---+----+------+-----------+
   | 12| Ram|    10|     Szeged|
   | 13| Joe|    20|   Debrecen|
   +---+----+------+-----------+
   
   
   scala>
   ```


