Kuldeep Kulkarni created HUDI-6868:
--------------------------------------
Summary: Hudi HiveSync doesn't support extracting passwords from
credential store
Key: HUDI-6868
URL: https://issues.apache.org/jira/browse/HUDI-6868
Project: Apache Hudi
Issue Type: Bug
Components: hive, hudi-utilities, spark
Reporter: Kuldeep Kulkarni
Attachments: pyspark_hudi_test.py
We have a customer use case that runs PySpark on [Dataproc
Serverless|https://cloud.google.com/dataproc-serverless/docs/overview] with
hudi-spark3-bundle. The PySpark job fails to sync the Hudi table with the HMS
DB (a remote CloudSQL instance) because it is unable to extract the password
from the credential store.
The same job works fine if we pass the metastore DB user password directly
instead of using the credential store.
Looking at the
[code|https://github.com/apache/hudi/blob/release-0.12.3/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfig.java]
for the HiveSync configs and at
[HiveSyncConfigHolder|https://github.com/apache/hudi/blob/73c2167566730a76a0650d488511253ebc66156f/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java#L44],
I don't see any option that detects a credential store and extracts
passwords from it, in the way that [this
code|https://github.com/apache/hive/blob/rel/release-2.3.9/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L482]
in the HMS ObjectStore does.
The [Hive Sync Config Document|https://hudi.apache.org/docs/syncing_metastore/]
also makes no mention of using a credential store.
To find the password through the Hadoop Credential Provider API, Hudi would
need to call
[`Configuration#getPassword(String)`|https://hadoop.apache.org/docs/r3.3.6/api/org/apache/hadoop/conf/Configuration.html#getPassword-java.lang.String-].
We don't see any call to "getPassword" anywhere in the Hudi codebase.
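To illustrate the difference between a plain config lookup and a getPassword-style lookup, here is a minimal pure-Python model of the lookup order as we understand it: credential providers are consulted first, then the clear-text config value unless "hadoop.security.credential.clear-text-fallback" is set to false. It does not call Hadoop or Hudi. The function name resolve_password, the dict standing in for the jceks file, and the value "secret-from-store" are all hypothetical.

```python
from typing import Mapping, Optional, Sequence

# Hadoop key that controls whether a clear-text config value may be used.
CLEAR_TEXT_FALLBACK_KEY = "hadoop.security.credential.clear-text-fallback"
PASSWORD_ALIAS = "javax.jdo.option.ConnectionPassword"


def resolve_password(
    conf: Mapping[str, str],
    providers: Sequence[Mapping[str, str]],
    alias: str,
) -> Optional[str]:
    """Hypothetical model: credential providers first, then clear text."""
    # 1. Ask each configured credential provider for the alias, in order.
    for provider in providers:
        if alias in provider:
            return provider[alias]
    # 2. Fall back to the plain config value unless fallback is disabled.
    if conf.get(CLEAR_TEXT_FALLBACK_KEY, "true").lower() == "true":
        return conf.get(alias)
    return None


# Hypothetical stand-in for the contents of metastore-pass-v2.jceks.
jceks_store = {PASSWORD_ALIAS: "secret-from-store"}
# Job config as in the failing command: user name set, no clear-text password.
conf = {"javax.jdo.option.ConnectionUserName": "hive"}

# A plain config read never sees the credential store.
plain_lookup = conf.get(PASSWORD_ALIAS)
# A getPassword-style lookup does.
provider_lookup = resolve_password(conf, [jceks_store], PASSWORD_ALIAS)
print(plain_lookup, provider_lookup)
```

If Hudi reads the password with a plain config get, it would behave like plain_lookup here and never consult the store. That would be consistent with the failure below, but it is an inference on our side, not something we have confirmed in the code.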
*Repro steps:*
Sample PySpark script - Attached.
Command that runs successfully with the metastore DB password:
{code:java}
gcloud dataproc batches submit --version 1.1 --container-image
gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark
gs://<gcs-bucket>/pyspark_hudi_test.py
--jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties
"spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.javax.jdo.option.ConnectionPassword=<hive-db-user-password>"
--deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*
{code}
Failing command (with credential store):
{code:java}
gcloud dataproc batches submit --version 1.1 --container-image
gcr.io/<container-repo>/new-custom-debian:v4 --region <region> pyspark
gs://<gcs-bucket>/pyspark_hudi_test.py
--jars="gs://<gcs-bucket>/hudi-spark3-bundle_2.12-0.12.3.jar" --properties
"spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:mysql://<cloud-sql-HMS-DB-IP>:3306/hive_metastore,spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver,spark.hadoop.javax.jdo.option.ConnectionUserName=hive,spark.hadoop.hadoop.security.credential.provider.path=jceks://gs@<gcs-bucket>/metastore-pass-v2.jceks"
--deps-bucket gs://<gcs-bucket> -- SPARK_EXTRA_CLASSPATH=/opt/spark/jars/*
{code}
Error:
{code:java}
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Commit 20230911042953444
successful!
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.inlineCompactionEnabled ?
false
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Compaction Scheduled is
Optional.empty
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Config.asyncClusteringEnabled ?
false
23/09/11 04:30:42 INFO HoodieSparkSqlWriter$: Clustering Scheduled is
Optional.empty
23/09/11 04:30:42 INFO HiveConf: Found configuration file null
[..]
23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieTableConfig: Loading table properties from
gs://<gcs-bucket>/.hoodie/hoodie.properties
23/09/11 04:30:42 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieTableMetaClient: Loading Active commit timeline
for gs://<gcs-bucket>/
23/09/11 04:30:42 INFO HoodieActiveTimeline: Loaded instants upto :
Option{val=[20230911042953444__commit__COMPLETED]}
23/09/11 04:30:43 INFO HiveMetaStore: 0: Opening raw store with implementation
class:org.apache.hadoop.hive.metastore.ObjectStore
23/09/11 04:30:43 INFO ObjectStore: ObjectStore, initialize called
23/09/11 04:30:44 INFO Persistence: Property datanucleus.cache.level2 unknown -
will be ignored
Mon Sep 11 04:30:44 UTC 2023 WARN: Establishing SSL connection without server's
identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+
and 5.7.6+ requirements SSL connection must be established by default if
explicit option isn't set. For compliance with existing applications not using
SSL the verifyServerCertificate property is set to 'false'. You need either to
explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide
truststore for server certificate verification.
[..]
Unable to open a test connection to the given database. JDBC url =
jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive.
Terminating connection pool (set lazyInit to true if you expect to start your
database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>'
(using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3933)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3869)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:864)
at
com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1707)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1217)
[..]
------
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test
connection to the given database. JDBC url =
jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore, username = hive.
Terminating connection pool (set lazyInit to true if you expect to start your
database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>'
(using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
[..]
Caused by: java.sql.SQLException: Unable to open a test connection to the given
database. JDBC url = jdbc:mysql://<cloud-sql-HMS-db-ip>:3306/hive_metastore,
username = hive. Terminating connection pool (set lazyInit to true if you
expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'hive'@'<cloud-sql-HMS-db-ip>'
(using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:965)
{code}
*Note* - metastore-pass-v2.jceks in the above example contains the value of
"javax.jdo.option.ConnectionPassword", and there is no issue with it. The same
credential store works fine for other PySpark jobs (without Hudi, of course).
We also tried "hudi-spark3-bundle_2.12-0.13.1.jar"; it did not help.