emcegom commented on issue #13805:
URL: https://github.com/apache/hudi/issues/13805#issuecomment-3240642536
Hi @rangareddy
We are facing a similar issue and would like to use a catalog-based
approach for Hudi table operations. Currently, we manage Hudi metadata through
the Hive Metastore (HMS). However, some of our production use cases require
cross-cluster data queries within a single Spark session, which makes
multi-catalog integration necessary.
For example, Iceberg supports multiple Hive Metastore catalogs with a
configuration like the following:
```java
// Iceberg multi-catalog example: two Hive Metastore catalogs in one Spark session
String anotherHiveMetastoreURI = "thrift://another-ip:another-port";
SparkConf sparkConf = new SparkConf()
    // Default session catalog, backed by the primary Hive Metastore
    .set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .set("spark.sql.catalog.spark_catalog.type", "hive")
    .set("spark.sql.catalog.spark_catalog.default-namespace", defaultDatabase)
    .set("spark.sql.catalog.spark_catalog.uri", hiveMetastoreURI)
    .set("spark.sql.catalog.spark_catalog.warehouse", warehouse)
    .set("spark.sql.catalog.spark_catalog.hadoop.fs.s3a.access.key", "<access.key>")
    .set("spark.sql.catalog.spark_catalog.hadoop.fs.s3a.secret.key", "<secret.key>")
    .set("spark.sql.catalog.spark_catalog.hadoop.fs.s3a.endpoint", "http://minio-ip-address:port")
    .set("spark.sql.catalog.spark_catalog.hadoop.metastore.catalog.default", defaultCatalogName)
    .set("spark.default.parallelism", "1")
    .set(METASTOREURIS.varname, hiveMetastoreURI) // i.e. "hive.metastore.uris"
    .set("metastore.catalog.default", defaultCatalogName)
    // Second catalog, pointing at another cluster's Hive Metastore
    .set("spark.sql.catalog." + anotherCatalogMappingName, "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".type", "hive")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".default-namespace", "default")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".uri", anotherHiveMetastoreURI)
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".warehouse", warehouse)
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".hadoop.fs.s3a.access.key", "<another.access.key>")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".hadoop.fs.s3a.secret.key", "<another.secret.key>")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".hadoop.fs.s3a.endpoint", "http://another-minio.ip-address:another.port")
    .set("spark.sql.catalog." + anotherCatalogMappingName + ".hadoop.metastore.catalog.default", "another_catalog");
```
Is there a similar way to achieve multi-catalog integration with Hudi on
Spark 3.3.1 + Hudi 0.15?
Or is there a recommended best practice for such cross-cluster scenarios?
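For context, this is the kind of configuration we are hoping exists on the Hudi side. To be clear, this is a hypothetical sketch: `org.apache.spark.sql.hudi.catalog.HoodieCatalog` is the catalog class Hudi ships for Spark SQL, but we have not found documentation confirming that it honors per-catalog `.uri` / `.warehouse` keys the way Iceberg's `SparkCatalog` does, and the catalog name `hudi_remote` is made up for illustration.

```java
import org.apache.spark.SparkConf;

public class HudiMultiCatalogSketch {
    public static void main(String[] args) {
        // Hypothetical Hudi analogue of the Iceberg example above.
        // HoodieCatalog itself is real (Hudi with Spark 3.2+), but the
        // per-catalog ".uri" and ".warehouse" keys below are assumptions --
        // we could not confirm that Hudi 0.15 supports them.
        SparkConf sparkConf = new SparkConf()
            .set("spark.sql.extensions",
                 "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
            // Session catalog on the local cluster's HMS
            .set("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
            // Hypothetical second catalog pointing at another cluster's HMS
            .set("spark.sql.catalog.hudi_remote",
                 "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
            .set("spark.sql.catalog.hudi_remote.uri",
                 "thrift://another-ip:another-port")      // assumed key
            .set("spark.sql.catalog.hudi_remote.warehouse",
                 "s3a://another-warehouse/");             // assumed key
    }
}
```

If HoodieCatalog does not accept these keys, is there another supported mechanism (e.g. per-query Hadoop configuration overrides) for reading Hudi tables registered in a second Hive Metastore?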
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]