kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671497047
##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
+
+  /**
+   * Pulls any catalog-specific overrides for the Hadoop conf from the current SparkSession,
+   * which can be set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows Hadoop configurations to be overridden per catalog by setting
+   * them on the SQLConf. For example, the following adds the property "fs.default.name"
+   * with value "hdfs://hanksnamenode:8020" to the catalog's Hadoop configuration:
+   *
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   *
+   * @param spark the current Spark session
+   * @param catalogName name of the catalog to find overrides for
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog-specific overrides applied
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be Hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");
Review comment:
> I think "hadoop." should be a private static final String.

Happy to update that.
> Still not a fan of using `hadoop.` instead of `override.`, as mentioned on the other issue, since we will need similar things for Hive configurations as well. This again will cause confusion for users who are using both Hive and Spark.
As for using `hadoop.` instead of `override.`, I do think this is the most natural approach for Spark users, as it aligns with the `spark.hadoop.*` convention that many Spark users already look for.
As soon as we determine a better way to make this more generic across catalogs, I'm happy to champion deprecating this approach, or to assist in any way I can with ensuring that both methods work for a reasonable transition period (whatever is decided).
Unfortunately, for now, I'm not sure we'll find consensus around `override.` that works across other catalogs. I'm happy to help explore other approaches in the source code.
As soon as we find something, I'm happy to help with implementing it,
reviewing it, and ensuring the messaging gets out to the community as best as
possible.
I would personally prefer to keep this version for Spark, as again it correlates nicely with `spark.hadoop.*`, which many Spark users naturally expect. So I would advocate for maintaining both methods, since Spark users already use this pattern to override their job-level Hadoop configurations.
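For context, the prefix-matching logic in the diff above can be illustrated with a standalone sketch. This is not the actual Iceberg implementation (which works against Spark's `RuntimeConfig` and Hadoop's `Configuration`); it uses plain Java maps to show only the idea of extracting per-catalog `hadoop.*` overrides from session config keys. The class name, method name, and catalog name `hank` are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class CatalogHadoopOverrides {
    private static final String SPARK_CATALOG_CONF_PREFIX = "spark.sql.catalog";
    private static final String HADOOP_CONF_PREFIX = "hadoop.";

    // Given all session conf entries, extract the Hadoop overrides for one
    // catalog by stripping the "spark.sql.catalog.<name>.hadoop." prefix.
    static Map<String, String> catalogHadoopOverrides(Map<String, String> sessionConf, String catalogName) {
        String prefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, HADOOP_CONF_PREFIX);
        Map<String, String> overrides = new HashMap<>();
        for (Map.Entry<String, String> entry : sessionConf.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                // Keep only the Hadoop property name, e.g. "fs.default.name"
                overrides.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.sql.catalog.hank.hadoop.fs.default.name", "hdfs://hanksnamenode:8020");
        conf.put("spark.sql.catalog.other.hadoop.fs.default.name", "hdfs://other:8020");
        conf.put("spark.executor.memory", "4g");

        // Only the "hank" catalog's override survives, with the prefix stripped.
        System.out.println(catalogHadoopOverrides(conf, "hank"));
        // prints {fs.default.name=hdfs://hanksnamenode:8020}
    }
}
```

In the real method, the resulting overrides would then be applied on top of the session's base Hadoop `Configuration`, which is what lets each catalog point at a different namenode or filesystem.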
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]