kbendick commented on a change in pull request #2792:
URL: https://github.com/apache/iceberg/pull/2792#discussion_r671497047
##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkUtil.java
##########
@@ -170,4 +174,31 @@ public static boolean useTimestampWithoutZoneInNewTables(RuntimeConfig sessionCo
     return false;
   }
+
+  /**
+   * Pulls any catalog-specific overrides for the Hadoop conf from the current SparkSession,
+   * which can be set via spark.sql.catalog.$catalogName.hadoop.*
+   *
+   * The SparkCatalog allows Hadoop configurations to be overridden per catalog by setting
+   * them on the SQLConf. For example, the following adds the property "fs.default.name"
+   * with value "hdfs://hanksnamenode:8020" to the catalog's Hadoop configuration:
+   *
+   *   SparkSession.builder()
+   *     .config(s"spark.sql.catalog.$catalogName.hadoop.fs.default.name", "hdfs://hanksnamenode:8020")
+   *     .getOrCreate()
+   *
+   * @param spark the current Spark session
+   * @param catalogName name of the catalog to find overrides for
+   * @return the Hadoop Configuration that should be used for this catalog, with catalog-specific overrides applied
+   */
+  public static Configuration hadoopConfCatalogOverrides(SparkSession spark, String catalogName) {
+    // Find keys for the catalog intended to be Hadoop configurations
+    final String hadoopConfCatalogPrefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, "hadoop.");
Review comment:
> I think "hadoop." should be a private static final String.

Happy to update that.
> Still not a fan of using `hadoop.` instead of `override.`, as mentioned on the other issue, since we will need similar things for Hive configurations as well. This again will cause confusion for users who are using both Hive and Spark.
As for using `hadoop.` instead of `override.`, I do think this is the most natural approach for Spark users, as it aligns with the `spark.hadoop.*` convention that many Spark users already look for.
As soon as we determine a better way to make this more generic across catalogs, I'm happy to champion deprecating this approach, or to assist in any way I can with ensuring that both methods work for a reasonable transition period (whatever is decided).
Unfortunately, for now, I'm not sure we'll find consensus around `override.` that works across other catalogs. I'm happy to help explore other approaches in the source code.
As soon as we find something, I'm happy to help with implementing it,
reviewing it, and ensuring the messaging gets out to the community as best as
possible.
I would personally prefer to keep this version for Spark, as again it correlates nicely with `spark.hadoop.*`, which many Spark users naturally expect. So I would advocate for maintaining both methods, since Spark users already use this pattern to override their job-level Hadoop configurations.
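For context, the prefix-matching logic in the diff above can be illustrated with a standalone sketch. This is not the actual Iceberg implementation (which works against Spark's `RuntimeConfig` and Hadoop's `Configuration`); it uses plain Java maps to show only the idea of extracting per-catalog `hadoop.*` overrides from session config keys. The class name, method name, and catalog name `hank` are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class CatalogHadoopOverrides {
    private static final String SPARK_CATALOG_CONF_PREFIX = "spark.sql.catalog";
    private static final String HADOOP_CONF_PREFIX = "hadoop.";

    // Given all session conf entries, extract the Hadoop overrides for one
    // catalog by stripping the "spark.sql.catalog.<name>.hadoop." prefix.
    static Map<String, String> catalogHadoopOverrides(Map<String, String> sessionConf, String catalogName) {
        String prefix = String.format("%s.%s.%s", SPARK_CATALOG_CONF_PREFIX, catalogName, HADOOP_CONF_PREFIX);
        Map<String, String> overrides = new HashMap<>();
        for (Map.Entry<String, String> entry : sessionConf.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                // Keep only the Hadoop property name, e.g. "fs.default.name"
                overrides.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.sql.catalog.hank.hadoop.fs.default.name", "hdfs://hanksnamenode:8020");
        conf.put("spark.sql.catalog.other.hadoop.fs.default.name", "hdfs://other:8020");
        conf.put("spark.executor.memory", "4g");

        // Only the "hank" catalog's override survives, with the prefix stripped.
        System.out.println(catalogHadoopOverrides(conf, "hank"));
        // prints {fs.default.name=hdfs://hanksnamenode:8020}
    }
}
```

In the real method, the resulting overrides would then be applied on top of the session's base Hadoop `Configuration`, which is what lets each catalog point at a different namenode or filesystem.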
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]