Re: [PR] [MINOR] Forward spark.hoodie.* SparkConf to write path (parity with read path) [hudi]

via GitHub Fri, 26 Jun 2026 19:59:32 -0700


hudi-agent commented on code in PR #18650:
URL: https://github.com/apache/hudi/pull/18650#discussion_r3485072995



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -1033,9 +1034,87 @@ object DataSourceOptionsHelper {
   private val log = LoggerFactory.getLogger(DataSourceOptionsHelper.getClass)
 
   // Prefix constants for config normalization
+  private val HOODIE_PREFIX = "hoodie."
   private val SPARK_HOODIE_PREFIX = "spark.hoodie."
   private val SPARK_PREFIX = "spark."
 
+  /**
+   * Collects `hoodie.*` and `spark.hoodie.*` configs from the SparkConf, 
normalizes the
+   * `spark.hoodie.*` keys to canonical `hoodie.*`, and merges with explicit 
DataFrame
+   * options. Explicit options win over SparkConf.
+   *
+   * This is the read-path entry point: reads have always picked up 
session-level `hoodie.*`
+   * confs (e.g. `hoodie.datasource.query.type`), so both prefixes are 
forwarded here.
+   * Do NOT use this for writes — see `collectSparkHoodieConfs` for why 
ambient `hoodie.*`
+   * confs must not be forwarded to the write path.
+   *
+   * Example (SparkConf has both prefixes set; explicit options override):
+   * {{{
+   *   SparkConf:  spark.hoodie.X = "a", hoodie.Y = "b"
+   *   optParams:  hoodie.X = "c"
+   *   result:     hoodie.X = "c"   // explicit wins over both prefixes
+   *               hoodie.Y = "b"
+   * }}}
+   */
+  def collectHoodieAndSparkHoodieConfs(sqlContext: SQLContext,
+                                       optParams: Map[String, String]): 
Map[String, String] =
+    collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = true)
+
+  /**
+   * Collects only `spark.hoodie.*` configs from the SparkConf, normalizes 
them to canonical
+   * `hoodie.*`, and merges with explicit DataFrame options. Explicit options 
win over SparkConf.
+   *
+   * This is the write-path entry point. It deliberately does NOT forward bare 
`hoodie.*`
+   * session confs: unlike reads, the DataFrame write path historically 
honored only the
+   * explicit `.option(...)` map, so injecting ambient `hoodie.*` session 
state (e.g. a
+   * session-level `hoodie.datasource.write.operation` or 
`hoodie.logfile.data.block.format`)
+   * would silently change every `df.write`. The bug this addresses 
(HUDI-#18649) is about
+   * `--conf spark.hoodie.X=Y` being dropped on writes, which only requires 
forwarding the
+   * `spark.hoodie.*` form.
+   *
+   * Example:
+   * {{{
+   *   SparkConf:  spark.hoodie.X = "a", hoodie.Y = "b"   // bare hoodie.Y is 
NOT forwarded
+   *   optParams:  hoodie.Z = "c"
+   *   result:     hoodie.X = "a"
+   *               hoodie.Z = "c"
+   * }}}
+   */
+  def collectSparkHoodieConfs(sqlContext: SQLContext,
+                              optParams: Map[String, String]): Map[String, 
String] =
+    collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = false)
+
+  private def collectConfsByPrefix(sqlContext: SQLContext,
+                                   optParams: Map[String, String],
+                                   includeHoodiePrefix: Boolean): Map[String, 
String] = {
+    val sparkConfs = sqlContext.getAllConfs.filter {

Review Comment:
   🤖 nit: `sparkConfs` is a bit misleading here — when `includeHoodiePrefix = 
true` the filtered result also contains plain `hoodie.*` keys that don't carry 
any Spark prefix. Something like `hoodieSessionConfs` or `filteredConfs` would 
better describe what's actually in the map (especially since the very next line 
passes it to `normalizeSparkHoodiePrefix`, which a reader would find odd for a 
variable named "sparkConfs").
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -159,11 +156,18 @@ class DefaultSource extends RelationProvider
                               mode: SaveMode,
                               optParams: Map[String, String],
                               df: DataFrame): BaseRelation = {
+    // Pull `hoodie.*` and `spark.hoodie.*` from SparkConf, normalize 
`spark.hoodie.*` to
+    // canonical `hoodie.*`, and merge with explicit options (explicit options 
win). This
+    // mirrors what the read createRelation already does, so configs like
+    // `--conf spark.hoodie.datasource.hive_sync.use_spark_catalog=true` are 
honored on
+    // writes too. `HoodieSparkSqlWriter` and downstream callers see only 
canonical keys.
+    val effectiveOpts =
+      DataSourceOptionsHelper.collectHoodieAndSparkHoodieConfs(sqlContext, 
optParams)

Review Comment:
   🤖 In the current revision this looks intentional and is now enforced by two 
separate helpers: the write path calls `collectSparkHoodieConfs` 
(`includeHoodiePrefix = false`), whose filter matches `spark.hoodie.*` only and 
explicitly drops bare `hoodie.*` — so an ambient session-level 
`hoodie.metadata.enable` won't leak into writes. The both-prefix 
`collectHoodieAndSparkHoodieConfs` is used solely on the read path, where 
forwarding bare `hoodie.*` matches the pre-existing behavior. So the asymmetry 
you're asking about is by design rather than a broadened write-path change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [MINOR] Forward spark.hoodie.* SparkConf to write path (parity with read path) [hudi]

Reply via email to