hudi-agent commented on code in PR #18650:
URL: https://github.com/apache/hudi/pull/18650#discussion_r3463321966


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -1033,9 +1034,87 @@ object DataSourceOptionsHelper {
   private val log = LoggerFactory.getLogger(DataSourceOptionsHelper.getClass)
 
   // Prefix constants for config normalization
+  private val HOODIE_PREFIX = "hoodie."
   private val SPARK_HOODIE_PREFIX = "spark.hoodie."
   private val SPARK_PREFIX = "spark."
 
+  /**
+   * Collects `hoodie.*` and `spark.hoodie.*` configs from the SparkConf, 
normalizes the
+   * `spark.hoodie.*` keys to canonical `hoodie.*`, and merges with explicit 
DataFrame
+   * options. Explicit options win over SparkConf.
+   *
+   * This is the read-path entry point: reads have always picked up 
session-level `hoodie.*`
+   * confs (e.g. `hoodie.datasource.query.type`), so both prefixes are 
forwarded here.
+   * Do NOT use this for writes — see `collectSparkHoodieConfs` for why 
ambient `hoodie.*`
+   * confs must not be forwarded to the write path.
+   *
+   * Example (SparkConf has both prefixes set; explicit options override):
+   * {{{
+   *   SparkConf:  spark.hoodie.X = "a", hoodie.Y = "b"
+   *   optParams:  hoodie.X = "c"
+   *   result:     hoodie.X = "c"   // explicit wins over both prefixes
+   *               hoodie.Y = "b"
+   * }}}
+   */
+  def collectHoodieAndSparkHoodieConfs(sqlContext: SQLContext,
+                                       optParams: Map[String, String]): 
Map[String, String] =
+    collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = true)
+
+  /**
+   * Collects only `spark.hoodie.*` configs from the SparkConf, normalizes 
them to canonical
+   * `hoodie.*`, and merges with explicit DataFrame options. Explicit options 
win over SparkConf.
+   *
+   * This is the write-path entry point. It deliberately does NOT forward bare 
`hoodie.*`
+   * session confs: unlike reads, the DataFrame write path historically 
honored only the
+   * explicit `.option(...)` map, so injecting ambient `hoodie.*` session 
state (e.g. a
+   * session-level `hoodie.datasource.write.operation` or 
`hoodie.logfile.data.block.format`)
+   * would silently change every `df.write`. The bug this addresses 
(HUDI-#18649) is about
+   * `--conf spark.hoodie.X=Y` being dropped on writes, which only requires 
forwarding the
+   * `spark.hoodie.*` form.
+   *
+   * Example:
+   * {{{
+   *   SparkConf:  spark.hoodie.X = "a", hoodie.Y = "b"   // bare hoodie.Y is 
NOT forwarded
+   *   optParams:  hoodie.Z = "c"
+   *   result:     hoodie.X = "a"
+   *               hoodie.Z = "c"
+   * }}}
+   */
+  def collectSparkHoodieConfs(sqlContext: SQLContext,
+                              optParams: Map[String, String]): Map[String, 
String] =
+    collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = false)
+
+  private def collectConfsByPrefix(sqlContext: SQLContext,
+                                   optParams: Map[String, String],
+                                   includeHoodiePrefix: Boolean): Map[String, 
String] = {
+    val sparkConfs = sqlContext.getAllConfs.filter {

Review Comment:
   🤖 nit: `sparkConfs` reads like Spark infrastructure configs (e.g. 
`spark.executor.memory`), but this variable actually holds hoodie-relevant 
entries from the SQL session — including bare `hoodie.*` keys when 
`includeHoodiePrefix = true`. Something like `hoodieSessionConfs` or 
`filteredSessionConfs` might make that clearer.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -173,11 +170,20 @@ class DefaultSource extends RelationProvider
                               mode: SaveMode,
                               optParams: Map[String, String],
                               df: DataFrame): BaseRelation = {
+    // Pull `spark.hoodie.*` from SparkConf, normalize to canonical 
`hoodie.*`, and merge

Review Comment:
   🤖 nit: this 7-line block repeats almost verbatim what 
`collectSparkHoodieConfs`'s Javadoc already covers. A single line — e.g. `// 
write-path: forward spark.hoodie.* from SparkConf but not ambient hoodie.* 
session confs` — would be enough for a reader scanning this method.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -159,11 +156,18 @@ class DefaultSource extends RelationProvider
                               mode: SaveMode,
                               optParams: Map[String, String],
                               df: DataFrame): BaseRelation = {
+    // Pull `hoodie.*` and `spark.hoodie.*` from SparkConf, normalize 
`spark.hoodie.*` to
+    // canonical `hoodie.*`, and merge with explicit options (explicit options 
win). This
+    // mirrors what the read createRelation already does, so configs like
+    // `--conf spark.hoodie.datasource.hive_sync.use_spark_catalog=true` are 
honored on
+    // writes too. `HoodieSparkSqlWriter` and downstream callers see only 
canonical keys.
+    val effectiveOpts =
+      DataSourceOptionsHelper.collectHoodieAndSparkHoodieConfs(sqlContext, 
optParams)

Review Comment:
   🤖 Worth noting the read path already does exactly this — the filter in 
`DefaultSource.scala:110-111` uses the identical `key.startsWith("hoodie.") || 
key.startsWith("spark.hoodie.")` predicate, so plain `hoodie.*` SparkConf 
entries are already applied on reads today. Forwarding them on writes via the 
shared helper looks like the intended parity rather than a new behavior, though 
it'd be good for the author to confirm that plain `hoodie.*` (not just 
`spark.hoodie.*`) coverage is deliberate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to