hudi-agent commented on code in PR #18650:
URL: https://github.com/apache/hudi/pull/18650#discussion_r3485072995
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -1033,9 +1034,87 @@ object DataSourceOptionsHelper {
private val log = LoggerFactory.getLogger(DataSourceOptionsHelper.getClass)
// Prefix constants for config normalization
+ private val HOODIE_PREFIX = "hoodie."
private val SPARK_HOODIE_PREFIX = "spark.hoodie."
private val SPARK_PREFIX = "spark."
+ /**
+ * Collects `hoodie.*` and `spark.hoodie.*` configs from the SparkConf,
normalizes the
+ * `spark.hoodie.*` keys to canonical `hoodie.*`, and merges with explicit
DataFrame
+ * options. Explicit options win over SparkConf.
+ *
+ * This is the read-path entry point: reads have always picked up
session-level `hoodie.*`
+ * confs (e.g. `hoodie.datasource.query.type`), so both prefixes are
forwarded here.
+ * Do NOT use this for writes — see `collectSparkHoodieConfs` for why
ambient `hoodie.*`
+ * confs must not be forwarded to the write path.
+ *
+ * Example (SparkConf has both prefixes set; explicit options override):
+ * {{{
+ * SparkConf: spark.hoodie.X = "a", hoodie.Y = "b"
+ * optParams: hoodie.X = "c"
+ * result: hoodie.X = "c" // explicit wins over both prefixes
+ * hoodie.Y = "b"
+ * }}}
+ */
+ def collectHoodieAndSparkHoodieConfs(sqlContext: SQLContext,
+ optParams: Map[String, String]):
Map[String, String] =
+ collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = true)
+
+ /**
+ * Collects only `spark.hoodie.*` configs from the SparkConf, normalizes
them to canonical
+ * `hoodie.*`, and merges with explicit DataFrame options. Explicit options
win over SparkConf.
+ *
+ * This is the write-path entry point. It deliberately does NOT forward bare
`hoodie.*`
+ * session confs: unlike reads, the DataFrame write path historically
honored only the
+ * explicit `.option(...)` map, so injecting ambient `hoodie.*` session
state (e.g. a
+ * session-level `hoodie.datasource.write.operation` or
`hoodie.logfile.data.block.format`)
+ * would silently change every `df.write`. The bug this addresses
(HUDI-#18649) is about
+ * `--conf spark.hoodie.X=Y` being dropped on writes, which only requires
forwarding the
+ * `spark.hoodie.*` form.
+ *
+ * Example:
+ * {{{
+ * SparkConf: spark.hoodie.X = "a", hoodie.Y = "b" // bare hoodie.Y is
NOT forwarded
+ * optParams: hoodie.Z = "c"
+ * result: hoodie.X = "a"
+ * hoodie.Z = "c"
+ * }}}
+ */
+ def collectSparkHoodieConfs(sqlContext: SQLContext,
+ optParams: Map[String, String]): Map[String,
String] =
+ collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = false)
+
+ private def collectConfsByPrefix(sqlContext: SQLContext,
+ optParams: Map[String, String],
+ includeHoodiePrefix: Boolean): Map[String,
String] = {
+ val sparkConfs = sqlContext.getAllConfs.filter {
Review Comment:
🤖 nit: `sparkConfs` is a bit misleading here — when `includeHoodiePrefix =
true` the filtered result also contains plain `hoodie.*` keys that don't carry
any Spark prefix. Something like `hoodieSessionConfs` or `filteredConfs` would
better describe what's actually in the map (especially since the very next line
passes it to `normalizeSparkHoodiePrefix`, which a reader would find odd for a
variable named "sparkConfs").
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -159,11 +156,18 @@ class DefaultSource extends RelationProvider
mode: SaveMode,
optParams: Map[String, String],
df: DataFrame): BaseRelation = {
+ // Pull `hoodie.*` and `spark.hoodie.*` from SparkConf, normalize
`spark.hoodie.*` to
+ // canonical `hoodie.*`, and merge with explicit options (explicit options
win). This
+ // mirrors what the read createRelation already does, so configs like
+ // `--conf spark.hoodie.datasource.hive_sync.use_spark_catalog=true` are
honored on
+ // writes too. `HoodieSparkSqlWriter` and downstream callers see only
canonical keys.
+ val effectiveOpts =
+ DataSourceOptionsHelper.collectHoodieAndSparkHoodieConfs(sqlContext,
optParams)
Review Comment:
🤖 In the current revision this looks intentional and is now enforced by two
separate helpers: the write path calls `collectSparkHoodieConfs`
(`includeHoodiePrefix = false`), whose filter matches `spark.hoodie.*` only and
explicitly drops bare `hoodie.*` — so an ambient session-level
`hoodie.metadata.enable` won't leak into writes. The both-prefix
`collectHoodieAndSparkHoodieConfs` is used solely on the read path, where
forwarding bare `hoodie.*` matches the pre-existing behavior. So the asymmetry
you're asking about is by design rather than a broadened write-path change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]