hudi-agent commented on code in PR #18650:
URL: https://github.com/apache/hudi/pull/18650#discussion_r3463321966
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -1033,9 +1034,87 @@ object DataSourceOptionsHelper {
private val log = LoggerFactory.getLogger(DataSourceOptionsHelper.getClass)
// Prefix constants for config normalization
+ private val HOODIE_PREFIX = "hoodie."
private val SPARK_HOODIE_PREFIX = "spark.hoodie."
private val SPARK_PREFIX = "spark."
+ /**
+ * Collects `hoodie.*` and `spark.hoodie.*` configs from the SparkConf,
normalizes the
+ * `spark.hoodie.*` keys to canonical `hoodie.*`, and merges with explicit
DataFrame
+ * options. Explicit options win over SparkConf.
+ *
+ * This is the read-path entry point: reads have always picked up
session-level `hoodie.*`
+ * confs (e.g. `hoodie.datasource.query.type`), so both prefixes are
forwarded here.
+ * Do NOT use this for writes — see `collectSparkHoodieConfs` for why
ambient `hoodie.*`
+ * confs must not be forwarded to the write path.
+ *
+ * Example (SparkConf has both prefixes set; explicit options override):
+ * {{{
+ * SparkConf: spark.hoodie.X = "a", hoodie.Y = "b"
+ * optParams: hoodie.X = "c"
+ * result: hoodie.X = "c" // explicit wins over both prefixes
+ * hoodie.Y = "b"
+ * }}}
+ */
+ def collectHoodieAndSparkHoodieConfs(sqlContext: SQLContext,
+ optParams: Map[String, String]):
Map[String, String] =
+ collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = true)
+
+ /**
+ * Collects only `spark.hoodie.*` configs from the SparkConf, normalizes
them to canonical
+ * `hoodie.*`, and merges with explicit DataFrame options. Explicit options
win over SparkConf.
+ *
+ * This is the write-path entry point. It deliberately does NOT forward bare
`hoodie.*`
+ * session confs: unlike reads, the DataFrame write path historically
honored only the
+ * explicit `.option(...)` map, so injecting ambient `hoodie.*` session
state (e.g. a
+ * session-level `hoodie.datasource.write.operation` or
`hoodie.logfile.data.block.format`)
+ * would silently change every `df.write`. The bug this addresses
(HUDI-#18649) is about
+ * `--conf spark.hoodie.X=Y` being dropped on writes, which only requires
forwarding the
+ * `spark.hoodie.*` form.
+ *
+ * Example:
+ * {{{
+ * SparkConf: spark.hoodie.X = "a", hoodie.Y = "b" // bare hoodie.Y is
NOT forwarded
+ * optParams: hoodie.Z = "c"
+ * result: hoodie.X = "a"
+ * hoodie.Z = "c"
+ * }}}
+ */
+ def collectSparkHoodieConfs(sqlContext: SQLContext,
+ optParams: Map[String, String]): Map[String,
String] =
+ collectConfsByPrefix(sqlContext, optParams, includeHoodiePrefix = false)
+
+ private def collectConfsByPrefix(sqlContext: SQLContext,
+ optParams: Map[String, String],
+ includeHoodiePrefix: Boolean): Map[String,
String] = {
+ val sparkConfs = sqlContext.getAllConfs.filter {
Review Comment:
🤖 nit: `sparkConfs` reads like Spark infrastructure configs (e.g.
`spark.executor.memory`), but this variable actually holds hoodie-relevant
entries from the SQL session — including bare `hoodie.*` keys when
`includeHoodiePrefix = true`. Something like `hoodieSessionConfs` or
`filteredSessionConfs` might make that clearer.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -173,11 +170,20 @@ class DefaultSource extends RelationProvider
mode: SaveMode,
optParams: Map[String, String],
df: DataFrame): BaseRelation = {
+ // Pull `spark.hoodie.*` from SparkConf, normalize to canonical
`hoodie.*`, and merge
Review Comment:
🤖 nit: this 7-line block repeats almost verbatim what
`collectSparkHoodieConfs`'s Javadoc already covers. A single line — e.g. `//
write-path: forward spark.hoodie.* from SparkConf but not ambient hoodie.*
session confs` — would be enough for a reader scanning this method.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -159,11 +156,18 @@ class DefaultSource extends RelationProvider
mode: SaveMode,
optParams: Map[String, String],
df: DataFrame): BaseRelation = {
+ // Pull `hoodie.*` and `spark.hoodie.*` from SparkConf, normalize
`spark.hoodie.*` to
+ // canonical `hoodie.*`, and merge with explicit options (explicit options
win). This
+ // mirrors what the read createRelation already does, so configs like
+ // `--conf spark.hoodie.datasource.hive_sync.use_spark_catalog=true` are
honored on
+ // writes too. `HoodieSparkSqlWriter` and downstream callers see only
canonical keys.
+ val effectiveOpts =
+ DataSourceOptionsHelper.collectHoodieAndSparkHoodieConfs(sqlContext,
optParams)
Review Comment:
🤖 Worth noting the read path already does exactly this — the filter in
`DefaultSource.scala:110-111` uses the identical `key.startsWith("hoodie.") ||
key.startsWith("spark.hoodie.")` predicate, so plain `hoodie.*` SparkConf
entries are already applied on reads today. Forwarding them on writes via the
shared helper looks like the intended parity rather than a new behavior, though
it'd be good for the author to confirm that plain `hoodie.*` (not just
`spark.hoodie.*`) coverage is deliberate.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]