yihua commented on code in PR #18437:
URL: https://github.com/apache/hudi/pull/18437#discussion_r3035019993
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ArchiveCommitsProcedure.scala:
##########
@@ -57,6 +61,15 @@ class ArchiveCommitsProcedure extends BaseProcedure
val maxCommits = getArgValueOrDefault(args,
PARAMETERS(3)).get.asInstanceOf[Int]
val retainCommits = getArgValueOrDefault(args,
PARAMETERS(4)).get.asInstanceOf[Int]
val enableMetadata = getArgValueOrDefault(args,
PARAMETERS(5)).get.asInstanceOf[Boolean]
+ val options = getArgValueOrDefault(args, PARAMETERS(6))
+ var confs: Map[String, String] = Map.empty
+
+ options match {
+ case Some(p) =>
+ confs = confs ++ HoodieCLIUtils.extractOptions(p.asInstanceOf[String])
Review Comment:
🤖 If the user passes a malformed options string that contains a token with
no `=` sign (e.g. `'hoodie.keep.min.commits=2,badtoken'`),
`HoodieCLIUtils.extractOptions` will split that token into a single-element
list and the `pair.get(1)` call inside it will throw
`IndexOutOfBoundsException`, surfacing as a confusing runtime error rather than
a helpful message. Could you add a try/catch or validate the format here (or
inside `extractOptions`) and throw an `IllegalArgumentException` with a clear
message about the expected `key=value` format?
##########
hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/cli/ArchiveExecutorUtils.java:
##########
@@ -53,12 +54,14 @@ public static int archive(JavaSparkContext jsc,
int maxCommits,
int commitsRetained,
boolean enableMetadata,
- String basePath) throws IOException {
+ String basePath,
+ Map<String, String> conf) throws IOException {
HoodieWriteConfig config =
HoodieWriteConfig.newBuilder().withPath(basePath)
.withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(minCommits,
maxCommits).build())
.withCleanConfig(HoodieCleanConfig.newBuilder().retainCommits(commitsRetained).build())
.withEmbeddedTimelineServerEnabled(false)
.withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(enableMetadata).build())
+ .withProps(conf)
.build();
Review Comment:
🤖 Since `withProps(conf)` is called after `archiveCommitsWith`,
`retainCommits`, and `enable(enableMetadata)`, any key in `conf` that overlaps
with those settings (e.g. `hoodie.keep.min.commits`, `hoodie.keep.max.commits`)
will silently override the named parameters. Is that the intended precedence?
If so, it might be worth a short comment here, because callers who pass both
`min_commits => 5` and `hoodie.keep.min.commits=10` in options will see
unexpected behaviour. Alternatively, moving `withProps(conf)` before the other
builder calls would give the named parameters higher priority.
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ArchiveCommitsProcedure.scala:
##########
@@ -36,7 +38,9 @@ class ArchiveCommitsProcedure extends BaseProcedure
ProcedureParameter.optional(2, "min_commits", DataTypes.IntegerType, 20),
ProcedureParameter.optional(3, "max_commits", DataTypes.IntegerType, 30),
ProcedureParameter.optional(4, "retain_commits", DataTypes.IntegerType,
10),
- ProcedureParameter.optional(5, "enable_metadata", DataTypes.BooleanType,
true)
+ ProcedureParameter.optional(5, "enable_metadata", DataTypes.BooleanType,
false),
Review Comment:
🤖 Looking at the diff, the default value for `enable_metadata` is still
`true` — the only change on that line is the addition of a trailing comma to
accommodate the new `options` parameter. The default wasn't modified.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]