hudi-agent commented on code in PR #18384:
URL: https://github.com/apache/hudi/pull/18384#discussion_r3280027883
##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergeOnReadTableInputFormat.java:
##########
@@ -415,7 +415,10 @@ private static RealtimeFileStatus
createRealtimeFileStatusUnchecked(HoodieLogFil
private static Option<HoodieVirtualKeyInfo>
getHoodieVirtualKeyInfo(HoodieTableMetaClient metaClient) {
HoodieTableConfig tableConfig = metaClient.getTableConfig();
- if (tableConfig.populateMetaFields()) {
+ // Virtual key info recomputes record-key/partition-path from source
fields. It is only
Review Comment:
🤖 The leading comment notes virtual key info "recomputes
record-key/partition-path from source fields" (both), but the check only gates
on `isRecordKeyPopulated()`. If a table has `_hoodie_record_key` populated but
`_hoodie_partition_path` excluded, this returns `Option.empty()` and
`HoodieRealtimeInputFormatUtils.addVirtualKeysProjection` will project the
(null) partition-path meta column. Should this use `!isKeyGeneratorRequired()`
(or check both flags) to match the predicate that the write-side
keygen-allocation sites converged on in
`HoodieMetaFieldFlags.isKeyGeneratorRequired()`?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieMetaFieldFlags.java:
##########
@@ -0,0 +1,294 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.TreeSet;
+
+/**
+ * Encapsulates which individual meta fields should be populated during writes.
+ * Provides named accessor methods for better readability compared to boolean
array indexing.
+ *
+ * <p>The flags correspond to the 5 standard Hudi meta columns in {@link
HoodieRecord#HOODIE_META_COLUMNS}:
+ * commit_time, commit_seqno, record_key, partition_path, file_name.
+ */
+public class HoodieMetaFieldFlags implements Serializable {
+
+ private static final long serialVersionUID = 1L;
+
+ private static final HoodieMetaFieldFlags ALL_POPULATED = new
HoodieMetaFieldFlags(true, true, true, true, true);
+ private static final HoodieMetaFieldFlags NONE_POPULATED = new
HoodieMetaFieldFlags(false, false, false, false, false);
+
+ private final boolean commitTimePopulated;
+ private final boolean commitSeqNoPopulated;
+ private final boolean recordKeyPopulated;
+ private final boolean partitionPathPopulated;
+ private final boolean fileNamePopulated;
+
+ private HoodieMetaFieldFlags(boolean commitTimePopulated, boolean
commitSeqNoPopulated,
+ boolean recordKeyPopulated, boolean
partitionPathPopulated,
+ boolean fileNamePopulated) {
+ this.commitTimePopulated = commitTimePopulated;
+ this.commitSeqNoPopulated = commitSeqNoPopulated;
+ this.recordKeyPopulated = recordKeyPopulated;
+ this.partitionPathPopulated = partitionPathPopulated;
+ this.fileNamePopulated = fileNamePopulated;
+ }
+
+ /**
+ * Returns an instance where all meta fields are populated.
+ *
+ * <p>Package-private: instances should be sourced via
+ * {@link HoodieTableConfig#getHoodieMetaFieldFlags()}. This factory is
exposed only for the
+ * internal {@link #fromConfig} path and same-package tests; external
callers must go through
+ * the table config so the on-disk persisted state is the single source of
truth.
+ */
+ static HoodieMetaFieldFlags allPopulated() {
+ return ALL_POPULATED;
+ }
+
+ /**
+ * Returns an instance where no meta fields are populated.
+ *
+ * <p>Package-private; see {@link #allPopulated()}.
+ */
+ static HoodieMetaFieldFlags nonePopulated() {
+ return NONE_POPULATED;
+ }
+
+ /**
+ * Creates an instance from a set of excluded field names.
+ *
+ * <p>Package-private; see {@link #allPopulated()}.
+ *
+ * @param excluded set of meta field names to exclude (e.g.
"_hoodie_record_key")
+ * @return HoodieMetaFieldFlags with excluded fields marked as not populated
+ * @throws IllegalArgumentException if {@code excluded} contains names that
are not in
+ * {@link HoodieRecord#HOODIE_META_COLUMNS}, so configuration typos
fail fast
+ * rather than being silently ignored.
+ */
+ static HoodieMetaFieldFlags fromExcludedFields(Set<String> excluded) {
+ if (excluded == null || excluded.isEmpty()) {
+ return ALL_POPULATED;
+ }
+ Set<String> unknown = new HashSet<>(excluded);
+ unknown.removeAll(HoodieRecord.HOODIE_META_COLUMNS);
+ if (!unknown.isEmpty()) {
+ throw new IllegalArgumentException(
+ "Unknown meta field name(s) in exclusion list: " + unknown
+ + ". Valid names are: " + HoodieRecord.HOODIE_META_COLUMNS);
+ }
+ return new HoodieMetaFieldFlags(
+ !excluded.contains(HoodieRecord.COMMIT_TIME_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD)
+ );
+ }
+
+ /**
+ * Resolves the flags from a {@link HoodieConfig} (typically the merged write
+ * config or the table config). Reads {@link
HoodieTableConfig#POPULATE_META_FIELDS}
+ * and {@link HoodieTableConfig#META_FIELDS_EXCLUDE_LIST}; the exclusion
list is
+ * a comma-separated list of meta-field names with whitespace tolerated.
+ *
+ * <p><b>Internal only.</b> The intended (and only supported) caller is
+ * {@link HoodieTableConfig#getHoodieMetaFieldFlags()}. All write- and
read-side code
+ * should reach the flags through {@code
HoodieTableConfig.getHoodieMetaFieldFlags()}
+ * so the persisted table state remains the single source of truth.
+ */
+ public static HoodieMetaFieldFlags fromConfig(HoodieConfig config) {
+ boolean populateMetaFields =
config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
+ if (!populateMetaFields) {
+ return NONE_POPULATED;
+ }
+ Set<String> excluded =
parseExcludeList(config.getString(HoodieTableConfig.META_FIELDS_EXCLUDE_LIST));
+ return fromExcludedFields(excluded);
+ }
Review Comment:
🤖 nit: `fromConfig` is declared `public` but the Javadoc says "<b>Internal
only.</b> The intended (and only supported) caller is
`HoodieTableConfig#getHoodieMetaFieldFlags()`". Could this be package-private
like the other factory methods (`allPopulated`, `nonePopulated`,
`fromExcludedFields`) so the access modifier actually enforces the documented
intent and prevents callers from bypassing the table-config gate?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -176,29 +179,40 @@ private void writeRow(InternalRow row) {
// over again)
UTF8String[] metaFields = new UTF8String[5];
UTF8String recordKey =
row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
- metaFields[3] =
row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD);
- // This is the only meta-field that is generated dynamically, hence
conversion b/w
- // [[String]] and [[UTF8String]] is unavoidable if
preserveHoodieMetadata is false
- metaFields[1] = shouldPreserveHoodieMetadata ?
row.getUTF8String(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD_ORD)
- :
UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement()));
- metaFields[0] = shouldPreserveHoodieMetadata ?
row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD)
- : commitTime;
- metaFields[2] = recordKey;
- metaFields[4] = fileName;
+ metaFields[0] = metaFieldFlags.isCommitTimePopulated()
Review Comment:
🤖 nit: the nested ternary `isX ? (shouldPreserve ? a : b) : null` is harder
to scan than a plain `if`-block, and the analogous logic in
`HoodieRowDataCreateHandle` uses explicit `if`/`else if`/`else` for the
identical three-way branch. Would it be worth matching that style here for
consistency?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieMetaFieldFlags.java:
##########
@@ -0,0 +1,294 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.exception.HoodieException;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.TreeSet;
+
+/**
+ * Encapsulates which individual meta fields should be populated during writes.
+ * Provides named accessor methods for better readability compared to boolean
array indexing.
+ *
+ * <p>The flags correspond to the 5 standard Hudi meta columns in {@link
HoodieRecord#HOODIE_META_COLUMNS}:
+ * commit_time, commit_seqno, record_key, partition_path, file_name.
+ */
+public class HoodieMetaFieldFlags implements Serializable {
+
+ private static final long serialVersionUID = 1L;
+
+ private static final HoodieMetaFieldFlags ALL_POPULATED = new
HoodieMetaFieldFlags(true, true, true, true, true);
+ private static final HoodieMetaFieldFlags NONE_POPULATED = new
HoodieMetaFieldFlags(false, false, false, false, false);
+
+ private final boolean commitTimePopulated;
+ private final boolean commitSeqNoPopulated;
+ private final boolean recordKeyPopulated;
+ private final boolean partitionPathPopulated;
+ private final boolean fileNamePopulated;
+
+ private HoodieMetaFieldFlags(boolean commitTimePopulated, boolean
commitSeqNoPopulated,
+ boolean recordKeyPopulated, boolean
partitionPathPopulated,
+ boolean fileNamePopulated) {
+ this.commitTimePopulated = commitTimePopulated;
+ this.commitSeqNoPopulated = commitSeqNoPopulated;
+ this.recordKeyPopulated = recordKeyPopulated;
+ this.partitionPathPopulated = partitionPathPopulated;
+ this.fileNamePopulated = fileNamePopulated;
+ }
+
+ /**
+ * Returns an instance where all meta fields are populated.
+ *
+ * <p>Package-private: instances should be sourced via
+ * {@link HoodieTableConfig#getHoodieMetaFieldFlags()}. This factory is
exposed only for the
+ * internal {@link #fromConfig} path and same-package tests; external
callers must go through
+ * the table config so the on-disk persisted state is the single source of
truth.
+ */
+ static HoodieMetaFieldFlags allPopulated() {
+ return ALL_POPULATED;
+ }
+
+ /**
+ * Returns an instance where no meta fields are populated.
+ *
+ * <p>Package-private; see {@link #allPopulated()}.
+ */
+ static HoodieMetaFieldFlags nonePopulated() {
+ return NONE_POPULATED;
+ }
+
+ /**
+ * Creates an instance from a set of excluded field names.
+ *
+ * <p>Package-private; see {@link #allPopulated()}.
+ *
+ * @param excluded set of meta field names to exclude (e.g.
"_hoodie_record_key")
+ * @return HoodieMetaFieldFlags with excluded fields marked as not populated
+ * @throws IllegalArgumentException if {@code excluded} contains names that
are not in
+ * {@link HoodieRecord#HOODIE_META_COLUMNS}, so configuration typos
fail fast
+ * rather than being silently ignored.
+ */
+ static HoodieMetaFieldFlags fromExcludedFields(Set<String> excluded) {
+ if (excluded == null || excluded.isEmpty()) {
+ return ALL_POPULATED;
+ }
+ Set<String> unknown = new HashSet<>(excluded);
+ unknown.removeAll(HoodieRecord.HOODIE_META_COLUMNS);
+ if (!unknown.isEmpty()) {
+ throw new IllegalArgumentException(
+ "Unknown meta field name(s) in exclusion list: " + unknown
+ + ". Valid names are: " + HoodieRecord.HOODIE_META_COLUMNS);
+ }
+ return new HoodieMetaFieldFlags(
+ !excluded.contains(HoodieRecord.COMMIT_TIME_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD),
+ !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD)
+ );
+ }
+
+ /**
+ * Resolves the flags from a {@link HoodieConfig} (typically the merged write
+ * config or the table config). Reads {@link
HoodieTableConfig#POPULATE_META_FIELDS}
+ * and {@link HoodieTableConfig#META_FIELDS_EXCLUDE_LIST}; the exclusion
list is
+ * a comma-separated list of meta-field names with whitespace tolerated.
+ *
+ * <p><b>Internal only.</b> The intended (and only supported) caller is
+ * {@link HoodieTableConfig#getHoodieMetaFieldFlags()}. All write- and
read-side code
+ * should reach the flags through {@code
HoodieTableConfig.getHoodieMetaFieldFlags()}
+ * so the persisted table state remains the single source of truth.
+ */
+ public static HoodieMetaFieldFlags fromConfig(HoodieConfig config) {
+ boolean populateMetaFields =
config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
+ if (!populateMetaFields) {
+ return NONE_POPULATED;
+ }
+ Set<String> excluded =
parseExcludeList(config.getString(HoodieTableConfig.META_FIELDS_EXCLUDE_LIST));
+ return fromExcludedFields(excluded);
+ }
+
+ /**
+ * Parses a comma-separated meta-field exclusion list into a set of trimmed
names.
+ * Returns an empty set for null or blank input. Used as the single parsing
point
+ * for {@link HoodieTableConfig#META_FIELDS_EXCLUDE_LIST} so callers do not
duplicate
+ * split/trim logic.
+ */
+ public static Set<String> parseExcludeList(String value) {
+ Set<String> excluded = new HashSet<>();
+ if (value == null || value.trim().isEmpty()) {
+ return excluded;
+ }
+ for (String field : value.split(",")) {
+ String trimmed = field.trim();
+ if (!trimmed.isEmpty()) {
+ excluded.add(trimmed);
+ }
+ }
+ return excluded;
+ }
+
+ public boolean isCommitTimePopulated() {
+ return commitTimePopulated;
+ }
+
+ public boolean isCommitSeqNoPopulated() {
+ return commitSeqNoPopulated;
+ }
+
+ public boolean isRecordKeyPopulated() {
+ return recordKeyPopulated;
+ }
+
+ public boolean isPartitionPathPopulated() {
+ return partitionPathPopulated;
+ }
+
+ public boolean isFileNamePopulated() {
+ return fileNamePopulated;
+ }
+
+ /**
+ * @return {@code true} if a key generator is needed at read/write time to
reconstruct the
+ * record key or partition path. The key generator derives both from
source fields,
+ * so a single instance is needed if either {@code
_hoodie_record_key} or
+ * {@code _hoodie_partition_path} is not populated on disk.
+ */
+ public boolean isKeyGeneratorRequired() {
+ return !recordKeyPopulated || !partitionPathPopulated;
+ }
+
+ /**
+ * Validates a transition from one (populateMetaFields, excludeList) state
to another.
+ * The set of valid states forms a lattice that can only move in one
direction:
+ *
+ * <ul>
+ * <li>{@code populate=true,exclude=X} -> {@code populate=true,exclude=Y}:
allowed iff
+ * {@code Y} is a superset of {@code X} (i.e. transitions can only
widen the set
+ * of excluded fields, never narrow it). Reintroducing a
previously-excluded field
+ * would lie to readers about historical files written while it was
null.</li>
+ * <li>{@code populate=true,*} -> {@code populate=false}: always allowed.
The exclude
+ * list is irrelevant once meta fields are disabled entirely.</li>
+ * <li>{@code populate=false} -> {@code populate=true}: rejected.
Historical files
+ * have empty meta fields and cannot be retro-populated.</li>
+ * <li>{@code populate=false} -> {@code populate=false}: allowed
(no-op).</li>
+ * </ul>
+ *
+ * <p>Both exclude sets must contain only names from {@link
HoodieRecord#HOODIE_META_COLUMNS};
+ * an unknown name in {@code newExclude} is rejected so config typos fail
fast.
+ *
+ * @throws HoodieException if the transition is not allowed
+ * @throws IllegalArgumentException if {@code newExclude} contains an
unknown meta-field name
+ */
+ public static void validateTransition(boolean oldPopulate, Set<String>
oldExclude,
+ boolean newPopulate, Set<String>
newExclude) {
+ Set<String> safeOldExclude = oldExclude == null ? new HashSet<>() :
oldExclude;
+ Set<String> safeNewExclude = newExclude == null ? new HashSet<>() :
newExclude;
+
+ // Fail fast on unknown names in the new exclude list (typos).
+ Set<String> unknownNew = new HashSet<>(safeNewExclude);
+ unknownNew.removeAll(HoodieRecord.HOODIE_META_COLUMNS);
+ if (!unknownNew.isEmpty()) {
+ throw new IllegalArgumentException(
+ "Unknown meta field name(s) in exclusion list: " + new
TreeSet<>(unknownNew)
+ + ". Valid names are: " + HoodieRecord.HOODIE_META_COLUMNS);
+ }
+
+ if (!oldPopulate) {
+ // Absorbing state: once populate=false, the only allowed transition is
staying false.
+ if (newPopulate) {
+ throw new HoodieException(
+ "Cannot transition " + HoodieTableConfig.POPULATE_META_FIELDS.key()
+ + " from false to true. Historical files written with meta
fields disabled "
+ + "cannot be retro-populated, which would corrupt reads that
rely on meta "
+ + "field availability.");
+ }
+ return;
+ }
+
+ if (!newPopulate) {
+ // populate=true -> populate=false is always allowed; the prior exclude
list becomes
+ // irrelevant once meta fields are disabled entirely.
+ return;
+ }
+
+ // populate=true -> populate=true: new exclude set must be a (non-strict)
superset of old.
+ Set<String> removed = new HashSet<>(safeOldExclude);
+ removed.removeAll(safeNewExclude);
+ if (!removed.isEmpty()) {
+ throw new HoodieException(
+ "Cannot remove field(s) " + new TreeSet<>(removed) + " from "
+ + HoodieTableConfig.META_FIELDS_EXCLUDE_LIST.key()
+ + ". The exclude list is monotonic - fields can only be added,
never removed, "
+ + "because historical files written while a field was excluded
contain null for "
+ + "that field and cannot be retro-populated. Current exclude
list: "
+ + new TreeSet<>(safeOldExclude) + ", attempted: " + new
TreeSet<>(safeNewExclude) + ".");
+ }
+ }
+
+ /**
+ * @return true if at least one of the 5 meta fields is populated.
Equivalent to the
+ * long-standing {@code populateMetaFields} cardinal switch: false
only when
+ * meta fields are entirely disabled (or every field is in the
exclusion list).
+ */
+ public boolean isAnyPopulated() {
+ return commitTimePopulated || commitSeqNoPopulated || recordKeyPopulated
+ || partitionPathPopulated || fileNamePopulated;
+ }
+
+ /**
+ * @return {@code commitTime} if {@link #isCommitTimePopulated()}, else
{@code null}.
+ */
+ public String getCommitTime(String commitTime) {
+ return commitTimePopulated ? commitTime : null;
+ }
+
+ /**
+ * @return {@code commitSeqNo} if {@link #isCommitSeqNoPopulated()}, else
{@code null}.
+ */
+ public String getCommitSeqNo(String commitSeqNo) {
+ return commitSeqNoPopulated ? commitSeqNo : null;
Review Comment:
🤖 nit: `getCommitTime(String)`, `getRecordKey(String)`, etc. follow the
getter naming convention but take a parameter and act as conditional
pass-throughs. At call sites this reads oddly —
`populateFlags.getRecordKey(key.getRecordKey())` has two back-to-back
`.getRecordKey()` calls with entirely different semantics. Something like
`commitTimeOrNull(String)` / `recordKeyOrNull(String)` would make the "return
value-or-null based on flag" semantics immediately clear.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]