hudi-agent commented on code in PR #18384:
URL: https://github.com/apache/hudi/pull/18384#discussion_r3262548725
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -1195,6 +1265,47 @@ public boolean populateMetaFields() {
return Boolean.parseBoolean(getStringOrDefault(POPULATE_META_FIELDS));
}
+ /**
+ * Checks if a specific meta field is excluded from population via
+ * {@link #META_FIELDS_EXCLUDE_LIST}.
+ *
+ * @param metaFieldName the meta field name to check (e.g. {@code
_hoodie_commit_time})
+ * @return true if the field is in the exclusion list
+ */
+ public boolean isMetaFieldExcluded(String metaFieldName) {
+ return
HoodieMetaFieldFlags.parseExcludeList(getString(META_FIELDS_EXCLUDE_LIST))
+ .contains(metaFieldName);
+ }
+
+ /**
+ * Returns the {@link HoodieMetaFieldFlags} reflecting POPULATE_META_FIELDS
and
+ * META_FIELDS_EXCLUDE_LIST as persisted on this table. This is the source
of truth
+ * for both writers and readers - {@link HoodieMetaFieldFlags} should never
be sourced
+ * from a writer-side config since the persisted table state is what
determines on-disk
+ * meta-field availability across commits.
+ */
+ public HoodieMetaFieldFlags getHoodieMetaFieldFlags() {
+ return HoodieMetaFieldFlags.fromConfig(this);
+ }
+
+ /**
+ * Overlays the persisted {@link #POPULATE_META_FIELDS} and {@link
#META_FIELDS_EXCLUDE_LIST}
+ * values onto the supplied {@link HoodieConfig} so that downstream
consumers - in particular
+ * the file-writer factories that only see a {@code HoodieConfig} - read the
on-disk state
+ * even if the writer config was constructed without the table-config merge
step.
+ *
+ * <p>Mutates and returns the input. Call sites that need to preserve the
original config
+ * should pass in a copy. The two keys are always written, including {@code
META_FIELDS_EXCLUDE_LIST}
Review Comment:
🤖 nit: since both call sites discard the return value
(`overlayMetaFieldProps(config)` as a bare statement), would you consider
making the return type `void`? The current `HoodieConfig` return makes readers
wonder whether a modified copy is returned — but the method mutates in place.
`void` would make the side-effect intent unambiguous.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriterFactory.java:
##########
@@ -115,8 +116,10 @@ private static HoodieInternalRowFileWriter
newLanceInternalRowFileWriter(Storage
}
private static Option<BloomFilter>
tryInstantiateBloomFilter(HoodieWriteConfig writeConfig) {
- // NOTE: Currently Bloom Filter is only going to be populated if
meta-fields are populated
- if (writeConfig.populateMetaFields()) {
+ // Bloom filter indexes the _hoodie_record_key value passed at write time.
It is only
Review Comment:
🤖 nit: the last clause 'either populateMetaFields=false or
_hoodie_record_key in META_FIELDS_EXCLUDE_LIST disables it' re-exposes the two
internal conditions that `isRecordKeyPopulated()` already encapsulates, and the
subject of 'it' is ambiguous. Something like 'Bloom filter is only viable when
`_hoodie_record_key` is populated on disk; `isRecordKeyPopulated()` covers both
the `populateMetaFields=false` and the exclude-list cases.' would be clearer.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordScanner.java:
##########
@@ -201,14 +202,19 @@ protected AbstractHoodieLogRecordScanner(HoodieStorage
storage, String basePath,
this.populateMetaFields = false;
this.recordKeyField = keyFieldOverride.get();
this.partitionPathFieldOpt = Option.empty();
- } else if (tableConfig.populateMetaFields()) {
- this.populateMetaFields = true;
- this.recordKeyField = HoodieRecord.RECORD_KEY_METADATA_FIELD;
- this.partitionPathFieldOpt =
Option.of(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
} else {
- this.populateMetaFields = false;
- this.recordKeyField = tableConfig.getRecordKeyFieldProp();
- this.partitionPathFieldOpt =
Option.of(tableConfig.getPartitionFieldProp());
+ // _hoodie_record_key may be null in log blocks when the column is
excluded via
+ // META_FIELDS_EXCLUDE_LIST (even with populate.meta.fields=true); in
that case the
+ // reader must read the configured source record-key field instead.
Partition path uses
+ // the analogous flag.
+ HoodieMetaFieldFlags flags =
HoodieMetaFieldFlags.fromConfig(tableConfig);
+ this.populateMetaFields = flags.isRecordKeyPopulated();
Review Comment:
🤖 I think this still NPEs when only `_hoodie_partition_path` is excluded
(record_key populated, populate=true). `populateMetaFields` ends up `true`, so
at line 466 the pair is `Option.empty()` and `populateMetaFields=true` is
passed to `wrapIntoHoodieRecordPayloadWithParams`. In
`createHoodieRecordFromAvro`, the `if (populateMetaFields)` branch calls the
4-arg `SpillableMapUtils.convertToHoodieRecordPayload`, which uses a hardcoded
`(META_RECORD_KEY, META_PARTITION_PATH)` pair and `Option.empty()` partition
name — so `record.get(_hoodie_partition_path).toString()` NPEs on the null
column. Could `populateMetaFields` here be `flags.isRecordKeyPopulated() &&
flags.isPartitionPathPopulated()` (or could the pair always be passed instead
of relying on the `populateMetaFields` flag)? @nsivabalan could you weigh in on
the safest fix here?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]