hudi-agent commented on code in PR #18384:
URL: https://github.com/apache/hudi/pull/18384#discussion_r3262548725


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -1195,6 +1265,47 @@ public boolean populateMetaFields() {
     return Boolean.parseBoolean(getStringOrDefault(POPULATE_META_FIELDS));
   }
 
+  /**
+   * Checks if a specific meta field is excluded from population via
+   * {@link #META_FIELDS_EXCLUDE_LIST}.
+   *
+   * @param metaFieldName the meta field name to check (e.g. {@code 
_hoodie_commit_time})
+   * @return true if the field is in the exclusion list
+   */
+  public boolean isMetaFieldExcluded(String metaFieldName) {
+    return 
HoodieMetaFieldFlags.parseExcludeList(getString(META_FIELDS_EXCLUDE_LIST))
+        .contains(metaFieldName);
+  }
+
+  /**
+   * Returns the {@link HoodieMetaFieldFlags} reflecting POPULATE_META_FIELDS 
and
+   * META_FIELDS_EXCLUDE_LIST as persisted on this table. This is the source 
of truth
+   * for both writers and readers - {@link HoodieMetaFieldFlags} should never 
be sourced
+   * from a writer-side config since the persisted table state is what 
determines on-disk
+   * meta-field availability across commits.
+   */
+  public HoodieMetaFieldFlags getHoodieMetaFieldFlags() {
+    return HoodieMetaFieldFlags.fromConfig(this);
+  }
+
+  /**
+   * Overlays the persisted {@link #POPULATE_META_FIELDS} and {@link 
#META_FIELDS_EXCLUDE_LIST}
+   * values onto the supplied {@link HoodieConfig} so that downstream 
consumers - in particular
+   * the file-writer factories that only see a {@code HoodieConfig} - read the 
on-disk state
+   * even if the writer config was constructed without the table-config merge 
step.
+   *
+   * <p>Mutates and returns the input. Call sites that need to preserve the 
original config
+   * should pass in a copy. The two keys are always written, including {@code 
META_FIELDS_EXCLUDE_LIST}

Review Comment:
   🤖 nit: since both call sites discard the return value 
(`overlayMetaFieldProps(config)` as a bare statement), would you consider 
making the return type `void`? The current `HoodieConfig` return makes readers 
wonder whether a modified copy is returned — but the method mutates in place. 
`void` would make the side-effect intent unambiguous.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowFileWriterFactory.java:
##########
@@ -115,8 +116,10 @@ private static HoodieInternalRowFileWriter 
newLanceInternalRowFileWriter(Storage
   }
 
   private static Option<BloomFilter> 
tryInstantiateBloomFilter(HoodieWriteConfig writeConfig) {
-    // NOTE: Currently Bloom Filter is only going to be populated if 
meta-fields are populated
-    if (writeConfig.populateMetaFields()) {
+    // Bloom filter indexes the _hoodie_record_key value passed at write time. 
It is only

Review Comment:
   🤖 nit: the last clause 'either populateMetaFields=false or 
_hoodie_record_key in META_FIELDS_EXCLUDE_LIST disables it' re-exposes the two 
internal conditions that `isRecordKeyPopulated()` already encapsulates, and the 
subject of 'it' is ambiguous. Something like 'Bloom filter is only viable when 
`_hoodie_record_key` is populated on disk; `isRecordKeyPopulated()` covers both 
the `populateMetaFields=false` and the exclude-list cases.' would be clearer.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordScanner.java:
##########
@@ -201,14 +202,19 @@ protected AbstractHoodieLogRecordScanner(HoodieStorage 
storage, String basePath,
       this.populateMetaFields = false;
       this.recordKeyField = keyFieldOverride.get();
       this.partitionPathFieldOpt = Option.empty();
-    } else if (tableConfig.populateMetaFields()) {
-      this.populateMetaFields = true;
-      this.recordKeyField = HoodieRecord.RECORD_KEY_METADATA_FIELD;
-      this.partitionPathFieldOpt = 
Option.of(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
     } else {
-      this.populateMetaFields = false;
-      this.recordKeyField = tableConfig.getRecordKeyFieldProp();
-      this.partitionPathFieldOpt = 
Option.of(tableConfig.getPartitionFieldProp());
+      // _hoodie_record_key may be null in log blocks when the column is 
excluded via
+      // META_FIELDS_EXCLUDE_LIST (even with populate.meta.fields=true); in 
that case the
+      // reader must read the configured source record-key field instead. 
Partition path uses
+      // the analogous flag.
+      HoodieMetaFieldFlags flags = 
HoodieMetaFieldFlags.fromConfig(tableConfig);
+      this.populateMetaFields = flags.isRecordKeyPopulated();

Review Comment:
   🤖 I think this still NPEs when only `_hoodie_partition_path` is excluded 
(record_key populated, populate=true). `populateMetaFields` ends up `true`, so 
at line 466 the pair is `Option.empty()` and `populateMetaFields=true` is 
passed to `wrapIntoHoodieRecordPayloadWithParams`. In 
`createHoodieRecordFromAvro`, the `if (populateMetaFields)` branch calls the 
4-arg `SpillableMapUtils.convertToHoodieRecordPayload`, which uses a hardcoded 
`(META_RECORD_KEY, META_PARTITION_PATH)` pair and `Option.empty()` partition 
name — so `record.get(_hoodie_partition_path).toString()` NPEs on the null 
column. Could `populateMetaFields` here be `flags.isRecordKeyPopulated() && 
flags.isPartitionPathPopulated()` (or could the pair always be passed instead 
of relying on the `populateMetaFields` flag)? @nsivabalan could you weigh in on 
the safest fix here?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to