the-other-tim-brown commented on code in PR #13136:
URL: https://github.com/apache/hudi/pull/13136#discussion_r2041259134


##########
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##########
@@ -199,26 +199,26 @@ public byte[] serializeRecordsToLogBlock(HoodieStorage storage,
      }

      final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName);
-      // If key exists in the map, append to its list. If not, create a new list.
-      // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist
-      List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
-      recordBytesList.add(recordBytes);
-      // Put the updated list back into the map
-      sortedRecordsMap.put(recordKey, recordBytesList);
+      if (sortedRecordsMap.containsKey(recordKey)) {
+        LOG.error("Found duplicate record with recordKey: " + recordKey);
+        printRecord("Previous record", sortedRecordsMap.get(recordKey), writerSchema);
+        printRecord("Current record", recordBytes, writerSchema);
+        throw new HoodieException(String.format("Writing multiple records with same key %s not supported for %s",
+            recordKey, this.getClass().getName()));

Review Comment:
   `this.getClass().getName()` will be `HFileUtils`; should this say something else, like "for record index"?
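
   As a sketch of this suggestion: name the feature in the message rather than the utility class. The `CONTEXT` constant and `duplicateKeyMessage` helper below are hypothetical, not Hudi code:

   ```java
   public class DuplicateKeyErrorDemo {
     // Hypothetical label naming the feature, per the review suggestion,
     // instead of this.getClass().getName() (which resolves to HFileUtils)
     private static final String CONTEXT = "record index";

     static String duplicateKeyMessage(String recordKey) {
       return String.format(
           "Writing multiple records with same key %s not supported for %s",
           recordKey, CONTEXT);
     }

     public static void main(String[] args) {
       System.out.println(duplicateKeyMessage("k1"));
     }
   }
   ```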



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##########
@@ -232,6 +232,15 @@ public byte[] serializeRecordsToLogBlock(HoodieStorage storage,
     return baos.toByteArray();
   }

+  /**
+   * Print the record in json format
+   */
+  private void printRecord(String msg, byte[] bs, Schema schema) throws IOException {
+    GenericRecord record = HoodieAvroUtils.bytesToAvro(bs, schema);
+    byte[] json = HoodieAvroUtils.avroToJson(record, true);
+    LOG.error(String.format("%s: %s", msg, new String(json)));

Review Comment:
   Just want to make sure this is just metadata and not actual data, since that can leak sensitive information into the logs.
   
   If so, also update this to use `{}`
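
   If the payload does need to stay out of the logs, one option is to log only record metadata. The `safeSummary` helper below is a hypothetical illustration of that idea, not Hudi's API:

   ```java
   public class SafeRecordLogDemo {
     // Hypothetical helper: summarize a record for logging using only
     // metadata (key, schema name, payload size), never field values,
     // so sensitive data cannot leak into the logs
     static String safeSummary(String msg, String recordKey, String schemaName, int payloadBytes) {
       return String.format("%s: key=%s, schema=%s, size=%d bytes", msg, recordKey, schemaName, payloadBytes);
     }

     public static void main(String[] args) {
       System.out.println(safeSummary("Previous record", "k1", "trip_record", 128));
     }
   }
   ```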



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:
##########
@@ -199,26 +199,26 @@ public byte[] serializeRecordsToLogBlock(HoodieStorage storage,
      }

      final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName);
-      // If key exists in the map, append to its list. If not, create a new list.
-      // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist
-      List<byte[]> recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>());
-      recordBytesList.add(recordBytes);
-      // Put the updated list back into the map
-      sortedRecordsMap.put(recordKey, recordBytesList);
+      if (sortedRecordsMap.containsKey(recordKey)) {
+        LOG.error("Found duplicate record with recordKey: " + recordKey);

Review Comment:
   Use `{}` instead of string concatenation
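
   For context on why `{}` is preferred: SLF4J's parameterized logging defers building the message until the level is known to be enabled, while concatenation always constructs the string. A self-contained sketch of that behavior, using a toy `logError` method rather than a real SLF4J `Logger`:

   ```java
   public class LazyLogDemo {
     static boolean errorEnabled = false;  // pretend ERROR logging is disabled
     static int messagesBuilt = 0;

     // Toy stand-in for SLF4J's parameterized logging: the message is
     // only formatted when the level is enabled
     static void logError(String template, Object arg) {
       if (errorEnabled) {
         messagesBuilt++;
         System.err.println(template.replace("{}", String.valueOf(arg)));
       }
     }

     public static void main(String[] args) {
       // With concatenation, "Found duplicate record with recordKey: " + key
       // would be built here regardless of the log level.
       logError("Found duplicate record with recordKey: {}", "k1");
       System.out.println(messagesBuilt);
     }
   }
   ```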



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroHFileWriter.java:
##########
@@ -73,14 +72,15 @@ public class HoodieAvroHFileWriter
   private final String instantTime;
   private final TaskContextSupplier taskContextSupplier;
   private final boolean populateMetaFields;
+  private final Schema schema;
   private final Option<Schema.Field> keyFieldSchema;
   private HFile.Writer writer;
   private String minRecordKey;
   private String maxRecordKey;
   private String prevRecordKey;
 
   // This is private in CacheConfig so have been copied here.
-  private static final String DROP_BEHIND_CACHE_COMPACTION_KEY = "hbase.hfile.drop.behind.compaction";
+  private static String DROP_BEHIND_CACHE_COMPACTION_KEY = "hbase.hfile.drop.behind.compaction";

Review Comment:
   keep this as `final`?



##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java:
##########
@@ -1980,6 +1980,9 @@ public void testColStatsPrefixLookup() throws IOException {

       // there are 3 partitions in total and 2 commits. total entries should be 6.
       assertEquals(result.size(), 6);
+      result.forEach(entry -> {
+        //LOG.warn("Prefix search entries just for record key col : " + entry.getRecordKey().toString() + " :: " + entry.getData().getColumnStatMetadata().get().toString());

Review Comment:
   Remove this commented-out code, and similarly do the same below?



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroHFileWriter.java:
##########
@@ -73,14 +72,15 @@ public class HoodieAvroHFileWriter
   private final String instantTime;
   private final TaskContextSupplier taskContextSupplier;
   private final boolean populateMetaFields;
+  private final Schema schema;

Review Comment:
   If this is unused, let's keep it removed



##########
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieHBaseAvroHFileReader.java:
##########
@@ -370,10 +370,7 @@ public IndexedRecord next() {
     return new KeyPrefixIterator();
   }
 
-  private static Iterator<IndexedRecord> getRecordByKeyIteratorInternal(HFileScanner scanner,
-                                                                        String key,
-                                                                        Schema writerSchema,
-                                                                        Schema readerSchema) throws IOException {
+  private static Option<IndexedRecord> fetchRecordByKeyInternal(HFileScanner scanner, String key, Schema writerSchema, Schema readerSchema) throws IOException {
     KeyValue kv = new KeyValue(getUTF8Bytes(key), null, null, null);

Review Comment:
   `getUTF8Bytes(key)` is evaluated twice, here and at line 397. Let's move this into a variable to ensure it's only computed once.
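
   A minimal sketch of the suggested hoisting; `getUTF8Bytes` here is a simple stand-in for Hudi's helper of the same name, and the call sites are indicated in comments:

   ```java
   import java.nio.charset.StandardCharsets;

   public class KeyBytesHoistDemo {
     // Stand-in for Hudi's getUTF8Bytes helper
     static byte[] getUTF8Bytes(String s) {
       return s.getBytes(StandardCharsets.UTF_8);
     }

     public static void main(String[] args) {
       String key = "record-key-001";
       // Compute the UTF-8 bytes once and reuse them at both call sites:
       //   first use:  new KeyValue(keyBytes, null, null, null)
       //   second use: the later comparison around line 397
       final byte[] keyBytes = getUTF8Bytes(key);
       System.out.println(keyBytes.length);
     }
   }
   ```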



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
