the-other-tim-brown commented on code in PR #17671:
URL: https://github.com/apache/hudi/pull/17671#discussion_r2643998041
##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##########
@@ -236,7 +236,7 @@ protected EngineType getEngineType() {
* @return Config Builder
*/
public HoodieWriteConfig.Builder getConfigBuilder(HoodieFailedWritesCleaningPolicy cleaningPolicy) {
-   return getConfigBuilder(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA, HoodieIndex.IndexType.BLOOM, cleaningPolicy);
+   return getConfigBuilder(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA, HoodieIndex.IndexType.SIMPLE, cleaningPolicy);
Review Comment:
Updated the index default from `BLOOM` to `SIMPLE` to match the default users get. `SIMPLE` is also more efficient for the very small tables we write in our tests.
##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java:
##########
@@ -104,33 +105,36 @@ public static List<GenericRecord> getRecordsUsingInputFormat(StorageConfiguratio
.map(HoodieAvroUtils::createNewSchemaField)
.collect(Collectors.toList()));
-    List<GenericRecord> records = new ArrayList<>();
     try {
       FileInputFormat.setInputPaths(jobConf, String.join(",", inputPaths));
       InputSplit[] splits = inputFormat.getSplits(jobConf, inputPaths.size());
-
-      for (InputSplit split : splits) {
-        RecordReader recordReader = inputFormat.getRecordReader(split, jobConf, null);
-        Object key = recordReader.createKey();
-        ArrayWritable writable = (ArrayWritable) recordReader.createValue();
-        while (recordReader.next(key, writable)) {
-          GenericRecordBuilder newRecord = new GenericRecordBuilder(projectedSchema);
-          // writable returns an array with [field1, field2, _hoodie_commit_time,
-          // _hoodie_commit_seqno]
-          Writable[] values = writable.get();
-          schema.getFields().stream()
-              .filter(f -> !projectCols || projectedColumns.contains(f.name()))
-              .map(f -> Pair.of(projectedSchema.getFields().stream()
-                  .filter(p -> f.name().equals(p.name())).findFirst().get(), f))
-              .forEach(fieldsPair -> newRecord.set(fieldsPair.getKey(), values[fieldsPair.getValue().pos()]));
-          records.add(newRecord.build());
+      return Arrays.stream(splits).parallel().flatMap(split -> {
Review Comment:
We are testing over small files split across multiple partitions, so reading in parallel should not cause memory issues.
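A minimal, self-contained sketch of the pattern in the diff above, with `String[][]` standing in for `InputSplit[]` (the real code builds `GenericRecord`s from Hadoop `ArrayWritable`s): independent splits are read via a parallel stream and the per-split results flattened with `flatMap`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSplitRead {

    // Each "split" yields some records; flatMap merges them into one list.
    static List<String> readAll(String[][] splits) {
        return Arrays.stream(splits)
                .parallel()                    // splits are independent, so concurrent reads are safe
                .flatMap(Arrays::stream)       // flatten each split's records
                .collect(Collectors.toList()); // the collector preserves encounter order
    }

    public static void main(String[] args) {
        String[][] splits = {{"a", "b"}, {"c"}, {"d", "e"}};
        System.out.println(readAll(splits)); // [a, b, c, d, e]
    }
}
```

Note that `collect(Collectors.toList())` preserves encounter order even on a parallel stream, so the result is deterministic.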
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableCompaction.java:
##########
@@ -183,59 +180,53 @@ public void testWriteDuringCompaction(String payloadClass, HoodieIndex.IndexType
@ParameterizedTest
@MethodSource("writeLogTest")
public void testWriteLogDuringCompaction(boolean enableMetadataTable, boolean enableTimelineServer) throws IOException {
-    try {
-      // disable for this test because it seems like we process mor in a different order?
-      jsc().hadoopConfiguration().set(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false");
Review Comment:
Disabling MDT is not needed
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/TestHoodieClientMultiWriter.java:
##########
@@ -457,7 +457,7 @@ private void testHoodieClientBasicMultiWriterWithEarlyConflictDetection(String t
setUpMORTestTable();
}
- int heartBeatIntervalForCommit4 = 10 * 1000;
+ int heartBeatIntervalForCommit4 = 3 * 1000;
Review Comment:
This controls a sleep interval that was causing the test to run for a minimum of 30 seconds.
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -251,7 +251,12 @@ public void testMetadataTableBootstrap(HoodieTableType tableType, boolean addRol
// bootstrap with few commits
doPreBootstrapOperations(testTable);
- writeConfig = getWriteConfig(true, true);
+ writeConfig = getWriteConfigBuilder(true, true, false)
Review Comment:
Changes to this file revolve around skipping the col-stats index where
possible since the TestTable will not write real files causing a lot of errors
or warnings in the logs which makes it hard to debug CI issues
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieFileSystemViews.java:
##########
@@ -91,6 +91,10 @@ public static List<Arguments> tableTypeMetadataFSVTypeArgs() {
for (HoodieTableType tableType : HoodieTableType.values()) {
for (boolean enableMdt : Arrays.asList(true, false)) {
       for (FileSystemViewStorageType viewStorageType : Arrays.asList(FileSystemViewStorageType.MEMORY, FileSystemViewStorageType.SPILLABLE_DISK)) {
+        if (!enableMdt && viewStorageType == FileSystemViewStorageType.MEMORY) {
+ // This is the baseline case, no need to test here.
Review Comment:
The test is responsible for comparing an in-memory view without MDT to
another view so this case was essentially comparing the view to itself which
just cost test time.
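To make the pruning concrete, here is a hedged sketch of the argument-matrix construction above, using plain stand-in enums rather than the real `HoodieTableType` and `FileSystemViewStorageType`. The baseline combination (MDT off with the in-memory view) is skipped because it would be compared against itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArgsMatrix {
    // Stand-ins for HoodieTableType and FileSystemViewStorageType.
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
    enum ViewType { MEMORY, SPILLABLE_DISK }

    static List<Object[]> args() {
        List<Object[]> out = new ArrayList<>();
        for (TableType tableType : TableType.values()) {
            for (boolean enableMdt : Arrays.asList(true, false)) {
                for (ViewType viewType : ViewType.values()) {
                    if (!enableMdt && viewType == ViewType.MEMORY) {
                        continue; // baseline case: nothing to compare against
                    }
                    out.add(new Object[] {tableType, enableMdt, viewType});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 table types * (2 mdt flags * 2 view types - 1 skipped) = 6
        System.out.println(args().size()); // 6
    }
}
```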
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestConsistentBucketIndex.java:
##########
@@ -113,7 +113,7 @@ private void setUp(boolean populateMetaFields, boolean partitioned) throws Excep
.withIndexType(HoodieIndex.IndexType.BUCKET)
.withIndexKeyField("_row_key")
.withBucketIndexEngineType(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)
- .withBucketNum("8")
+ .withBucketNum("4")
Review Comment:
8 buckets across 3 partitions was causing some unnecessary overhead. This
helps improve test runtime
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -204,8 +204,9 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
val holder = new testRecordLevelIndexHolder
testRecordLevelIndex(testCase.tableType, testCase.streamingWriteEnabled, holder)
val writeConfig = getWriteConfig(holder.options)
-    new SparkRDDWriteClient(new HoodieSparkEngineContext(jsc), writeConfig)
-      .rollback(metaClient.getActiveTimeline.lastInstant().get().requestedTime())
+    val writeClient = new SparkRDDWriteClient(new HoodieSparkEngineContext(jsc), writeConfig)
+    writeClient.rollback(metaClient.getActiveTimeline.lastInstant().get().requestedTime())
+    writeClient.close()
Review Comment:
Changes to this file make sure write clients are always closed.
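The Scala change above closes the client explicitly. In Java tests the same guarantee can come from try-with-resources, since the write client's base class implements `AutoCloseable`. This is a sketch with a hypothetical `Writer` standing in for `SparkRDDWriteClient`, and a hypothetical instant time:

```java
public class CloseClientSketch {
    static int closedCount = 0;

    // Hypothetical stand-in for SparkRDDWriteClient.
    static class Writer implements AutoCloseable {
        void rollback(String instantTime) { /* rollback logic elided */ }
        @Override public void close() { closedCount++; }
    }

    public static void main(String[] args) {
        try (Writer writeClient = new Writer()) {
            writeClient.rollback("20240101000000"); // hypothetical instant time
        } // close() runs here automatically, even if rollback throws
        System.out.println(closedCount); // 1
    }
}
```

The advantage over a bare `close()` call is that the client is released even when the test body throws, which is exactly the leak this review is fixing.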
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataTableWithSparkDataSource.scala:
##########
@@ -121,7 +121,7 @@ class TestMetadataTableWithSparkDataSource extends SparkClientFunctionalTestHarn
// Files partition of MT
val filesPartitionDF = spark.read.format(hudi).load(s"$basePath/.hoodie/metadata/files")
// Smoke test
- filesPartitionDF.show()
+ assertEquals(4, filesPartitionDF.collect().length)
Review Comment:
Avoid `show` in this file to cut down on noise in the CI logs; asserting on the row count also makes the smoke test meaningful.
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java:
##########
@@ -621,12 +621,6 @@ protected void validateFilesPerPartition(HoodieTestTable testTable,
tableView.getAllFileGroups(partition).collect(Collectors.toList());
fileGroups.addAll(tableView.getAllReplacedFileGroups(partition).collect(Collectors.toList()));
- fileGroups.forEach(g -> log.info(g.toString()));
Review Comment:
Logging doesn't actually provide any benefit here, it is just costing CPU
and causing noise
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]