nsivabalan commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748781984
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
           .withSpillableMapBasePath(config.getSpillableMapBasePath())
           .build();
-      HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-      recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-          tableConfig.getPayloadClass(),
-          tableConfig.getPreCombineField(),
-          tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-              tableConfig.getPartitionFieldProp()))));
+      if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+        HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+        HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+        recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+            tableConfig.getPayloadClass(),
+            tableConfig.getPreCombineField(),
+            tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                tableConfig.getPartitionFieldProp()))));
+      } else {
+        // Since there is no base file, fall back to reading log files
+        Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();
Review comment:
+1, I was about to suggest the same. We are nearing the release though, so I would suggest time-boxing this. If that doesn't work out, at least file a tracking ticket.
##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableClustering.java
##########
@@ -144,28 +144,109 @@ void testClustering(boolean doUpdates, boolean populateMetaFields, boolean prese
       assertEquals(allFiles.length, hoodieTable.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getLeft).count());
       // Do the clustering and validate
-      client.cluster(clusteringCommitTime, true);
+      doClusteringAndValidate(client, clusteringCommitTime, metaClient, cfg, dataGen);
+    }
+  }
-      metaClient = HoodieTableMetaClient.reload(metaClient);
-      final HoodieTable clusteredTable = HoodieSparkTable.create(cfg, context(), metaClient);
-      clusteredTable.getHoodieView().sync();
-      Stream<HoodieBaseFile> dataFilesToRead = Arrays.stream(dataGen.getPartitionPaths())
-          .flatMap(p -> clusteredTable.getBaseFileOnlyView().getLatestBaseFiles(p));
-      // verify there should be only one base file per partition after clustering.
-      assertEquals(dataGen.getPartitionPaths().length, dataFilesToRead.count());
-
-      HoodieTimeline timeline = metaClient.getCommitTimeline().filterCompletedInstants();
-      assertEquals(1, timeline.findInstantsAfter("003", Integer.MAX_VALUE).countInstants(),
-          "Expecting a single commit.");
-      assertEquals(clusteringCommitTime, timeline.lastInstant().get().getTimestamp());
-      assertEquals(HoodieTimeline.REPLACE_COMMIT_ACTION, timeline.lastInstant().get().getAction());
-      if (cfg.populateMetaFields()) {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.of("000")),
-            "Must contain 200 records");
-      } else {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.empty()));
+ private static Stream<Arguments> testClusteringWithNoBaseFiles() {
+ return Stream.of(
+ Arguments.of(true, true),
+ Arguments.of(true, false),
+ Arguments.of(false, true),
+ Arguments.of(false, false)
+ );
+ }
+
+ @ParameterizedTest
+ @MethodSource
+  void testClusteringWithNoBaseFiles(boolean doUpdates, boolean preserveCommitMetadata) throws Exception {
Review comment:
Something to think about: do we need to test the preserveCommitMetadata combinations here as well? We should be mindful of the total run time of all tests and try to reduce parameterized tests where possible. I will leave it to you to take a call on whether it's required.
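To make the run-time trade-off concrete, here is a minimal sketch (plain Java, with a hypothetical `ClusteringTestArgs` class that is not part of the PR) of trimming the 2x2 boolean cross product down to its "diagonal": each flag still takes both of its values at least once, but the suite runs half as many invocations. The same reduced list could back the test's `@MethodSource`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class illustrating the reduced argument set;
// names here are assumptions, not code from the PR under review.
public class ClusteringTestArgs {

    // Full cross product of (doUpdates, preserveCommitMetadata): 4 combinations.
    static List<boolean[]> fullCrossProduct() {
        return Arrays.asList(
                new boolean[] {true, true},
                new boolean[] {true, false},
                new boolean[] {false, true},
                new boolean[] {false, false});
    }

    // Diagonal subset: both flags are still exercised with true and false,
    // but only 2 test invocations are needed instead of 4.
    static List<boolean[]> reducedDiagonal() {
        return Arrays.asList(
                new boolean[] {true, true},
                new boolean[] {false, false});
    }
}
```

The trade-off is that flag *interactions* (e.g. `doUpdates=true` with `preserveCommitMetadata=false`) are no longer covered, so this only makes sense when the two parameters are believed to be independent.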
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]