nsivabalan commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748781984
##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
           .withSpillableMapBasePath(config.getSpillableMapBasePath())
           .build();
-      HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-      recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-          tableConfig.getPayloadClass(),
-          tableConfig.getPreCombineField(),
-          tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-              tableConfig.getPartitionFieldProp()))));
+      if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+        HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+        HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+        recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+            tableConfig.getPayloadClass(),
+            tableConfig.getPreCombineField(),
+            tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                tableConfig.getPartitionFieldProp()))));
+      } else {
+        // Since there is no base file, fall back to reading log files
+        Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();
Review comment:
+1, I was about to suggest the same. We are nearing the release though, so I would suggest time-boxing this. If that doesn't work out, at least file a tracking ticket.
##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableClustering.java
##########
@@ -144,28 +144,109 @@ void testClustering(boolean doUpdates, boolean populateMetaFields, boolean prese
       assertEquals(allFiles.length, hoodieTable.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getLeft).count());
       // Do the clustering and validate
-      client.cluster(clusteringCommitTime, true);
+      doClusteringAndValidate(client, clusteringCommitTime, metaClient, cfg, dataGen);
+    }
+  }
-      metaClient = HoodieTableMetaClient.reload(metaClient);
-      final HoodieTable clusteredTable = HoodieSparkTable.create(cfg, context(), metaClient);
-      clusteredTable.getHoodieView().sync();
-      Stream<HoodieBaseFile> dataFilesToRead = Arrays.stream(dataGen.getPartitionPaths())
-          .flatMap(p -> clusteredTable.getBaseFileOnlyView().getLatestBaseFiles(p));
-      // verify there should be only one base file per partition after clustering.
-      assertEquals(dataGen.getPartitionPaths().length, dataFilesToRead.count());
-
-      HoodieTimeline timeline = metaClient.getCommitTimeline().filterCompletedInstants();
-      assertEquals(1, timeline.findInstantsAfter("003", Integer.MAX_VALUE).countInstants(),
-          "Expecting a single commit.");
-      assertEquals(clusteringCommitTime, timeline.lastInstant().get().getTimestamp());
-      assertEquals(HoodieTimeline.REPLACE_COMMIT_ACTION, timeline.lastInstant().get().getAction());
-      if (cfg.populateMetaFields()) {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.of("000")),
-            "Must contain 200 records");
-      } else {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.empty()));
+ private static Stream<Arguments> testClusteringWithNoBaseFiles() {
+ return Stream.of(
+ Arguments.of(true, true),
+ Arguments.of(true, false),
+ Arguments.of(false, true),
+ Arguments.of(false, false)
+ );
+ }
+
+ @ParameterizedTest
+ @MethodSource
+  void testClusteringWithNoBaseFiles(boolean doUpdates, boolean preserveCommitMetadata) throws Exception {
Review comment:
Something to think about: do we need to test the preserveCommitMetadata combinations here as well? We should be mindful of the total run time of all tests and try to reduce parameterized tests where possible. I will leave it to you to take a call on whether it's required.
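To make the run-time trade-off concrete, here is a minimal sketch (plain Java, with a hypothetical `ClusteringTestArgs` class that is not part of the PR) of trimming the 2x2 boolean cross product down to its "diagonal": each flag still takes both of its values at least once, but the suite runs half as many invocations. The same reduced list could back the test's `@MethodSource`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class illustrating the reduced argument set;
// names here are assumptions, not code from the PR under review.
public class ClusteringTestArgs {

    // Full cross product of (doUpdates, preserveCommitMetadata): 4 combinations.
    static List<boolean[]> fullCrossProduct() {
        return Arrays.asList(
                new boolean[] {true, true},
                new boolean[] {true, false},
                new boolean[] {false, true},
                new boolean[] {false, false});
    }

    // Diagonal subset: both flags are still exercised with true and false,
    // but only 2 test invocations are needed instead of 4.
    static List<boolean[]> reducedDiagonal() {
        return Arrays.asList(
                new boolean[] {true, true},
                new boolean[] {false, false});
    }
}
```

The trade-off is that flag *interactions* (e.g. `doUpdates=true` with `preserveCommitMetadata=false`) are no longer covered, so this only makes sense when the two parameters are believed to be independent.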
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]