codope commented on a change in pull request #4840:
URL: https://github.com/apache/hudi/pull/4840#discussion_r810993836
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestMergeOnReadRollbackActionExecutor.java
##########
@@ -169,4 +318,13 @@ public void testRollbackWhenFirstCommitFail() throws Exception {
     client.rollback("001");
   }
 }
+
+  private void setUpDFS() throws IOException {
+    initDFS();
+    initSparkContexts();
+    //just generate tow partitions
Review comment:
nit: spell-check, `tow` should be `two`.
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
##########
@@ -156,19 +156,21 @@ private void initKeyGenIfNeeded(boolean populateMetaFields) {
       LOG.info("RDD PreppedRecords was persisted at: " + inputRecordsRDD.getStorageLevel());
     }
-    WorkloadProfile profile = null;
+    WorkloadProfile workloadProfile = null;
     if (isWorkloadProfileNeeded()) {
       context.setJobStatus(this.getClass().getSimpleName(), "Building workload profile");
-      profile = new WorkloadProfile(buildProfile(inputRecordsRDD), operationType);
-      LOG.info("Workload profile :" + profile);
-      saveWorkloadProfileMetadataToInflight(profile, instantTime);
+      workloadProfile = new WorkloadProfile(buildProfile(inputRecordsRDD), operationType);
+      LOG.info("Input workload profile :" + workloadProfile);
     }
     // handle records update with clustering
     JavaRDD<HoodieRecord<T>> inputRecordsRDDWithClusteringUpdate = clusteringHandleUpdate(inputRecordsRDD);
     // partition using the insert partitioner
-    final Partitioner partitioner = getPartitioner(profile);
+    final Partitioner partitioner = getPartitioner(workloadProfile);
+    if (isWorkloadProfileNeeded()) {
+      saveWorkloadProfileMetadataToInflight(workloadProfile, instantTime);
Review comment:
Just for my understanding, why do we want to save the workload profile after
handling clustering updates? If `clusteringHandleUpdate` throws an exception,
the profile will never be saved in metadata, right?
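To make the ordering concern concrete, here is a minimal standalone sketch. The method names only mirror the diff and are hypothetical stand-ins, not the actual Hudi code: once the save step is moved after a step that can throw, the profile is never persisted on failure.

```java
// Standalone sketch; method names mirror the diff but are NOT Hudi APIs.
public class SaveOrderingSketch {
    static boolean profilePersisted = false;

    static void clusteringHandleUpdate(boolean fail) {
        if (fail) {
            throw new RuntimeException("clustering update failed");
        }
    }

    static void saveWorkloadProfileMetadataToInflight() {
        profilePersisted = true;
    }

    static void run(boolean clusteringFails) {
        // workload profile is built earlier; with the reordered flow the save
        // only happens after the clustering-update step succeeds
        clusteringHandleUpdate(clusteringFails);   // may throw
        saveWorkloadProfileMetadataToInflight();   // skipped if the line above throws
    }

    public static void main(String[] args) {
        try {
            run(true);
        } catch (RuntimeException e) {
            // exception propagates; the profile was never persisted
        }
        System.out.println(profilePersisted); // prints: false
    }
}
```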
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java
##########
@@ -235,6 +243,7 @@ private void assignInserts(WorkloadProfile profile, HoodieEngineContext context)
       LOG.info("Total insert buckets for partition path " + partitionPath + " => " + insertBuckets);
       partitionPathToInsertBucketInfos.put(partitionPath, insertBuckets);
     }
+    profile.getOutputPartitionPathStatMap().putIfAbsent(partitionPath, outputWorkloadStats);
Review comment:
Why `putIfAbsent`? Should we not update the stat in this case?
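For reference, a minimal sketch of the semantic difference the question hinges on, using a plain `java.util.HashMap` with a made-up partition-path key: `putIfAbsent` is a no-op when the key is already mapped, whereas `put` replaces the existing value.

```java
import java.util.HashMap;
import java.util.Map;

public class PutIfAbsentDemo {
    public static void main(String[] args) {
        Map<String, Long> partitionStats = new HashMap<>();

        partitionStats.put("2016/03/15", 10L);
        partitionStats.putIfAbsent("2016/03/15", 99L); // no-op: key already mapped
        System.out.println(partitionStats.get("2016/03/15")); // prints: 10

        partitionStats.put("2016/03/15", 7L);          // put overwrites
        System.out.println(partitionStats.get("2016/03/15")); // prints: 7
    }
}
```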
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
##########
@@ -156,19 +156,21 @@ private void initKeyGenIfNeeded(boolean populateMetaFields) {
       LOG.info("RDD PreppedRecords was persisted at: " + inputRecordsRDD.getStorageLevel());
     }
-    WorkloadProfile profile = null;
+    WorkloadProfile workloadProfile = null;
     if (isWorkloadProfileNeeded()) {
       context.setJobStatus(this.getClass().getSimpleName(), "Building workload profile");
-      profile = new WorkloadProfile(buildProfile(inputRecordsRDD), operationType);
-      LOG.info("Workload profile :" + profile);
-      saveWorkloadProfileMetadataToInflight(profile, instantTime);
+      workloadProfile = new WorkloadProfile(buildProfile(inputRecordsRDD), operationType);
+      LOG.info("Input workload profile :" + workloadProfile);
     }
     // handle records update with clustering
     JavaRDD<HoodieRecord<T>> inputRecordsRDDWithClusteringUpdate = clusteringHandleUpdate(inputRecordsRDD);
     // partition using the insert partitioner
-    final Partitioner partitioner = getPartitioner(profile);
+    final Partitioner partitioner = getPartitioner(workloadProfile);
+    if (isWorkloadProfileNeeded()) {
Review comment:
Alternatively, the condition can be changed to `if (workloadProfile != null)`
to make it more readable, IMO. But I'll leave it up to you.
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/rollback/TestMergeOnReadRollbackActionExecutor.java
##########
@@ -139,6 +157,137 @@ public void testMergeOnReadRollbackActionExecutor(boolean isUsingMarkers) throws
     assertFalse(WriteMarkersFactory.get(cfg.getMarkersType(), table, "002").doesMarkerDirExist());
   }
+  @Test
+  public void testRollbackForCanIndexLogFile() throws IOException {
+    cleanupResources();
+    setUpDFS();
+    //1. prepare data and assert data result
+    //just generate one partitions
+    dataGen = new HoodieTestDataGenerator(new String[]{DEFAULT_FIRST_PARTITION_PATH});
+    HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath(basePath).withSchema(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA)
+        .withParallelism(2, 2).withBulkInsertParallelism(2).withFinalizeWriteParallelism(2).withDeleteParallelism(2)
+        .withTimelineLayoutVersion(TimelineLayoutVersion.CURR_VERSION)
+        .withWriteStatusClass(MetadataMergeWriteStatus.class)
+        .withConsistencyGuardConfig(ConsistencyGuardConfig.newBuilder().withConsistencyCheckEnabled(true).build())
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().compactionSmallFileSize(1024 * 1024).build())
+        .withStorageConfig(HoodieStorageConfig.newBuilder().hfileMaxFileSize(1024 * 1024).parquetMaxFileSize(1024 * 1024).build())
+        .forTable("test-trip-table")
+        .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.INMEMORY).build())
+        .withEmbeddedTimelineServerEnabled(true).withFileSystemViewConfig(FileSystemViewStorageConfig.newBuilder()
+            .withEnableBackupForRemoteFileSystemView(false) // Fail test if problem connecting to timeline-server
+            .withStorageType(FileSystemViewStorageType.EMBEDDED_KV_STORE).build()).withRollbackUsingMarkers(false).withAutoCommit(false).build();
+
+    //1. prepare data
+    HoodieTestDataGenerator.writePartitionMetadata(fs, new String[]{DEFAULT_FIRST_PARTITION_PATH}, basePath);
+    SparkRDDWriteClient client = getHoodieWriteClient(cfg);
+    /**
Review comment:
Maybe we can remove this multi-line comment here and below?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]