the-other-tim-brown commented on code in PR #17671:
URL: https://github.com/apache/hudi/pull/17671#discussion_r2643998041
##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/utils/HoodieWriterClientTestHarness.java:
##########
@@ -236,7 +236,7 @@ protected EngineType getEngineType() {
* @return Config Builder
*/
public HoodieWriteConfig.Builder getConfigBuilder(HoodieFailedWritesCleaningPolicy cleaningPolicy) {
-   return getConfigBuilder(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA, HoodieIndex.IndexType.BLOOM, cleaningPolicy);
+   return getConfigBuilder(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA, HoodieIndex.IndexType.SIMPLE, cleaningPolicy);
Review Comment:
Updated the index default from `BLOOM` to `SIMPLE` to match the default users get. `SIMPLE` is also more efficient for the very small tables we write in our tests.
##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java:
##########
@@ -104,33 +105,36 @@ public static List<GenericRecord> getRecordsUsingInputFormat(StorageConfiguratio
.map(HoodieAvroUtils::createNewSchemaField)
.collect(Collectors.toList()));
-    List<GenericRecord> records = new ArrayList<>();
     try {
       FileInputFormat.setInputPaths(jobConf, String.join(",", inputPaths));
       InputSplit[] splits = inputFormat.getSplits(jobConf, inputPaths.size());
-
-      for (InputSplit split : splits) {
-        RecordReader recordReader = inputFormat.getRecordReader(split, jobConf, null);
-        Object key = recordReader.createKey();
-        ArrayWritable writable = (ArrayWritable) recordReader.createValue();
-        while (recordReader.next(key, writable)) {
-          GenericRecordBuilder newRecord = new GenericRecordBuilder(projectedSchema);
-          // writable returns an array with [field1, field2, _hoodie_commit_time,
-          // _hoodie_commit_seqno]
-          Writable[] values = writable.get();
-          schema.getFields().stream()
-              .filter(f -> !projectCols || projectedColumns.contains(f.name()))
-              .map(f -> Pair.of(projectedSchema.getFields().stream()
-                  .filter(p -> f.name().equals(p.name())).findFirst().get(), f))
-              .forEach(fieldsPair -> newRecord.set(fieldsPair.getKey(), values[fieldsPair.getValue().pos()]));
-          records.add(newRecord.build());
+      return Arrays.stream(splits).parallel().flatMap(split -> {
Review Comment:
We are testing over small files split across multiple partitions, so reading in parallel should not cause memory issues.
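A minimal, self-contained sketch of the pattern in the diff above, with `String[][]` standing in for `InputSplit[]` (the real code builds `GenericRecord`s from Hadoop `ArrayWritable`s): independent splits are read via a parallel stream and the per-split results flattened with `flatMap`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSplitRead {

    // Each "split" yields some records; flatMap merges them into one list.
    static List<String> readAll(String[][] splits) {
        return Arrays.stream(splits)
                .parallel()                    // splits are independent, so concurrent reads are safe
                .flatMap(Arrays::stream)       // flatten each split's records
                .collect(Collectors.toList()); // the collector preserves encounter order
    }

    public static void main(String[] args) {
        String[][] splits = {{"a", "b"}, {"c"}, {"d", "e"}};
        System.out.println(readAll(splits)); // [a, b, c, d, e]
    }
}
```

Note that `collect(Collectors.toList())` preserves encounter order even on a parallel stream, so the result is deterministic.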
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableCompaction.java:
##########
@@ -183,59 +180,53 @@ public void testWriteDuringCompaction(String payloadClass, HoodieIndex.IndexType
@ParameterizedTest
@MethodSource("writeLogTest")
public void testWriteLogDuringCompaction(boolean enableMetadataTable, boolean enableTimelineServer) throws IOException {
-    try {
-      // disable for this test because it seems like we process mor in a different order?
-      jsc().hadoopConfiguration().set(HoodieReaderConfig.FILE_GROUP_READER_ENABLED.key(), "false");
Review Comment:
Disabling MDT is not needed
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/TestHoodieClientMultiWriter.java:
##########
@@ -457,7 +457,7 @@ private void testHoodieClientBasicMultiWriterWithEarlyConflictDetection(String t
setUpMORTestTable();
}
- int heartBeatIntervalForCommit4 = 10 * 1000;
+ int heartBeatIntervalForCommit4 = 3 * 1000;
Review Comment:
This controls a sleep interval that was causing the test to run for a minimum of 30 seconds.
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -251,7 +251,12 @@ public void testMetadataTableBootstrap(HoodieTableType tableType, boolean addRol
// bootstrap with few commits
doPreBootstrapOperations(testTable);
- writeConfig = getWriteConfig(true, true);
+ writeConfig = getWriteConfigBuilder(true, true, false)
Review Comment:
Changes to this file revolve around skipping the col-stats index where
possible since the TestTable will not write real files causing a lot of errors
or warnings in the logs which makes it hard to debug CI issues
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieFileSystemViews.java:
##########
@@ -91,6 +91,10 @@ public static List<Arguments> tableTypeMetadataFSVTypeArgs() {
for (HoodieTableType tableType : HoodieTableType.values()) {
for (boolean enableMdt : Arrays.asList(true, false)) {
       for (FileSystemViewStorageType viewStorageType : Arrays.asList(FileSystemViewStorageType.MEMORY, FileSystemViewStorageType.SPILLABLE_DISK)) {
+        if (!enableMdt && viewStorageType == FileSystemViewStorageType.MEMORY) {
+ // This is the baseline case, no need to test here.
Review Comment:
The test is responsible for comparing an in-memory view without MDT to
another view so this case was essentially comparing the view to itself which
just cost test time.
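To make the pruning concrete, here is a hedged sketch of the argument-matrix construction above, using plain stand-in enums rather than the real `HoodieTableType` and `FileSystemViewStorageType`. The baseline combination (MDT off with the in-memory view) is skipped because it would be compared against itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArgsMatrix {
    // Stand-ins for HoodieTableType and FileSystemViewStorageType.
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
    enum ViewType { MEMORY, SPILLABLE_DISK }

    static List<Object[]> args() {
        List<Object[]> out = new ArrayList<>();
        for (TableType tableType : TableType.values()) {
            for (boolean enableMdt : Arrays.asList(true, false)) {
                for (ViewType viewType : ViewType.values()) {
                    if (!enableMdt && viewType == ViewType.MEMORY) {
                        continue; // baseline case: nothing to compare against
                    }
                    out.add(new Object[] {tableType, enableMdt, viewType});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 table types * (2 mdt flags * 2 view types - 1 skipped) = 6
        System.out.println(args().size()); // 6
    }
}
```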
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestConsistentBucketIndex.java:
##########
@@ -113,7 +113,7 @@ private void setUp(boolean populateMetaFields, boolean partitioned) throws Excep
.withIndexType(HoodieIndex.IndexType.BUCKET)
.withIndexKeyField("_row_key")
.withBucketIndexEngineType(HoodieIndex.BucketIndexEngineType.CONSISTENT_HASHING)
- .withBucketNum("8")
+ .withBucketNum("4")
Review Comment:
8 buckets across 3 partitions was causing some unnecessary overhead. This
helps improve test runtime
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestRecordLevelIndex.scala:
##########
@@ -204,8 +204,9 @@ class TestRecordLevelIndex extends RecordLevelIndexTestBase with SparkDatasetMix
val holder = new testRecordLevelIndexHolder
testRecordLevelIndex(testCase.tableType, testCase.streamingWriteEnabled, holder)
val writeConfig = getWriteConfig(holder.options)
-    new SparkRDDWriteClient(new HoodieSparkEngineContext(jsc), writeConfig)
-      .rollback(metaClient.getActiveTimeline.lastInstant().get().requestedTime())
+    val writeClient = new SparkRDDWriteClient(new HoodieSparkEngineContext(jsc), writeConfig)
+    writeClient.rollback(metaClient.getActiveTimeline.lastInstant().get().requestedTime())
+    writeClient.close()
Review Comment:
Changes to this file make sure write clients are always closed.
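The Scala change above closes the client explicitly. In Java tests the same guarantee can come from try-with-resources, since the write client's base class implements `AutoCloseable`. This is a sketch with a hypothetical `Writer` standing in for `SparkRDDWriteClient`, and a hypothetical instant time:

```java
public class CloseClientSketch {
    static int closedCount = 0;

    // Hypothetical stand-in for SparkRDDWriteClient.
    static class Writer implements AutoCloseable {
        void rollback(String instantTime) { /* rollback logic elided */ }
        @Override public void close() { closedCount++; }
    }

    public static void main(String[] args) {
        try (Writer writeClient = new Writer()) {
            writeClient.rollback("20240101000000"); // hypothetical instant time
        } // close() runs here automatically, even if rollback throws
        System.out.println(closedCount); // 1
    }
}
```

The advantage over a bare `close()` call is that the client is released even when the test body throws, which is exactly the leak this review is fixing.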
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMetadataTableWithSparkDataSource.scala:
##########
@@ -121,7 +121,7 @@ class TestMetadataTableWithSparkDataSource extends SparkClientFunctionalTestHarn
// Files partition of MT
val filesPartitionDF = spark.read.format(hudi).load(s"$basePath/.hoodie/metadata/files")
// Smoke test
- filesPartitionDF.show()
+ assertEquals(4, filesPartitionDF.collect().length)
Review Comment:
Avoid `show` in this file to cut down on noise in the CI logs; asserting on the row count also makes the smoke test meaningful.
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java:
##########
@@ -621,12 +621,6 @@ protected void validateFilesPerPartition(HoodieTestTable testTable,
tableView.getAllFileGroups(partition).collect(Collectors.toList());
fileGroups.addAll(tableView.getAllReplacedFileGroups(partition).collect(Collectors.toList()));
- fileGroups.forEach(g -> log.info(g.toString()));
Review Comment:
Logging doesn't actually provide any benefit here, it is just costing CPU
and causing noise
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]