alexeykudinkin commented on code in PR #7413:
URL: https://github.com/apache/hudi/pull/7413#discussion_r1044849194
##########
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/execution/bulkinsert/JavaCustomColumnsSortPartitioner.java:
##########
@@ -40,8 +42,11 @@
private final String[] sortColumnNames;
private final Schema schema;
private final boolean consistentLogicalTimestampEnabled;
+ private final boolean isSorted;
- public JavaCustomColumnsSortPartitioner(String[] columnNames, Schema schema, boolean consistentLogicalTimestampEnabled) {
+ public JavaCustomColumnsSortPartitioner(String[] columnNames, Schema schema, boolean consistentLogicalTimestampEnabled, HoodieWriteConfig config) {
+ String partitionPath = config.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME);
Review Comment:
Ha! My mind completely ignored the "Spark" in the name. Conceptually, though, this
factory should be generic; there's no reason for it to be bound to Spark.
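To illustrate the point about keeping the factory engine-agnostic, here is a minimal sketch of what a generic partitioner contract could look like, parameterized on the records-collection type so no Spark types leak in. All names here (`BulkInsertPartitioner`, `customColumnsSort`) are illustrative, not the actual Hudi API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortPartitionerFactorySketch {

  // Engine-neutral contract: I is the engine's collection type
  // (e.g. List<T> for the Java client, an RDD type for Spark).
  interface BulkInsertPartitioner<I> {
    I repartitionRecords(I records, int outputPartitions);
  }

  // A factory method bound only to plain Java collections:
  // sorts the records with the supplied comparator.
  static <T> BulkInsertPartitioner<List<T>> customColumnsSort(Comparator<T> comparator) {
    return (records, outputPartitions) ->
        records.stream().sorted(comparator).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    BulkInsertPartitioner<List<String>> p = customColumnsSort(Comparator.naturalOrder());
    System.out.println(p.repartitionRecords(List.of("b", "a", "c"), 1)); // prints [a, b, c]
  }
}
```

A Spark-specific factory could then be a thin adapter over this generic shape rather than the other way around.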
##########
hudi-client/hudi-java-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestJavaBulkInsertInternalPartitioner.java:
##########
@@ -63,9 +65,11 @@ public void testCustomColumnSortPartitioner(String sortColumnString) throws Exce
getCustomColumnComparator(HoodieTestDataGenerator.AVRO_SCHEMA, sortColumns);
List<HoodieRecord> records = generateTestRecordsForBulkInsert(1000);
+ HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath("basePath").build();
+ cfg.setValue(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME, "partition_path");
testBulkInsertInternalPartitioner(
- new JavaCustomColumnsSortPartitioner(sortColumns, HoodieTestDataGenerator.AVRO_SCHEMA, false),
- records, true, generatePartitionNumRecords(records), Option.of(columnComparator));
+ new JavaCustomColumnsSortPartitioner(sortColumns, HoodieTestDataGenerator.AVRO_SCHEMA, false, cfg),
+ records, false, generatePartitionNumRecords(records), Option.of(columnComparator));
Review Comment:
Let's also add/modify a regression test with a check verifying the number of
files created (without your change, multiple files could be created within a
single partition).
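As a rough illustration of the regression check being suggested, the test could assert that each partition path maps to exactly one file ID after the bulk insert. The sketch below simulates the written (partitionPath, fileId) pairs rather than reading them from a real table; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OneFilePerPartitionCheck {

  // Returns true iff no partition path is associated with more than one file ID.
  static boolean singleFilePerPartition(List<String[]> partitionToFilePairs) {
    Map<String, Set<String>> filesPerPartition = new HashMap<>();
    for (String[] pair : partitionToFilePairs) {
      filesPerPartition.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
    }
    return filesPerPartition.values().stream().allMatch(files -> files.size() == 1);
  }

  public static void main(String[] args) {
    // Sorted input keeps each partition's records contiguous: one file each.
    List<String[]> ok = List.of(
        new String[]{"2015/03/16", "f1"},
        new String[]{"2015/03/17", "f2"});
    // Unsorted input could interleave partitions: two files in one partition.
    List<String[]> bad = List.of(
        new String[]{"2015/03/16", "f1"},
        new String[]{"2015/03/16", "f3"});
    System.out.println(singleFilePerPartition(ok) + " " + singleFilePerPartition(bad));
    // prints: true false
  }
}
```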
##########
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/execution/bulkinsert/JavaCustomColumnsSortPartitioner.java:
##########
@@ -40,8 +42,11 @@
private final String[] sortColumnNames;
private final Schema schema;
private final boolean consistentLogicalTimestampEnabled;
+ private final boolean isSorted;
- public JavaCustomColumnsSortPartitioner(String[] columnNames, Schema schema, boolean consistentLogicalTimestampEnabled) {
+ public JavaCustomColumnsSortPartitioner(String[] columnNames, Schema schema, boolean consistentLogicalTimestampEnabled, HoodieWriteConfig config) {
+ String partitionPath = config.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME);
Review Comment:
For now, then, let's extract the utility that parses partition-path fields from
`ComplexKeyGenerator` so it can be reused here.
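Hudi's partition-path field config is a comma-separated list, so the extracted helper would essentially split and trim that value. A minimal sketch of what such a utility might look like (the class and method names here are illustrative, not what was actually extracted):

```java
import java.util.Arrays;

public class PartitionPathFieldUtil {

  // Splits a comma-separated partition-path config value into trimmed,
  // non-empty field names.
  static String[] getPartitionPathFields(String configValue) {
    return Arrays.stream(configValue.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .toArray(String[]::new);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(getPartitionPathFields("region, partition_path")));
    // prints: [region, partition_path]
  }
}
```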
##########
hudi-client/hudi-java-client/src/test/java/org/apache/hudi/execution/bulkinsert/TestJavaBulkInsertInternalPartitioner.java:
##########
@@ -63,9 +65,11 @@ public void testCustomColumnSortPartitioner(String sortColumnString) throws Exce
getCustomColumnComparator(HoodieTestDataGenerator.AVRO_SCHEMA, sortColumns);
List<HoodieRecord> records = generateTestRecordsForBulkInsert(1000);
+ HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder().withPath("basePath").build();
+ cfg.setValue(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME, "partition_path");
testBulkInsertInternalPartitioner(
- new JavaCustomColumnsSortPartitioner(sortColumns, HoodieTestDataGenerator.AVRO_SCHEMA, false),
- records, true, generatePartitionNumRecords(records), Option.of(columnComparator));
+ new JavaCustomColumnsSortPartitioner(sortColumns, HoodieTestDataGenerator.AVRO_SCHEMA, false, cfg),
+ records, false, generatePartitionNumRecords(records), Option.of(columnComparator));
Review Comment:
We actually shouldn't be flipping this boolean, because the records are still
sorted, just in a different order: the one defined by the comparator above.
--
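To make the point concrete: records ordered by a custom column comparator are still sorted, and a sortedness check just has to use that same comparator instead of natural key order. A small self-contained illustration (all names here are hypothetical, not Hudi test helpers):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

public class SortedByComparatorCheck {

  // True iff every adjacent pair is non-decreasing under the given comparator.
  static <T> boolean isSorted(List<T> records, Comparator<T> comparator) {
    return IntStream.range(1, records.size())
        .allMatch(i -> comparator.compare(records.get(i - 1), records.get(i)) <= 0);
  }

  public static void main(String[] args) {
    // Compare records by their second "column" rather than their key.
    Comparator<String[]> byColumn = Comparator.comparing(r -> r[1]);
    List<String[]> records = List.of(
        new String[]{"k3", "a"}, new String[]{"k1", "b"}, new String[]{"k2", "c"});
    // Not sorted by key, but sorted under the custom comparator.
    System.out.println(isSorted(records, byColumn)); // prints: true
  }
}
```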
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]