wuwenchi commented on code in PR #8300:
URL: https://github.com/apache/hudi/pull/8300#discussion_r1161198464
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -59,17 +62,17 @@ public JavaRDD<HoodieRecord<T>>
repartitionRecords(JavaRDD<HoodieRecord<T>> reco
final String[] sortColumns = this.sortColumnNames;
final SerializableSchema schema = this.serializableSchema;
final boolean consistentLogicalTimestampEnabled =
this.consistentLogicalTimestampEnabled;
- return records.sortBy(
- record -> {
- Object recordValue = record.getColumnValues(schema.get(),
sortColumns, consistentLogicalTimestampEnabled);
- // null values are replaced with empty string for null_first order
- if (recordValue == null) {
- return StringUtils.EMPTY_STRING;
- } else {
- return StringUtils.objToString(recordValue);
- }
- },
Review Comment:
If the user specifies multiple sort fields, the original code has two
problems:
1.
https://github.com/apache/hudi/blob/3cc6233b58773d45a8726f70a75c6d1edda7b313/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L761-L763
The fields specified for the record are extracted and concatenated into a
single string. (This is wrong because multi-field sorting means sorting by
one field first and then by the next field, not splicing the contents of
the fields together and sorting the result.)
2.
https://github.com/apache/hudi/blob/3cc6233b58773d45a8726f70a75c6d1edda7b313/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java#L119-L121
`getRecordColumnValues` returns an `Object` (actually a string), but
`getColumnValues` forcibly turns it into an `Object[]`, and
`repartitionRecords` forcibly converts that back to an `Object` and calls
`toString` on it directly. As a result, the strings compared here are
actually object addresses, not the column values.
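
To make the two points concrete, here is a minimal plain-Java sketch. It does not use any Hudi classes: `MultiColumnSortSketch` and all of its methods are hypothetical names for illustration, and it assumes the column values are joined without a delimiter. It shows that sorting by a concatenated key can order rows differently from sorting field by field, and that calling `toString()` on an `Object[]` held as an `Object` prints the array type and identity hash instead of the contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MultiColumnSortSketch {

    // Point 1: a sort key built by splicing all column values into one string
    // (assumption: no delimiter between the values).
    static String concatenatedKey(Object[] values) {
        StringBuilder sb = new StringBuilder();
        for (Object v : values) {
            sb.append(v == null ? "" : v.toString());
        }
        return sb.toString();
    }

    // Field-by-field comparison: compare the first column, and only look at
    // the next column when the earlier ones are equal. Nulls become "".
    static int compareByColumns(Object[] a, Object[] b) {
        for (int i = 0; i < a.length; i++) {
            String left = a[i] == null ? "" : a[i].toString();
            String right = b[i] == null ? "" : b[i].toString();
            int c = left.compareTo(right);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    static List<Object[]> sortByConcatenation(List<Object[]> rows) {
        List<Object[]> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing(MultiColumnSortSketch::concatenatedKey));
        return copy;
    }

    static List<Object[]> sortByColumns(List<Object[]> rows) {
        List<Object[]> copy = new ArrayList<>(rows);
        copy.sort(MultiColumnSortSketch::compareByColumns);
        return copy;
    }

    // Point 2: an Object[] viewed as a plain Object uses Object.toString(),
    // which yields "[Ljava.lang.Object;@<hash>" rather than the elements.
    static String arrayAsObjectToString(Object[] values) {
        Object asObject = values;
        return asObject.toString();
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"ab", "c"},
            new Object[] {"a", "z"});

        // Concatenated keys are "abc" and "az", so ("ab", "c") sorts first.
        System.out.println(sortByConcatenation(rows).get(0)[0]); // prints "ab"

        // Field by field, "a" < "ab", so ("a", "z") sorts first.
        System.out.println(sortByColumns(rows).get(0)[0]); // prints "a"

        // The hash part differs between runs, so it is useless as a sort key.
        System.out.println(arrayAsObjectToString(new Object[] {"a", "z"}));
    }
}
```

With the rows `("ab", "c")` and `("a", "z")`, the concatenated keys `"abc"` and `"az"` reverse the order that a first-field comparison would give, which is the behavior described in point 1.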