wuwenchi commented on code in PR #8300:
URL: https://github.com/apache/hudi/pull/8300#discussion_r1161198464


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDCustomColumnsSortPartitioner.java:
##########
@@ -59,17 +62,17 @@ public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> reco
     final String[] sortColumns = this.sortColumnNames;
     final SerializableSchema schema = this.serializableSchema;
     final boolean consistentLogicalTimestampEnabled = this.consistentLogicalTimestampEnabled;
-    return records.sortBy(
-        record -> {
-          Object recordValue = record.getColumnValues(schema.get(), sortColumns, consistentLogicalTimestampEnabled);
-          // null values are replaced with empty string for null_first order
-          if (recordValue == null) {
-            return StringUtils.EMPTY_STRING;
-          } else {
-            return StringUtils.objToString(recordValue);
-          }
-        },

Review Comment:
   
   If the user specifies multiple sort fields, the original code has two 
problems:

   1. 
https://github.com/apache/hudi/blob/3cc6233b58773d45a8726f70a75c6d1edda7b313/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L761-L763
   The values of the specified fields are extracted from the record and 
concatenated into a single string. This is wrong: multi-field sorting should 
order by the first field and then break ties with the next field, not sort on 
the concatenation of all field contents.

   2. 
https://github.com/apache/hudi/blob/3cc6233b58773d45a8726f70a75c6d1edda7b313/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieAvroRecord.java#L119-L121
   `getRecordColumnValues` returns an `Object` (actually a string), but 
`getColumnValues` turns it into an `Object[]`. `repartitionRecords` then 
treats that array as a plain `Object` and calls `toString` on it directly, so 
the strings being compared are effectively object addresses rather than the 
column values.
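
   To make the two points above concrete, here is a minimal standalone sketch 
in plain Java. It does not use any Hudi classes: `Row`, 
`sortByConcatenatedKey`, `sortFieldByField` and `keyFromWrappedValue` are 
hypothetical names for illustration only, and concatenating the columns 
without a separator is an assumption about the linked code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MultiColumnSortSketch {

  // Hypothetical stand-in for a record with two string sort columns.
  static final class Row {
    final String first;
    final String second;

    Row(String first, String second) {
      this.first = first;
      this.second = second;
    }

    @Override
    public String toString() {
      return "(" + first + ", " + second + ")";
    }
  }

  // Point 1: sort on one key built by concatenating the columns
  // (assumed here to be joined without a separator).
  static List<Row> sortByConcatenatedKey(List<Row> rows) {
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparing((Row r) -> r.first + r.second));
    return sorted;
  }

  // Expected multi-field behavior: order by the first column,
  // then break ties with the second column.
  static List<Row> sortFieldByField(List<Row> rows) {
    List<Row> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparing((Row r) -> r.first)
        .thenComparing((Row r) -> r.second));
    return sorted;
  }

  // Point 2: wrapping the value in an Object[] and stringifying the array
  // yields the array's type and identity hash, not the column value.
  static String keyFromWrappedValue(Object columnValue) {
    Object wrapped = new Object[] {columnValue};
    return String.valueOf(wrapped);
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(new Row("a", "z"), new Row("ab", "a"));
    // "aba" sorts before "az", so the concatenated key puts (ab, a) first.
    System.out.println(sortByConcatenatedKey(rows));
    // "a" sorts before "ab", so field-by-field comparison puts (a, z) first.
    System.out.println(sortFieldByField(rows));
    // Prints something like "[Ljava.lang.Object;@<hash>" instead of "a".
    System.out.println(keyFromWrappedValue("a"));
  }
}
```

   With the two rows in `main`, the concatenated key and the field-by-field 
comparator disagree on the order, and the key built from the wrapped array 
never contains the column value at all.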


