prashantwason commented on a change in pull request #1687:
URL: https://github.com/apache/hudi/pull/1687#discussion_r435493796
##########
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##########
@@ -133,6 +136,12 @@ private void init(String fileId, String partitionPath,
HoodieBaseFile dataFileTo
// Create the writer for writing the new version file
storageWriter =
HoodieStorageWriterFactory.getStorageWriter(instantTime,
newFilePath, hoodieTable, config, writerSchema, sparkTaskContextSupplier);
+
+ if (hoodieTable.requireSortedRecords()) {
Review comment:
We already sort using RDD.sortBy() (see WriteHandle.java):

taggedRecords.sortBy(r -> r.getRecordKey(), true, taggedRecords.getNumPartitions());
Merging has an additional complication: during merging, we read the existing
records (which should already be sorted in the HFile) and update a "few" of
them, then write all of them back to a new HFile. So effectively we have two
sorted lists of records:
1. The records read from the existing HFile (sorted at the last write)
2. The records being updated (sorted in the WriteHandle code)
We could do this in three steps, but that would require a large amount of
memory or an ExternalSpillableMap:
1. Read all records from existing HFile
2. Apply updates in-memory
3. Write to new HFile
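The three-step approach above can be sketched roughly as follows. This is a
minimal illustration, not Hudi's actual API: the class and method names are
hypothetical, and records are simplified to string key/value pairs. The point
is that step 1 materializes every existing record before anything is written.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (names are illustrative, not Hudi's classes).
public class InMemoryMergeSketch {

  public static TreeMap<String, String> mergeInMemory(
      Map<String, String> existingRecords,   // step 1: all records read from the old HFile
      Map<String, String> updates) {         // the "few" updated records
    // Step 2: apply updates in memory; TreeMap keeps keys sorted,
    // which the new HFile requires.
    TreeMap<String, String> merged = new TreeMap<>(existingRecords);
    merged.putAll(updates);
    // Step 3 (not shown): iterate `merged` in key order and write the new HFile.
    return merged;
  }
}
```

The memory cost is the full record set held in `merged` at once, which is why
an ExternalSpillableMap would be needed for large files.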
The way I have implemented it is like a merge sort, which does not require
reading all the records from the HFile before applying updates.
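A rough sketch of that merge-sort-style pass, under the same simplifying
assumptions (hypothetical names, string key/value records rather than Hudi's
record classes): existing records are streamed in key order, updates are
substituted where keys match, and new keys are emitted at their sorted
position, so only the pending updates are held in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of merging two sorted record sets in one pass.
public class SortedMergeSketch {

  public static List<Map.Entry<String, String>> merge(
      Iterator<Map.Entry<String, String>> existingSorted,  // sorted at last write
      TreeMap<String, String> updates) {                   // sorted by WriteHandle
    List<Map.Entry<String, String>> out = new ArrayList<>();
    TreeMap<String, String> pending = new TreeMap<>(updates);
    while (existingSorted.hasNext()) {
      Map.Entry<String, String> rec = existingSorted.next();
      // Emit pending updates whose keys sort before this existing record
      // (these are inserts of new keys).
      while (!pending.isEmpty() && pending.firstKey().compareTo(rec.getKey()) < 0) {
        out.add(pending.pollFirstEntry());
      }
      // If this key was updated, write the updated value instead of the old one.
      String updated = pending.remove(rec.getKey());
      out.add(Map.entry(rec.getKey(), updated != null ? updated : rec.getValue()));
    }
    // Any remaining updates sort after the last existing key.
    while (!pending.isEmpty()) {
      out.add(pending.pollFirstEntry());
    }
    return out;
  }
}
```

Because both inputs are already sorted, the output is sorted without ever
buffering the full contents of the old HFile.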