prashantwason commented on a change in pull request #1687:
URL: https://github.com/apache/hudi/pull/1687#discussion_r435493796
##########
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##########
@@ -133,6 +136,12 @@ private void init(String fileId, String partitionPath,
HoodieBaseFile dataFileTo
// Create the writer for writing the new version file
storageWriter =
HoodieStorageWriterFactory.getStorageWriter(instantTime,
newFilePath, hoodieTable, config, writerSchema, sparkTaskContextSupplier);
+
+ if (hoodieTable.requireSortedRecords()) {
Review comment:
We already sort using RDD.sortBy() (see WriteHandle.java):

taggedRecords.sortBy(r -> r.getRecordKey(), true, taggedRecords.getNumPartitions());
Merging has an additional complication: during merging, we read the existing
records (which should already be sorted in the HFile) and update a "few" of
them, then write all of them back to a new HFile. So effectively we have two
sorted lists of records:
1. The records read from the existing HFile (sorted at the last write)
2. The records being updated (sorted in the WriteHandle code)
We could do this in three steps, but that would require a large amount of
memory or an ExternalSpillableMap:
1. Read all records from existing HFile
2. Apply updates in-memory
3. Write to new HFile
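The three-step approach above can be sketched roughly as follows. This is a
minimal illustration, not Hudi's actual API: the class and method names are
hypothetical, and records are simplified to string key/value pairs. The point
is that step 1 materializes every existing record before anything is written.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (names are illustrative, not Hudi's classes).
public class InMemoryMergeSketch {

  public static TreeMap<String, String> mergeInMemory(
      Map<String, String> existingRecords,   // step 1: all records read from the old HFile
      Map<String, String> updates) {         // the "few" updated records
    // Step 2: apply updates in memory; TreeMap keeps keys sorted,
    // which the new HFile requires.
    TreeMap<String, String> merged = new TreeMap<>(existingRecords);
    merged.putAll(updates);
    // Step 3 (not shown): iterate `merged` in key order and write the new HFile.
    return merged;
  }
}
```

The memory cost is the full record set held in `merged` at once, which is why
an ExternalSpillableMap would be needed for large files.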
The way I have implemented it is like a merge sort, which does not require
reading all the records from the HFile before applying updates.
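A rough sketch of that merge-sort-style pass, under the same simplifying
assumptions (hypothetical names, string key/value records rather than Hudi's
record classes): existing records are streamed in key order, updates are
substituted where keys match, and new keys are emitted at their sorted
position, so only the pending updates are held in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of merging two sorted record sets in one pass.
public class SortedMergeSketch {

  public static List<Map.Entry<String, String>> merge(
      Iterator<Map.Entry<String, String>> existingSorted,  // sorted at last write
      TreeMap<String, String> updates) {                   // sorted by WriteHandle
    List<Map.Entry<String, String>> out = new ArrayList<>();
    TreeMap<String, String> pending = new TreeMap<>(updates);
    while (existingSorted.hasNext()) {
      Map.Entry<String, String> rec = existingSorted.next();
      // Emit pending updates whose keys sort before this existing record
      // (these are inserts of new keys).
      while (!pending.isEmpty() && pending.firstKey().compareTo(rec.getKey()) < 0) {
        out.add(pending.pollFirstEntry());
      }
      // If this key was updated, write the updated value instead of the old one.
      String updated = pending.remove(rec.getKey());
      out.add(Map.entry(rec.getKey(), updated != null ? updated : rec.getValue()));
    }
    // Any remaining updates sort after the last existing key.
    while (!pending.isEmpty()) {
      out.add(pending.pollFirstEntry());
    }
    return out;
  }
}
```

Because both inputs are already sorted, the output is sorted without ever
buffering the full contents of the old HFile.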