codope commented on code in PR #9581:
URL: https://github.com/apache/hudi/pull/9581#discussion_r1344501029
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDeleteBlock.java:
##########
@@ -65,17 +69,44 @@ public class HoodieDeleteBlock extends HoodieLogBlock {
   private static final Lazy<HoodieDeleteRecord.Builder> HOODIE_DELETE_RECORD_BUILDER_STUB =
       Lazy.lazily(HoodieDeleteRecord::newBuilder);
+  private final boolean writeRecordPositions;
+  // Records to delete, sorted based on the record position if writing record position to the log block header
   private DeleteRecord[] recordsToDelete;

-  public HoodieDeleteBlock(DeleteRecord[] recordsToDelete, Map<HeaderMetadataType, String> header) {
-    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>());
-    this.recordsToDelete = recordsToDelete;
+  public HoodieDeleteBlock(List<Pair<DeleteRecord, Long>> recordsToDelete,
+                           boolean writeRecordPositions,
+                           Map<HeaderMetadataType, String> header) {
+    this(Option.empty(), null, false, Option.empty(), header, new HashMap<>(), writeRecordPositions);
+    if (writeRecordPositions) {
+      recordsToDelete.sort((o1, o2) -> {
+        long v1 = o1.getRight();
+        long v2 = o2.getRight();
+        return Long.compare(v1, v2);
+      });
+      if (recordsToDelete.get(0).getRight() > -1L) {
+        addRecordPositionsToHeader(
Review Comment:
Record position can be invalid (-1) when:
1. the current location is not known (new inserts going to a log file), or
2. the base format is HFile (record position is not supported for HFile).
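The guard above relies on the preceding sort: the list is ordered ascending by position, so any invalid (-1) entry sinks to the front, and checking only the first element verifies that every position is valid. A minimal sketch of that logic, using `AbstractMap.SimpleEntry` as a hypothetical stand-in for Hudi's `Pair<DeleteRecord, Long>`:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionGuardSketch {

  // Stand-in for Pair<DeleteRecord, Long>: key = record key, value = base-file position.
  static boolean shouldWritePositions(List<SimpleEntry<String, Long>> records) {
    // Sort ascending by position; invalid (-1) positions end up at the front.
    records.sort(Comparator.comparingLong(SimpleEntry::getValue));
    // After the sort, all positions are valid iff the smallest one is > -1.
    return !records.isEmpty() && records.get(0).getValue() > -1L;
  }

  public static void main(String[] args) {
    List<SimpleEntry<String, Long>> valid = new ArrayList<>();
    valid.add(new SimpleEntry<>("key2", 5L));
    valid.add(new SimpleEntry<>("key1", 2L));
    System.out.println(shouldWritePositions(valid)); // prints "true"

    List<SimpleEntry<String, Long>> withUnknown = new ArrayList<>();
    withUnknown.add(new SimpleEntry<>("key3", -1L)); // e.g. a new insert going to the log file
    withUnknown.add(new SimpleEntry<>("key1", 2L));
    System.out.println(shouldWritePositions(withUnknown)); // prints "false"
  }
}
```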
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -89,10 +90,11 @@ public class HoodieAppendHandle<T, I, K, O> extends HoodieWriteHandle<T, I, K, O
   private static final int NUMBER_OF_RECORDS_TO_ESTIMATE_RECORD_SIZE = 100;
   protected final String fileId;
+  private final boolean writeRecordPositions;
Review Comment:
Renamed and moved the config a level up, into `HoodieWriteConfig`.
##########
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java:
##########
@@ -70,14 +75,31 @@ public abstract class HoodieDataBlock extends HoodieLogBlock {
    * NOTE: This ctor is used on the write-path (ie when records ought to be written into the log)
    */
   public HoodieDataBlock(List<HoodieRecord> records,
+                         boolean writeRecordPositions,
                          Map<HeaderMetadataType, String> header,
                          Map<HeaderMetadataType, String> footer,
                          String keyFieldName) {
     super(header, footer, Option.empty(), Option.empty(), null, false);
+    if (writeRecordPositions) {
+      records.sort((o1, o2) -> {
Review Comment:
I've enabled writing record positions for some of the existing tests, covering
the following scenarios:
1. MOR table with compaction.
2. MOR table with clustering.
3. MOR table with clustering and no base file (before clustering).
4. COW/MOR table with different index types.
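A side note on the sort that now appears in both `HoodieDeleteBlock` and `HoodieDataBlock`: the hand-written `(o1, o2) -> Long.compare(...)` lambda in the diffs can also be expressed with the `Comparator.comparingLong` factory. A sketch with a hypothetical `PositionedRecord` stand-in (not a Hudi class):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RecordSortSketch {

  // Hypothetical stand-in for a record that carries its base-file position.
  static class PositionedRecord {
    final String key;
    final long position;

    PositionedRecord(String key, long position) {
      this.key = key;
      this.position = position;
    }
  }

  // Same ordering as the (o1, o2) -> Long.compare(...) lambda in the diffs above.
  static void sortByPosition(List<PositionedRecord> records) {
    records.sort(Comparator.comparingLong((PositionedRecord r) -> r.position));
  }

  public static void main(String[] args) {
    List<PositionedRecord> records = new ArrayList<>();
    records.add(new PositionedRecord("b", 3L));
    records.add(new PositionedRecord("a", 1L));
    sortByPosition(records);
    System.out.println(records.get(0).key); // prints "a"
  }
}
```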
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -173,18 +173,18 @@ public static <R> HoodieRecord<R> tagRecord(HoodieRecord<R> record, HoodieRecord
    *
    * @param filePath            - File to filter keys from
    * @param candidateRecordKeys - Candidate keys to filter
-   * @return List of candidate keys that are available in the file
+   * @return List of pairs of candidate keys and positions that are available in the file
    */
-  public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
-                                                Configuration configuration) throws HoodieIndexException {
+  public static List<Pair<String, Long>> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
+                                                            Configuration configuration) throws HoodieIndexException {
     ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath));
-    List<String> foundRecordKeys = new ArrayList<>();
+    List<Pair<String, Long>> foundRecordKeys = new ArrayList<>();
     try (HoodieFileReader fileReader = HoodieFileReaderFactory.getReaderFactory(HoodieRecordType.AVRO)
         .getFileReader(configuration, filePath)) {
       // Load all rowKeys from the file, to double-confirm
       if (!candidateRecordKeys.isEmpty()) {
         HoodieTimer timer = HoodieTimer.start();
-        Set<String> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));
+        Set<Pair<String, Long>> fileRowKeys = fileReader.filterRowKeys(new TreeSet<>(candidateRecordKeys));
Review Comment:
Keys should be sorted for the HFile reader. For the other formats it doesn't
matter, as long as the keys are in a set (for an efficient contains check).
Incorporated the above suggestion: we now just pass a set and initialize a
`SortedSet` in `HoodieHFileAvroReader` only, instead of building a `TreeSet`
for every other format.
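To illustrate the design choice in that last sentence: only the HFile-style reader needs the candidate keys in sorted order, so it can build the `SortedSet` internally while every other reader accepts a plain `Set`. A rough sketch with hypothetical method names (not the actual `HoodieFileReader` API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class KeyFilterSketch {

  // Generic reader: any Set works, since only contains() checks are needed.
  static Set<String> filterRowKeysGeneric(Set<String> fileKeys, Set<String> candidateKeys) {
    Set<String> found = new HashSet<>();
    for (String key : fileKeys) {
      if (candidateKeys.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  // HFile-style reader: it scans keys in sorted order, so it builds the
  // SortedSet itself instead of forcing every caller to construct a TreeSet.
  static Set<String> filterRowKeysHFileStyle(Set<String> fileKeys, Set<String> candidateKeys) {
    Set<String> found = new HashSet<>();
    for (String key : new TreeSet<>(candidateKeys)) { // ordered lookups, as HFile seeks require
      if (fileKeys.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    Set<String> fileKeys = new HashSet<>(Arrays.asList("a", "b", "c"));
    Set<String> candidates = new HashSet<>(Arrays.asList("b", "z"));
    System.out.println(filterRowKeysGeneric(fileKeys, candidates));    // prints "[b]"
    System.out.println(filterRowKeysHFileStyle(fileKeys, candidates)); // prints "[b]"
  }
}
```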
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]