Re: [PR] [HUDI-9505] Index lookup feature [hudi]

via GitHub Wed, 02 Jul 2025 22:02:44 -0700


Davis-Zhang-Onehouse commented on code in PR #13489:
URL: https://github.com/apache/hudi/pull/13489#discussion_r2181732288



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -277,61 +268,291 @@ private Predicate 
transformKeyPrefixesToPredicate(List<String> keyPrefixes) {
     return Predicates.startsWithAny(null, right);
   }
 
-  @Override
-  protected Map<String, HoodieRecord<HoodieMetadataPayload>> 
getRecordsByKeys(List<String> keys, String partitionName) {
-    if (keys.isEmpty()) {
-      return Collections.emptyMap();
+  private HoodiePairData<String, HoodieRecord<HoodieMetadataPayload>> 
doLookupWithMapping(
+          HoodieData<String> keys, String partitionName, List<FileSlice> 
fileSlices, boolean isSecondaryIndex,
+          Option<SerializableFunction<String, String>> keyEncodingFn) {
+
+    final int numFileSlices = fileSlices.size();
+    if (numFileSlices == 1) {
+      TreeSet<String> distinctSortedKeys = new TreeSet<>(keys.collectAsList());
+      return lookupRecordsWithMapping(partitionName, new 
ArrayList<>(distinctSortedKeys), fileSlices.get(0), !isSecondaryIndex, 
keyEncodingFn);
     }
 
-    Map<String, HoodieRecord<HoodieMetadataPayload>> result;
+    SerializableBiFunction<String, Integer, Integer> mappingFunction = 
HoodieTableMetadataUtil::mapRecordKeyToFileGroupIndex;

Review Comment:
   Hi Danny, regarding this commit 
https://github.com/apache/hudi/pull/13489/commits/d945e419cd07c0160dfe9ff60cca29a233eb9eea,
 let's chat when you get a chance.
   
   SI v2 hash func, unlike other indexes, needs to handle 2 types of input:
   1. sec$rec (this is what write path will give, in func like tagLocation) - 
we need to extract sec out and compute hash
   2. sec (this is what read path will give, in func lie readSecondaryIndex) - 
we can compute hash directly
   
   3 factors (is SI, is v2, is case 1 or case 2)
   
   This knowledge is known via either ways:
   1. from the call site, the caller knows which case it is handling, so it can 
choose to give hard coded "true"/"false" to indicate this fact
        - this is [my previous 
implementation](https://github.com/apache/hudi/pull/13489/commits/a16f8a4e456127711e87a0312f4684df4c917e30),
 on write path, it knows it is case 1, so we only need to see if it is v2 and 
it is SI. On read path, same for read path, we know it can only be case 2, so 
we only need to check if it is v2 and SI.
   2. or we explicitly checking each key, if there is an unescaped $, then it 
is case 1, else it is case 2
      - this incur some overhead on read path. That's why I didn't go with this 
route.
   
   There can never be a case where there is a mix keys from the 2 cases we need 
to handle. So the caller only need to decide once before apply it over all the 
given keys.
   
   
   
   you commit 
https://github.com/apache/hudi/pull/13489/commits/d945e419cd07c0160dfe9ff60cca29a233eb9eea
 implement the above logic in a wrong way. As we only consider the first 2 
factor (is SI, is v2), but the 3rd factor is missing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-9505] Index lookup feature [hudi]

Reply via email to