Re: [PR] [HUDI-9505] Index lookup feature [hudi]

via GitHub Tue, 01 Jul 2025 18:06:50 -0700


Davis-Zhang-Onehouse commented on code in PR #13489:
URL: https://github.com/apache/hudi/pull/13489#discussion_r2178382296



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -277,61 +268,291 @@ private Predicate 
transformKeyPrefixesToPredicate(List<String> keyPrefixes) {
     return Predicates.startsWithAny(null, right);
   }
 
-  @Override
-  protected Map<String, HoodieRecord<HoodieMetadataPayload>> 
getRecordsByKeys(List<String> keys, String partitionName) {
-    if (keys.isEmpty()) {
-      return Collections.emptyMap();
+  private HoodiePairData<String, HoodieRecord<HoodieMetadataPayload>> 
doLookupWithMapping(
+          HoodieData<String> keys, String partitionName, List<FileSlice> 
fileSlices, boolean isSecondaryIndex,
+          Option<SerializableFunction<String, String>> keyEncodingFn) {
+
+    final int numFileSlices = fileSlices.size();
+    if (numFileSlices == 1) {
+      TreeSet<String> distinctSortedKeys = new TreeSet<>(keys.collectAsList());
+      return lookupRecordsWithMapping(partitionName, new 
ArrayList<>(distinctSortedKeys), fileSlices.get(0), !isSecondaryIndex, 
keyEncodingFn);
     }
 
-    Map<String, HoodieRecord<HoodieMetadataPayload>> result;
+    SerializableBiFunction<String, Integer, Integer> mappingFunction = 
HoodieTableMetadataUtil::mapRecordKeyToFileGroupIndex;

Review Comment:
   Danny this is another tech debt and I would like to hear from you and vc
   
   the hash function needs to check:
   - index type
   - index definition
   
   Then for SI v2, we have 2 cases requires different implementation:
   - SI write path concatenate secKey$recordKey, but we only hash on secKey, so 
we need to extract secKey, then hash it
   - SI read path give secKey only, we don't need to extract
   
   I don't want to give 1 hash func and check for unescaped $ and decide which 
case it is. Since hash func is applied per record, looking for unescaped $ 
impacts perf non-trivially. That's why there is a 3rd and ugly parameter 
"shouldExtractSecondaryKeyForHashing" where caller directly tells if it is read 
case or write case.
   
   The function must take some context. 
   
   Overall I want all the place to use the same getHashFunction consistently. 
It should consider index version and MDT partition type and extra context to 
decide. I want to avoid hard coded hash function implementation.
   
   With this context, in commit "fix hash function" I fixed broken 
implementation. I can wrap shouldExtractSecondaryKeyForHashing in a context. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-9505] Index lookup feature [hudi]

Reply via email to