Davis-Zhang-Onehouse commented on code in PR #13489:
URL: https://github.com/apache/hudi/pull/13489#discussion_r2178382296
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -277,61 +268,291 @@ private Predicate
transformKeyPrefixesToPredicate(List<String> keyPrefixes) {
return Predicates.startsWithAny(null, right);
}
- @Override
- protected Map<String, HoodieRecord<HoodieMetadataPayload>>
getRecordsByKeys(List<String> keys, String partitionName) {
- if (keys.isEmpty()) {
- return Collections.emptyMap();
+ private HoodiePairData<String, HoodieRecord<HoodieMetadataPayload>>
doLookupWithMapping(
+ HoodieData<String> keys, String partitionName, List<FileSlice>
fileSlices, boolean isSecondaryIndex,
+ Option<SerializableFunction<String, String>> keyEncodingFn) {
+
+ final int numFileSlices = fileSlices.size();
+ if (numFileSlices == 1) {
+ TreeSet<String> distinctSortedKeys = new TreeSet<>(keys.collectAsList());
+ return lookupRecordsWithMapping(partitionName, new
ArrayList<>(distinctSortedKeys), fileSlices.get(0), !isSecondaryIndex,
keyEncodingFn);
}
- Map<String, HoodieRecord<HoodieMetadataPayload>> result;
+ SerializableBiFunction<String, Integer, Integer> mappingFunction =
HoodieTableMetadataUtil::mapRecordKeyToFileGroupIndex;
Review Comment:
Danny this is another tech debt and I would like to hear from you and vc
the hash function needs to check:
- index type
- index definition
Then for SI v2, we have 2 cases requires different implementation:
- SI write path concatenate secKey$recordKey, but we only hash on secKey, so
we need to extract secKey, then hash it
- SI read path give secKey only, we don't need to extract
I don't want to give 1 hash func and check for unescaped $ and decide which
case it is. Since hash func is applied per record, looking for unescaped $
impacts perf non-trivially. That's why there is a 3rd and ugly parameter
"shouldExtractSecondaryKeyForHashing" where caller directly tells if it is read
case or write case.
The function must take some context.
Overall I want all the place to use the same getHashFunction consistently.
It should consider index version and MDT partition type and extra context to
decide. I want to avoid hard coded hash function implementation.
With this context, in commit "fix hash function" I fixed broken
implementation. I can wrap shouldExtractSecondaryKeyForHashing in a context.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]