prashantwason commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r733446726



##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##########
@@ -120,65 +120,114 @@ private void initIfNeeded() {
   }
 
   @Override
-  protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKeyFromMetadata(String key, String partitionName) {
-    Pair<HoodieFileReader, HoodieMetadataMergedLogRecordScanner> readers = openReadersIfNeeded(key, partitionName);
+  protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKey(String key, String partitionName) {
+    return getRecordsByKeys(Collections.singletonList(key), partitionName).get(0).getValue();
+  }
+
+  protected List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys, String partitionName) {

Review comment:
       One requirement for the record-level index will be the ability to choose 
between a full scan and point lookups on demand. For example, when looking up a 
small number of keys in the record index (tagLocation), we would like to 
perform inline reads. But when reading a larger number of keys (e.g. a 
validator tool or a large backfill), a full scan may be better.
   
   As written, the code enforces either inline reads or a full scan for all 
partitions of the metadata table. A full scan will most probably not work for 
any dataset of decent size, because the log files will hold millions of 
record-level-index entries (from the last ingestion before compaction). The 
overhead of reading all these entries will be very high.
   
   Some options I can think of:
   1. Instead of a metadata-config option for inline/full scan, why not pass 
this info when readXXXFromMetadataTable is called? The caller has good context 
on which option will be better: a full scan or inline, seek-based reads.
   2. Decide automatically based on the number of keys passed in (relative to 
the number of keys in the hfile).
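
   To make the two options concrete, here is a rough sketch of how they could 
be combined. Everything in it is hypothetical: `ReadMode`, `resolve`, and the 
10% cut-over ratio are names and numbers I made up for illustration, not 
existing Hudi APIs or tuned values. The caller passes an explicit mode when it 
knows best (option 1), or `AUTO` to let the key count decide (option 2).

```java
/** Illustrative sketch only; none of these names exist in Hudi. */
class LookupStrategySketch {

    /** Hypothetical per-call read mode, passed by the caller instead of read from config. */
    enum ReadMode { POINT_LOOKUP, FULL_SCAN, AUTO }

    /** Assumed cut-over: prefer a full scan once the requested keys reach 10% of the hfile's keys. */
    static final double FULL_SCAN_KEY_RATIO = 0.10;

    /**
     * Resolves the mode for a single read call.
     * An explicit request always wins (option 1); AUTO compares the number
     * of requested keys with the number of keys in the hfile (option 2).
     */
    static ReadMode resolve(ReadMode requested, int numKeysRequested, long numKeysInHFile) {
        if (requested != ReadMode.AUTO) {
            return requested;
        }
        if (numKeysInHFile <= 0) {
            // Size unknown: fall back to point lookups, since a full scan of a
            // large record-level index is the expensive failure mode.
            return ReadMode.POINT_LOOKUP;
        }
        double ratio = numKeysRequested / (double) numKeysInHFile;
        return ratio >= FULL_SCAN_KEY_RATIO ? ReadMode.FULL_SCAN : ReadMode.POINT_LOOKUP;
    }

    public static void main(String[] args) {
        // tagLocation-style lookup: a handful of keys against a large index.
        System.out.println(resolve(ReadMode.AUTO, 10, 1_000_000L));
        // Validator or backfill: a large share of the index is requested.
        System.out.println(resolve(ReadMode.AUTO, 500_000, 1_000_000L));
        // The caller knows best and forces a full scan regardless of key count.
        System.out.println(resolve(ReadMode.FULL_SCAN, 1, 1_000_000L));
    }
}
```

   With something like this, getRecordsByKeys could take the mode as an extra 
argument, so each metadata partition and each caller picks its own strategy 
rather than sharing one table-wide setting.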
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.