prashantwason commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r733446726
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##########
@@ -120,65 +120,114 @@ private void initIfNeeded() {
}
@Override
- protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKeyFromMetadata(String key, String partitionName) {
- Pair<HoodieFileReader, HoodieMetadataMergedLogRecordScanner> readers = openReadersIfNeeded(key, partitionName);
+ protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKey(String key, String partitionName) {
+ return getRecordsByKeys(Collections.singletonList(key), partitionName).get(0).getValue();
+ }
+
+ protected List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys, String partitionName) {
Review comment:
One requirement for the record-level index will be the ability to choose
between a full scan and point lookups on demand. For example, when looking up
a small number of keys from the record index (tagLocation), we would like to
perform inline reads. But when reading a larger number of keys (e.g. a
validator tool or a large backfill), a full scan may be better.
As written, the code enforces either inline reads or a full scan for all
partitions of the metadata table. A full scan will most likely not work for any
decently sized dataset, because the log files will contain millions of
record-level-index entries (from the last ingestion before compaction). The
overhead of reading all of these entries will be very high.
Some options I can think of:
1. Instead of a metadata-config option for inline read vs. full scan, why not
pass this information when readXXXFromMetadataTable is called? The caller has
good context on which option will be better: a full scan or inline seek-based
lookups.
2. Decide automatically based on the number of keys passed in (relative to the
number of keys in the HFile).
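
To make the two options concrete, here is a rough sketch of how they could be combined. Everything in it is hypothetical and not part of Hudi: the `LookupModeChooser` class, the `LookupMode` enum, the `fullScanThreshold` parameter, and the assumption that the number of keys in the base file is available to the reader. Falling back to point lookups when no key count is known is also just one possible choice.

```java
import java.util.List;

/** Hypothetical sketch only; none of these names exist in Hudi. */
final class LookupModeChooser {

    /** How the metadata-table reader should fetch the requested keys. */
    enum LookupMode { POINT_LOOKUP, FULL_SCAN }

    // Assumed cutoff: fraction of base-file keys at or above which a full scan is chosen.
    private final double fullScanThreshold;

    LookupModeChooser(double fullScanThreshold) {
        if (fullScanThreshold <= 0.0 || fullScanThreshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in (0, 1]");
        }
        this.fullScanThreshold = fullScanThreshold;
    }

    /** Option 2: decide from the ratio of requested keys to keys in the base file. */
    LookupMode choose(List<String> keys, long keysInBaseFile) {
        if (keysInBaseFile <= 0) {
            // No usable key count (assumption: treat as unknown) -> prefer point lookups.
            return LookupMode.POINT_LOOKUP;
        }
        double ratio = (double) keys.size() / keysInBaseFile;
        return ratio >= fullScanThreshold ? LookupMode.FULL_SCAN : LookupMode.POINT_LOOKUP;
    }

    /** Option 1: an explicit caller hint wins; a null hint means "decide automatically". */
    LookupMode resolve(LookupMode callerHint, List<String> keys, long keysInBaseFile) {
        return callerHint != null ? callerHint : choose(keys, keysInBaseFile);
    }
}
```

With something like this, tagLocation could pass `POINT_LOOKUP` explicitly, a validator tool could pass `FULL_SCAN`, and callers without a preference could pass `null` and let the key-count ratio decide.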