manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r789072260
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##########
@@ -101,4 +116,34 @@ public static HoodieRecord getTaggedRecord(HoodieRecord inputRecord, Option<Hood
}
return record;
}
+
+ /**
+ * Given a list of row keys and one file, return only row keys existing in that file.
+ *
+ * @param filePath - File to filter keys from
+ * @param candidateRecordKeys - Candidate keys to filter
+ * @return List of candidate keys that are available in the file
+ */
+ public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
+ Configuration configuration) throws HoodieIndexException {
+ ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath));
+ List<String> foundRecordKeys = new ArrayList<>();
+ try {
+ // Load all rowKeys from the file, to double-confirm
+ if (!candidateRecordKeys.isEmpty()) {
+ HoodieTimer timer = new HoodieTimer().startTimer();
+ HoodieFileReader fileReader = HoodieFileReaderFactory.getFileReader(configuration, filePath);
+ Set<String> fileRowKeys = fileReader.filterKeys(new TreeSet<>(candidateRecordKeys));
+ foundRecordKeys.addAll(fileRowKeys);
+ LOG.info(String.format("Checked keys against file %s, in %d ms. #candidates (%d) #found (%d)", filePath,
+ timer.endTimer(), candidateRecordKeys.size(), foundRecordKeys.size()));
+ if (LOG.isDebugEnabled()) {
Review comment:
You will see this pattern of checking LOG.isDebugEnabled() before calling the
debug log in many places. It avoids building the log arguments when debug
logging is disabled, which matters when those arguments are heavy, such as a
very long list of strings, records, or files. Skipping that construction saves
memory and improves performance.
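
A minimal sketch of the guard, to make the cost difference concrete. It assumes java.util.logging as a stand-in for the logger used in Hudi (so `isLoggable(Level.FINE)` plays the role of `isDebugEnabled()`), and `DebugGuardSketch`, `describe`, `setDebugEnabled`, and the call counter are hypothetical names introduced only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugGuardSketch {
  static final Logger LOG = Logger.getLogger(DebugGuardSketch.class.getName());

  // Counts how often the expensive argument is built, so the guard's effect is visible.
  static int describeCalls = 0;

  // Hypothetical stand-in for a heavy log argument, e.g. joining a long list of keys.
  static String describe(List<String> keys) {
    describeCalls++;
    return String.join(",", keys);
  }

  // Hypothetical helper: toggles debug-level (FINE) logging on this logger.
  static void setDebugEnabled(boolean enabled) {
    LOG.setLevel(enabled ? Level.FINE : Level.INFO);
  }

  static void logUnguarded(List<String> keys) {
    // The argument string is built before the logger decides whether to emit it.
    LOG.fine("Keys: " + describe(keys));
  }

  static void logGuarded(List<String> keys) {
    // The argument string is built only when debug-level logging is enabled.
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("Keys: " + describe(keys));
    }
  }

  public static void main(String[] args) {
    setDebugEnabled(false);
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      keys.add("key-" + i);
    }

    logUnguarded(keys);
    System.out.println("describe() calls after unguarded log: " + describeCalls);

    logGuarded(keys);
    System.out.println("describe() calls after guarded log: " + describeCalls);
  }
}
```

With debug logging off, the unguarded call still pays for building the string, while the guarded call skips it entirely. The guard is only worth the extra lines when the arguments are expensive to construct; for a constant message it adds nothing.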