manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r796980301
##########
File path:
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##########
@@ -134,22 +139,41 @@ public BloomFilter readBloomFilter() {
}
@Override
- public Set<String> filterRowKeys(Set candidateRowKeys) {
- // Current implementation reads all records and filters them. In certain cases, it may be better to:
- // 1. Scan a limited subset of keys (min/max range of candidateRowKeys)
- // 2. Lookup keys individually (if the size of candidateRowKeys is much less than the total keys in file)
- try {
- List<Pair<String, R>> allRecords = readAllRecords();
- Set<String> rowKeys = new HashSet<>();
- allRecords.forEach(t -> {
- if (candidateRowKeys.contains(t.getFirst())) {
- rowKeys.add(t.getFirst());
- }
- });
- return rowKeys;
- } catch (IOException e) {
- throw new HoodieIOException("Failed to read row keys from " + path, e);
+ public Set<String> filterRowKeys(Set<String> candidateRowKeys) {
+ return candidateRowKeys.stream().filter(k -> {
+ try {
+ return isKeyAvailable(k);
Review comment:
Right, HFile has a more performant seekTo() API where we can check the
availability of the key without fetching the records back. Added the
Javadoc on sorted-keys performance.
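The seek-based existence check described above can be sketched with a hypothetical sorted key index standing in for the HFile reader. `SortedKeyIndex` and its `isKeyAvailable` method here are illustrative stand-ins, not Hudi's actual `HoodieHFileReader` internals; the real implementation delegates to HBase's `HFileScanner.seekTo()` rather than a binary search over an in-memory array.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical stand-in for an HFile-backed reader: a sorted key index
// supporting a seek-style existence check, analogous to what
// HFileScanner.seekTo() provides. Illustrative only, not a Hudi API.
class SortedKeyIndex {
  private final String[] sortedKeys;

  SortedKeyIndex(String[] sortedKeys) {
    // Assumed already sorted, as keys in an HFile are.
    this.sortedKeys = sortedKeys;
  }

  // Seek to the key without materializing the record; O(log n) per probe.
  boolean isKeyAvailable(String key) {
    return Arrays.binarySearch(sortedKeys, key) >= 0;
  }

  // Mirrors the shape of the new filterRowKeys(): probe each candidate
  // individually instead of scanning and deserializing every record.
  Set<String> filterRowKeys(Set<String> candidateRowKeys) {
    return candidateRowKeys.stream()
        .filter(this::isKeyAvailable)
        .collect(Collectors.toCollection(TreeSet::new));
  }
}
```

This is why sorting the candidate keys helps in the real reader: successive `seekTo()` calls on a forward-only scanner stay cheap when the probes arrive in key order.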
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]