[
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062761#comment-17062761
]
lamber-ken commented on HUDI-686:
---------------------------------
[~vinoth] thanks for bringing up this new idea. Here are some concerns to consider:
1. +candidates+ may cause OOM. We could increase the number of partitions
to work around it, but that would hurt the user experience, because users
would need to think about it.
{quote}List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
{quote}
2. +fileIDToBloomFilter+ is an external map that spills content to disk, so we
need to think about the serialization / deserialization performance.
{quote}this.fileIDToBloomFilter = new ExternalSpillableMap<>(1000000000L
...)
BloomFilter filter =
fileIDToBloomFilter.get(partitionFileIdPair.getRight());
{quote}
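To get a rough feel for the serialization / deserialization cost raised in point 2, a micro-measurement along these lines might help. It is only a sketch: it assumes plain Java serialization and uses +ToyFilter+, a hypothetical bit-set stand-in rather than Hudi's BloomFilter. ExternalSpillableMap may serialize values differently, so the numbers would be indicative at best.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.BitSet;

class SpillRoundTrip {

    /** Hypothetical stand-in for a bloom filter (single hash); NOT Hudi's BloomFilter. */
    static final class ToyFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        private final BitSet bits;
        private final int numBits;

        ToyFilter(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        void add(String key) {
            bits.set(Math.floorMod(key.hashCode(), numBits));
        }

        boolean mightContain(String key) {
            return bits.get(Math.floorMod(key.hashCode(), numBits));
        }
    }

    /** Serializes the filter to bytes, roughly what a spill to disk has to do. */
    static byte[] serialize(ToyFilter filter) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(filter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Rebuilds the filter from bytes, roughly what a get() on a spilled entry has to do. */
    static ToyFilter deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ToyFilter) ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ToyFilter filter = new ToyFilter(1 << 20); // arbitrary size chosen for the sketch
        for (int i = 0; i < 100_000; i++) {
            filter.add("key-" + i);
        }
        int rounds = 200;
        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            byte[] bytes = serialize(filter);
            totalBytes += bytes.length;
            ToyFilter copy = deserialize(bytes);
            if (!copy.mightContain("key-0")) {
                throw new IllegalStateException("round trip lost a key");
            }
        }
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("avg serialized size (bytes): " + totalBytes / rounds);
        System.out.println("avg round trip (us): " + elapsedMicros / rounds);
    }
}
```

If every lookup of a spilled entry pays a round trip like this once per record, that cost would be multiplied by the number of records, which is the part worth measuring against the real map.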
[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
protected List<Pair<HoodieRecord<T>, String>> computeNext() {
  List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
  if (inputItr.hasNext()) {
    HoodieRecord<T> record = inputItr.next();
    try {
      initIfNeeded(record.getPartitionPath());
    } catch (IOException e) {
      throw new HoodieIOException(
          "Error reading index metadata for " + record.getPartitionPath(), e);
    }
    indexFileFilter
        .getMatchingFilesAndPartition(record.getPartitionPath(),
            record.getRecordKey())
        .forEach(partitionFileIdPair -> {
          BloomFilter filter =
              fileIDToBloomFilter.get(partitionFileIdPair.getRight());
          if (filter.mightContain(record.getRecordKey())) {
            candidates.add(Pair.of(record, partitionFileIdPair.getRight()));
          }
        });
    if (candidates.size() == 0) {
      candidates.add(Pair.of(record, ""));
    }
  }
  return candidates;
}
{code}
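For the +candidates+ concern, one possible direction is to emit pairs lazily instead of collecting them into a list first. The sketch below is not what the linked branch does: +LazyCandidates+ and +candidatesFor+ are hypothetical names, record keys and file IDs are plain strings, and the +mightContain+ predicate stands in for the bloom filter lookup. It keeps the same fallback as the code above, emitting a pair with an empty file ID when nothing matches.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.BiPredicate;

class LazyCandidates {

    /**
     * Yields (recordKey, fileId) pairs one at a time for every file whose filter
     * might contain the key. If no file matches, yields a single (recordKey, "").
     * Hypothetical helper for illustration only.
     */
    static Iterator<Map.Entry<String, String>> candidatesFor(
            String recordKey,
            Iterator<String> fileIds,
            BiPredicate<String, String> mightContain) {
        return new Iterator<Map.Entry<String, String>>() {
            private Map.Entry<String, String> next;
            private boolean emittedAny = false;
            private boolean done = false;

            // Scans forward until a matching file is found or the input is exhausted.
            private void advance() {
                while (next == null && !done) {
                    if (fileIds.hasNext()) {
                        String fileId = fileIds.next();
                        if (mightContain.test(fileId, recordKey)) {
                            next = new SimpleEntry<>(recordKey, fileId);
                        }
                    } else {
                        done = true;
                        if (!emittedAny) {
                            // Same fallback as the eager version: empty file ID.
                            next = new SimpleEntry<>(recordKey, "");
                        }
                    }
                }
            }

            @Override
            public boolean hasNext() {
                advance();
                return next != null;
            }

            @Override
            public Map.Entry<String, String> next() {
                advance();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                Map.Entry<String, String> out = next;
                next = null;
                emittedAny = true;
                return out;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Map.Entry<String, String>> it = candidatesFor(
                "key-1",
                Arrays.asList("file-a", "file-b", "file-c").iterator(),
                (fileId, key) -> !fileId.equals("file-b")); // pretend file-b's filter rejects the key
        while (it.hasNext()) {
            Map.Entry<String, String> pair = it.next();
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

With this shape only one pair is held at a time, so memory would not depend on how many files match a record. Whether that matters in practice depends on how many files typically match per record, which I have not measured.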
> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Index, Performance
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
>
> The main goal here is to provide a much simpler index, without advanced
> optimizations like auto-tuned parallelism/skew handling, but with a better
> out-of-box experience for small workloads.