[
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065030#comment-17065030
]
lamber-ken edited comment on HUDI-686 at 3/24/20, 5:41 AM:
-----------------------------------------------------------
Right, this is a nice design. Some thoughts:
* If the input data is large, the number of partitions needs to be increased,
because "candidates" holds all the data for each partition.
* If the number of partitions is increased, the same partition will be loaded
more than once (e.g. populateFileIDs() && populateRangeAndBloomFilters()).
[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
                                            JavaSparkContext jsc,
                                            HoodieTable<T> hoodieTable) {
  return recordRDD.sortBy((record) -> String.format("%s-%s",
          record.getPartitionPath(), record.getRecordKey()),
      true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyRangeBloomChecker(itr, hoodieTable))
      .flatMap(List::iterator)
      .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
      .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
      .filter(Option::isPresent)
      .map(Option::get);
}
{code}
{code:java}
private void initIfNeeded(String partitionPath) throws IOException {
  if (!Objects.equals(partitionPath, currentPartitionPath)) {
    cleanup();
    this.currentPartitionPath = partitionPath;
    populateFileIDs();
    populateRangeAndBloomFilters();
  }
}
{code}
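To make the second point concrete, here is a rough stand-alone simulation (not Hudi code). It assumes the sorted records are cut into equally sized contiguous slices, which only approximates how Spark range-partitions a sorted RDD, and it counts how often each partition path would pass the Objects.equals check in initIfNeeded(). The class PartitionReloadSketch and the method countLoads are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical simulation, not part of Hudi: estimates how many times each
// partition path would be (re)loaded when sorted records are split into slices.
class PartitionReloadSketch {

    // sortedPartitionPaths: one entry per record, already sorted by partition path.
    // numSlices: assumed number of equally sized contiguous slices (a stand-in
    // for the parallelism passed to sortBy).
    public static Map<String, Integer> countLoads(List<String> sortedPartitionPaths,
                                                  int numSlices) {
        Map<String, Integer> loads = new TreeMap<>();
        int n = sortedPartitionPaths.size();
        for (int s = 0; s < numSlices; s++) {
            int from = (int) ((long) n * s / numSlices);
            int to = (int) ((long) n * (s + 1) / numSlices);
            // Each slice starts with nothing loaded, like a fresh checker per task.
            String current = null;
            for (int i = from; i < to; i++) {
                String path = sortedPartitionPaths.get(i);
                // Same condition as initIfNeeded(): load only when the path changes.
                if (!Objects.equals(path, current)) {
                    current = path;
                    loads.merge(path, 1, Integer::sum);
                }
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            paths.add("2020/03/23");
        }
        for (int i = 0; i < 2; i++) {
            paths.add("2020/03/24");
        }
        System.out.println(countLoads(paths, 1)); // each path loaded once
        System.out.println(countLoads(paths, 4)); // the larger path is loaded by several slices
    }
}
```

With more slices, each slice holds fewer records, but a partition path that spans several slices is loaded once per slice, which is the trade-off between the two points above.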
> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Index, Performance
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png,
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced
> optimizations like auto-tuned parallelism/skew handling, but with a better
> out-of-box experience for small workloads.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)