[
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068837#comment-17068837
]
Vinoth Chandar commented on HUDI-686:
-------------------------------------
>If the input data is large, the number of partitions needs to be increased;
>"candidates" contains all the data for each partition
No, "candidates" only contains the candidate files per key.
>If the number of partitions is increased, the same partition will be loaded
>more than once (e.g. populateFileIDs() && populateRangeAndBloomFilters())
It will. That's why we auto-tune everything in BloomIndexV1, but that in turn
needs some memory caching. The idea here is to make this work well for simpler
cases and to offer an option that does not rely on memory caching.
> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Index, Performance
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png,
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced
> optimizations like auto-tuned parallelism/skew handling, but with a better
> out-of-box experience for small workloads.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)