[
https://issues.apache.org/jira/browse/HUDI-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-3091:
--------------------------------------
Sprint: Cont' improve - 2021/01/24, Cont' improve - 2021/01/31, Cont'
improve - 2022/02/07 (was: Cont' improve - 2021/01/24, Cont' improve -
2021/01/31)
> Make simple index as the default hoodie.index.type
> --------------------------------------------------
>
> Key: HUDI-3091
> URL: https://issues.apache.org/jira/browse/HUDI-3091
> Project: Apache Hudi
> Issue Type: New Feature
> Components: index
> Reporter: Vinoth Govindarajan
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Original Estimate: 1h
> Time Spent: 2h
> Remaining Estimate: 0h
>
> When performing upserts on derived datasets, we often ran into OOM errors
> with the bloom filter, so we changed the index type of all those datasets
> to simple to resolve the issue.
>
> Some of the tables were non-partitioned, and bloom index is not the right
> choice for such tables.
> I'm proposing to make simple index the default value; on a case-by-case
> basis, folks can still choose the bloom index for the additional performance
> gains offered by bloom filters.
>
> I agree that performance will not be optimal, but for regular use cases
> simple index would only give sub-optimal read/write performance; it would
> not break any ingestion/derived jobs.
>
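For readers who want to pin the index type explicitly rather than rely on whichever default ships, here is a minimal sketch of the write options involved. The helper name `hudi_upsert_options`, the table name, and the field names are hypothetical, and the allowed set covers only a subset of the values that `hoodie.index.type` accepts.

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        index_type="SIMPLE"):
    """Build Hudi Spark datasource write options with an explicit index type.

    Hypothetical helper for illustration; only the option keys are Hudi's.
    """
    # Subset of index types discussed in this issue (not the full list).
    allowed = {"SIMPLE", "BLOOM", "GLOBAL_SIMPLE", "GLOBAL_BLOOM"}
    index_type = index_type.upper()
    if index_type not in allowed:
        raise ValueError(f"unsupported index type in this sketch: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.index.type": index_type,
    }


# Table and field names below are made up for the example.
options = hudi_upsert_options("derived_events", "event_id", "updated_at")

# With a Spark session and the Hudi bundle available, usage would look like:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Setting the key explicitly per table keeps a job's behavior stable regardless of which default a given release uses, and lets a table opt back into bloom index by passing `index_type="BLOOM"`.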
>
> Tests to validate the flip:
> Trigger some ingestions (either Spark datasource or DeltaStreamer) with
> record keys having some timestamp characteristics.
> Updates: 5 to 10%.
> Dataset size: 100 GB.
> Measure index lookup time across bloom index and simple index.
>
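The comparison described above could be driven by a small timing harness along these lines. This is a rough sketch, not the project's benchmark: `run_upsert` is a hypothetical callable that performs one upsert with the given index type, and the harness measures wall-clock time for the whole call. Isolating the index lookup stage itself would likely require the Spark UI or job metrics instead.

```python
import time


def compare_index_types(run_upsert, index_types=("BLOOM", "SIMPLE"), rounds=3):
    """Time run_upsert(index_type) for each index type.

    run_upsert is a hypothetical, caller-supplied function that performs one
    upsert using the given index type. Returns the best (minimum) wall-clock
    duration in seconds per index type.
    """
    results = {}
    for index_type in index_types:
        durations = []
        for _ in range(rounds):
            start = time.perf_counter()
            run_upsert(index_type)
            durations.append(time.perf_counter() - start)
        # Keep the fastest round to reduce noise from warm-up and scheduling.
        results[index_type] = min(durations)
    return results
```

Each round should upsert the same batch against an identical copy of the table, otherwise later rounds see a different file layout than earlier ones.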
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)