[jira] [Updated] (HUDI-3091) Make simple index as the default hoodie.index.type

sivabalan narayanan (Jira) Tue, 01 Feb 2022 07:08:06 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-3091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-3091:
--------------------------------------
    Description: 
When performing upserts on derived datasets, we often run into out-of-memory 
(OOM) errors with the bloom filter, so we changed the index type of all these 
datasets to simple to resolve the issue.

 

Some of the tables were non-partitioned, and the bloom index is not the right 
choice for such tables.

I'm proposing to make the simple index the default value. On a case-by-case 
basis, folks can still choose the bloom index for the additional performance 
gains offered by bloom filters.
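
As a rough sketch of the per-table choice described above, assuming writes go 
through the Spark datasource: the index type is selected with the 
hoodie.index.type write option. The helper name hudi_index_options and the 
table names are hypothetical and used only for illustration; SIMPLE and BLOOM 
are existing values of this option.

```python
# Hypothetical helper: builds Hudi write options with an explicit index type.
# Only a subset of the index types Hudi supports is listed here.
KNOWN_INDEX_TYPES = {"SIMPLE", "BLOOM", "GLOBAL_SIMPLE", "GLOBAL_BLOOM"}


def hudi_index_options(table_name, index_type="SIMPLE"):
    """Return upsert write options, defaulting to the simple index."""
    index_type = index_type.upper()
    if index_type not in KNOWN_INDEX_TYPES:
        raise ValueError(f"unexpected index type: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.index.type": index_type,
    }


# Proposed default: simple index.
default_opts = hudi_index_options("derived_table")

# Case-by-case opt-in to the bloom index for a table where it pays off.
bloom_opts = hudi_index_options("partitioned_table", index_type="bloom")

# Assumed usage with PySpark (not executed here):
#   df.write.format("hudi").options(**default_opts).mode("append").save(base_path)
```

Setting the option explicitly per table keeps a job's behaviour stable no 
matter which default a given Hudi release ships with.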

 

I agree that performance will not be optimal, but for regular use cases the 
simple index gives sub-optimal read/write performance without breaking any 
ingestion or derived jobs.

 

 

Tests to validate the flip:

Trigger some ingestions (either Spark datasource or DeltaStreamer) with record 
keys that have some timestamp characteristics.

Updates: 5 to 10%.

Dataset size: 100 GB.

Measure index lookup time with the bloom index and with the simple index.
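
The workload shape above could be sketched roughly as follows. This only 
generates timestamp-prefixed record keys and picks the subset to update; the 
key layout, the helper names make_record_keys and pick_update_keys, and the 
sizes are assumptions for illustration, and the actual ingestion and timing 
would still run through the Spark datasource or DeltaStreamer.

```python
import random


def make_record_keys(n, start_ts_ms=1_643_700_000_000, step_ms=1000):
    """Build n record keys prefixed with an increasing epoch-millis timestamp."""
    return [f"{start_ts_ms + i * step_ms}_{i:08d}" for i in range(n)]


def pick_update_keys(keys, update_fraction, seed=0):
    """Sample the given fraction of existing keys to act as updates."""
    if not 0.0 < update_fraction <= 1.0:
        raise ValueError("update_fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so both index runs see the same batch
    count = max(1, round(len(keys) * update_fraction))
    return rng.sample(keys, count)


keys = make_record_keys(10_000)
updates_5pct = pick_update_keys(keys, 0.05)
updates_10pct = pick_update_keys(keys, 0.10)
```

Reusing the same seed for the bloom index run and the simple index run keeps 
the update batch identical, so the lookup times are compared on the same input.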

 

 

 

 


> Make simple index as the default hoodie.index.type
> --------------------------------------------------
>
>                 Key: HUDI-3091
>                 URL: https://issues.apache.org/jira/browse/HUDI-3091
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: index
>            Reporter: Vinoth Govindarajan
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
