[jira] [Updated] (HUDI-4071) Better Spark Datasource default configs

Sagar Sumit (Jira) Wed, 11 May 2022 23:31:49 -0700


     [ https://issues.apache.org/jira/browse/HUDI-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sagar Sumit updated HUDI-4071:
------------------------------
    Description: 
Default configs should follow these guidelines:
 # Optimize for insert/bulk_insert. For example, if the default sort mode is 
NONE, a write is about as cheap as a plain Parquet write, plus some additional 
work for the meta columns. An extension of this is to keep a map of minimal 
optimized configs per operation type. This is partly related to the more 
performant configs tracked in HUDI-2151.
 # Make reasonable assumptions. For example, for the index type, the bloom 
filter index does not rely on any external system, so it is a better default 
candidate than, say, the HBase index.
 # Scout all configs declared with noDefaultValue and assign a default where 
necessary.
 # Keep the spark-sql and Spark datasource config keys the same as much as 
possible; otherwise it is operationally difficult for the user. Rename or reuse 
existing datasource keys that are meant for the same purpose. This is also 
related to HUDI-4070.
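
As a rough illustration of the "map of minimal optimized configs per 
operation type" idea, here is a minimal Python sketch. The config keys 
(hoodie.datasource.write.operation, hoodie.bulkinsert.sort.mode, 
hoodie.index.type) are existing Hudi options, but the helper name 
resolve_write_options, the grouping of defaults, and the fallback to upsert 
are assumptions made for this example, not the defaults decided in this issue.

```python
# Hypothetical sketch: per-operation default options that a user can override.
# The grouping and values below are illustrative assumptions only.

OPERATION_KEY = "hoodie.datasource.write.operation"

# Defaults applied to every write, whatever the operation.
COMMON_DEFAULTS = {
    # The bloom index needs no external system, unlike the HBase index.
    "hoodie.index.type": "BLOOM",
}

# Minimal extra defaults keyed by operation type.
PER_OPERATION_DEFAULTS = {
    "bulk_insert": {"hoodie.bulkinsert.sort.mode": "NONE"},
    "insert": {},
    "upsert": {},
}


def resolve_write_options(user_options):
    """Layer defaults under user options; explicit user values always win."""
    # Assumption for this sketch: fall back to upsert when no operation is set.
    operation = user_options.get(OPERATION_KEY, "upsert")
    resolved = dict(COMMON_DEFAULTS)
    resolved.update(PER_OPERATION_DEFAULTS.get(operation, {}))
    resolved.update(user_options)
    resolved[OPERATION_KEY] = operation
    return resolved
```

With PySpark, the resolved dictionary could then be passed as 
df.write.format("hudi").options(**resolve_write_options(opts)). Layering the 
user options last keeps every default overridable, which matters if the 
defaults change between releases.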

> Better Spark Datasource default configs
> ---------------------------------------
>
>                 Key: HUDI-4071
>                 URL: https://issues.apache.org/jira/browse/HUDI-4071
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Priority: Major
>
> Default configs should follow these guidelines:
>  # Optimize for insert/bulk_insert. For example, if the default sort mode is 
> NONE, a write is about as cheap as a plain Parquet write, plus some 
> additional work for the meta columns. An extension of this is to keep a map 
> of minimal optimized configs per operation type. This is partly related to 
> the more performant configs tracked in HUDI-2151.
>  # Make reasonable assumptions. For example, for the index type, the bloom 
> filter index does not rely on any external system, so it is a better default 
> candidate than, say, the HBase index.
>  # Scout all configs declared with noDefaultValue and assign a default where 
> necessary.
>  # Keep the spark-sql and Spark datasource config keys the same as much as 
> possible; otherwise it is operationally difficult for the user. Rename or 
> reuse existing datasource keys that are meant for the same purpose. This is 
> also related to HUDI-4070.



