This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7f125b6f310 [DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)
7f125b6f310 is described below
commit 7f125b6f3107fba9070f7e2c20fc58fbef564392
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Thu Feb 15 17:05:39 2024 -0800
[DOCS] Clarify release notes on duplicate handling in Spark SQL and relevant configs (#10680)
---
website/docs/configurations.md | 104 ++++++++++-----------
website/releases/release-0.14.0.md | 8 +-
.../version-0.14.0/configurations.md | 4 +-
.../version-0.14.1/configurations.md | 4 +-
4 files changed, 62 insertions(+), 58 deletions(-)
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 01ef8401954..18c3581e305 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -127,59 +127,59 @@ Options useful for writing tables via `write.format.option(...)`
[**Advanced Configs**](#Write-Options-advanced-configs)
-| Config Name | Default | Description [...]
-| ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------ [...]
-| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` [...]
-| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` [...]
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` [...]
-| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` [...]
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` [...]
-| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` [...]
-| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` [...]
-| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` [...]
-| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` [...]
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` [...]
-| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` [...]
-| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` [...]
-| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` [...]
-| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` [...]
-| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` [...]
-| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) | | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` [...]
-| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` [...]
-| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` [...]
-| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` [...]
-| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` [...]
-| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` [...]
-| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` [...]
-| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...]
-| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Co [...]
-| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...]
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching rec [...]
-| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...]
-| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...]
-| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...]
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` [...]
-| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...]
-| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non [...]
-| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COL [...]
-| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` [...]
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
-| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema recon [...]
-| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during [...]
-| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
-| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` [...]
-| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpo [...]
-| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separate [...]
-| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: [...]
-| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` [...]
-| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` [...]
-| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` [...]
+| Config Name | Default | Description [...]
+| ---------------------------------------------------------------------------------------------------- | ------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------ [...]
+| [hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties) | (N/A) | Serde properties to hive table.<br />`Config Param: HIVE_TABLE_SERDE_PROPERTIES` [...]
+| [hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties) | (N/A) | Additional properties to store with table.<br />`Config Param: HIVE_TABLE_PROPERTIES` [...]
+| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode) | (N/A) | Controls whether overwrite use dynamic or static mode, if not configured, respect spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br />`Since Version: 0.14.0` [...]
+| [hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete) | (N/A) | Comma separated list of partitions to delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE` [...]
+| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename) | (N/A) | Table name for the datasource write. Also used to register the table into meta stores.<br />`Config Param: TABLE_NAME` [...]
+| [hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable) | true | Controls whether async compaction should be turned on for MOR table writing.<br />`Config Param: ASYNC_COMPACT_ENABLE` [...]
+| [hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning) | false | Assume partitioning is yyyy/MM/dd<br />`Config Param: HIVE_ASSUME_DATE_PARTITION` [...]
+| [hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database) | true | Auto create hive database if does not exists<br />`Config Param: HIVE_AUTO_CREATE_DATABASE` [...]
+| [hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format) | PARQUET | Base file format for the sync.<br />`Config Param: HIVE_BASE_FILE_FORMAT` [...]
+| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num) | 1000 | The number of partitions one batch when synchronous partitions to hive.<br />`Config Param: HIVE_BATCH_SYNC_PARTITION_NUM` [...]
+| [hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync) | false | Whether sync hive metastore bucket specification when using bucket index.The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param: HIVE_SYNC_BUCKET_SYNC` [...]
+| [hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table) | false | Whether to sync the table as managed table.<br />`Config Param: HIVE_CREATE_MANAGED_TABLE` [...]
+| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase) | default | The name of the destination database that we should sync the hudi table to.<br />`Config Param: HIVE_DATABASE` [...]
+| [hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions) | false | Ignore exceptions when syncing with Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS` [...]
+| [hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class) | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values, default 'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param: HIVE_PARTITION_EXTRACTOR_CLASS` [...]
+| [hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields) | | Field in the table to use for determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS` [...]
+| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword) | hive | hive password to use<br />`Config Param: HIVE_PASS` [...]
+| [hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix) | false | Skip the _ro suffix for Read optimized table, when registering<br />`Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE` [...]
+| [hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp) | false | ‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE` [...]
+| [hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource) | true | <br />`Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE` [...]
+| [hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment) | false | Whether to sync the table column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT` [...]
+| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable) | unknown | The name of the destination table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE` [...]
+| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...]
+| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Co [...]
+| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...]
+| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | **Note** This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken [...]
+| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...]
+| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...]
+| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...]
+| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` [...]
+| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...]
+| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non [...]
+| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COL [...]
+| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we url encode the partition path value, before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` [...]
+| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how writer's schema will be selected based on the incoming batch's schema as well as existing table's one. When schema reconciliation is DISABLED, incoming batch's schema will be picked as a writer-schema (therefore updating table's schema). When schema recon [...]
+| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy -- based on the engine used. These merger impls will filter by hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient implementation to perform merging/combining of the records (during [...]
+| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` [...]
+| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used for HUDI to fetch the right checkpoint(`batch id` to be more specific) corresponding this writer. Please note that keep the identifier an unique value for different writer if under multi-writer scenario. If the value is not set, will only keep the checkpo [...]
+| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for MOR table, async compaction is enabled with spark streaming sink. By setting this config to true, we can disable it and the expectation is that, users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separate [...]
+| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch. Turning this on, could hide the write status errors while the spark checkpoint moves ahead.So, would recommend users to use this with caution.<br />`Config Param: [...]
+| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times streaming job should retry for a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` [...]
+| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (by millisecond) before a retry should issued for failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` [...]
+| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` [...]
| [hoodie.spark.sql.insert.into.operation](#hoodiesparksqlinsertintooperation) | insert | Sql write operation to use with INSERT_INTO spark sql command. This comes with 3 possible values, bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert. But bulk_insert may not do small file management. I [...]
-| [hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable) | true | Controls whether spark sql prepped update, delete, and merge are enabled.<br />`Config Param: SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0` [...]
-| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable) | false | When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param: SQL_ENABLE_BULK_INSERT` [...]
-| [hoodie.sql.insert.mode](#hoodiesqlinsertmode) | upsert | Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness [...]
-| [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records<br />`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0` [...]
-| [hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns) | false | When a non-nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.<br />`Conf [...]
+| [hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable) | true | Controls whether spark sql prepped update, delete, and merge are enabled.<br />`Config Param: SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0` [...]
+| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable) | false | When set to true, the sql insert statement will use bulk insert. This config is deprecated as of 0.14.0. Please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param: SQL_ENABLE_BULK_INSERT` [...]
+| [hoodie.sql.insert.mode](#hoodiesqlinsertmode) | upsert | Insert mode when insert data to pk-table. The optional modes are: upsert, strict and non-strict.For upsert mode, insert statement do the upsert operation for the pk-table which will update the duplicate record.For strict mode, insert statement will keep the primary key uniqueness [...]
+| [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by kafka client to deserialize the records<br />`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0` [...]
+| [hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns) | false | When a non-nullable column is missing from incoming batch during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the missing column be filled with null values to successfully complete the write operation.<br />`Conf [...]
---
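The `hoodie.spark.sql.insert.into.operation` entry in the table above replaces the deprecated `hoodie.sql.insert.mode` and `hoodie.sql.bulk.insert.enable` settings. A minimal, hypothetical Spark SQL sketch of using it (the table and column names below are invented for illustration and are not part of this commit):

```sql
-- Hypothetical sketch: choose the write operation used by INSERT INTO.
-- Per the table above, valid values are bulk_insert, insert and upsert;
-- bulk_insert is generally meant for initial loads.
SET hoodie.spark.sql.insert.into.operation=bulk_insert;

-- Placeholder table and column names.
INSERT INTO hudi_trips_tbl
SELECT uuid, rider, driver, fare, ts, city FROM staging_trips;
```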
diff --git a/website/releases/release-0.14.0.md b/website/releases/release-0.14.0.md
index 1efe426fe9e..266db7c0c2b 100644
--- a/website/releases/release-0.14.0.md
+++ b/website/releases/release-0.14.0.md
@@ -77,9 +77,13 @@ by the key generation policy, can only be ingested once into the target table.
With this addition, an older related configuration setting,
[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates),
-is now deprecated. The newer configuration will take precedence over the old one when both are specified. If no specific
+will be deprecated. The newer configuration will take precedence over the old one when both are specified. If no specific
configurations are provided, the default value for the newer configuration will be assumed. Users are strongly encouraged
-to migrate to the use of these newer configurations.
+to migrate to the use of these newer configurations when using Spark SQL.
+
+:::caution
+This is only applicable to Spark SQL writing.
+:::
#### Compaction with MOR table
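As a hedged illustration of the behavior the release note above describes (the session settings are grounded in the config reference, but the table and column names are hypothetical), the newer Spark SQL-only configuration could be set before an INSERT INTO:

```sql
-- Hypothetical sketch: enforce a dedup policy for Spark SQL INSERT INTO.
-- This applies to Spark SQL writing only; "none" is the default and "drop"
-- skips incoming records whose keys already exist in storage.
SET hoodie.spark.sql.insert.into.operation=insert;
SET hoodie.datasource.insert.dup.policy=drop;

-- Placeholder table and column names.
INSERT INTO hudi_trips_tbl
SELECT uuid, rider, driver, fare, ts, city FROM staging_trips;
```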
diff --git a/website/versioned_docs/version-0.14.0/configurations.md b/website/versioned_docs/version-0.14.0/configurations.md
index 435a71c1df4..9223915af52 100644
--- a/website/versioned_docs/version-0.14.0/configurations.md
+++ b/website/versioned_docs/version-0.14.0/configurations.md
@@ -161,11 +161,11 @@ Options useful for writing tables via `write.format.option(...)`
| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...]
| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Co [...]
| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...]
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching rec [...]
+| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | **Note** This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken [...]
| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...]
| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...]
| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...]
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` [...]
+| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` [...]
| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...]
| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non [...]
| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COL [...]
diff --git a/website/versioned_docs/version-0.14.1/configurations.md b/website/versioned_docs/version-0.14.1/configurations.md
index 01ef8401954..1efcce1f4cb 100644
--- a/website/versioned_docs/version-0.14.1/configurations.md
+++ b/website/versioned_docs/version-0.14.1/configurations.md
@@ -154,11 +154,11 @@ Options useful for writing tables via `write.format.option(...)`
| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc) | true | Use JDBC when hive synchronization is enabled<br />`Config Param: HIVE_USE_JDBC` [...]
| [hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format) | false | Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format<br />`Co [...]
| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername) | hive | hive user name to use<br />`Config Param: HIVE_USER` [...]
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken. Another option is to choose "drop", on which matching rec [...]
+| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy) | none | **Note** This is only applicable to Spark SQL writing.<br />When operation type is set to "insert", users can optionally enforce a dedup policy. This policy will be employed when records being ingested already exists in storage. Default policy is none and no action will be taken [...]
| [hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync) | false | If true, only sync on conditions like schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC` [...]
| [hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix) | _ | Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline<br />`Config Param: COMMIT_METADATA_KEYPREFIX` [...]
| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` [...]
-| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br />`Config Param: INSERT_DROP_DUPS` [...]
+| [hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates) | false | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. <br /> **Note** Just for Insert operation in Spark SQL writing since 0.14.0, users can switch to the config `hoodie.datasource.insert.dup.policy` [...]
| [hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass) | org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param: KEYGENERATOR_CLASS_NAME` [...]
| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non [...]
| [hoodie.datasource.write.new.columns.nullable](#hoodiedatasourcewritenewcolumnsnullable) | false | When a non-nullable column is added to datasource during a write operation, the write operation will fail schema compatibility check. Set this option to true will make the newly added column nullable to successfully complete the write operation.<br />`Config Param: MAKE_NEW_COL [...]