This is an automated email from the ASF dual-hosted git repository.
danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 98f2ca487bd [HUDI-6854][DOCS] Change default payload type to HOODIE_AVRO_DEFAULT (#11551)
98f2ca487bd is described below
commit 98f2ca487bd53eb4edd05187a9a7a7d58140db79
Author: Vova Kolmakov <[email protected]>
AuthorDate: Tue Jul 2 07:28:38 2024 +0700
[HUDI-6854][DOCS] Change default payload type to HOODIE_AVRO_DEFAULT (#11551)
---
website/docs/basic_configurations.md | 63 ++++++-------
website/docs/configurations.md | 173 ++++++++++++++++++-----------------
2 files changed, 120 insertions(+), 116 deletions(-)
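In short: the two docs pages below now list org.apache.hudi.common.model.DefaultHoodieRecordPayload and the new HOODIE_AVRO_DEFAULT payload type as the documented defaults, in place of OverwriteWithLatestAvroPayload. For illustration only, a minimal Spark (Scala) sketch of pinning the payload class explicitly on a datasource write rather than relying on the documented default; it assumes the Hudi Spark bundle is on the classpath, and the DataFrame `df`, the column names, and the path are hypothetical, not part of this commit:

    // Hypothetical upsert; pins the payload class explicitly instead of
    // relying on the documented default (DefaultHoodieRecordPayload).
    df.write.format("hudi").
      option("hoodie.table.name", "trips").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.precombine.field", "ts").
      option("hoodie.datasource.write.payload.class",
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload").
      mode(org.apache.spark.sql.SaveMode.Append).
      save("/tmp/hudi_trips")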
diff --git a/website/docs/basic_configurations.md b/website/docs/basic_configurations.md
index 9d579738ca9..08d5ac717d9 100644
--- a/website/docs/basic_configurations.md
+++ b/website/docs/basic_configurations.md
@@ -1,7 +1,7 @@
---
title: Basic Configurations
summary: This page covers the basic configurations you may use to write/read Hudi tables. This page only features a subset of the most frequently used configurations. For a full list of all configs, please visit the [All Configurations](/docs/configurations) page.
-last_modified_at: 2024-06-06T12:59:56.064
+last_modified_at: 2024-07-01T15:09:57.63
---
@@ -33,36 +33,37 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t
[**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
-| Config Name | Default | Description [...]
-| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
-| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
-| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
-| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
-| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
-| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...]
-| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
-| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
-| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
-| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
-| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
-| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
-| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
-| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
-| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
-| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
-| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
-| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
-| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
-| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...]
-| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
-| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...]
-| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...]
-| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
+| Config Name | Default | Description [...]
+| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
+| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
+| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
+| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...]
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
+| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
+| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. Useful to create a HoodieRecord over existing Gener [...]
+| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
+| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
+| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...]
+| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...]
+| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...]
+| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
---
## Spark Datasource Configs {#SPARK_DATASOURCE}
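As the table above notes, table-level settings such as hoodie.compaction.payload.class are recorded in hoodie.properties and validated via hoodie.table.checksum on read. A small Scala sketch for checking the effective payload class of an existing table; the base path is hypothetical and the meta folder name (.hoodie) is the usual convention, not something this commit defines:

    import java.io.FileInputStream
    import java.util.Properties

    // Load the table config persisted at creation time (path is hypothetical).
    val props = new Properties()
    val in = new FileInputStream("/tmp/hudi_trips/.hoodie/hoodie.properties")
    try props.load(in) finally in.close()
    println(props.getProperty("hoodie.compaction.payload.class"))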
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 278be1f5afa..a6814502cdc 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -5,7 +5,7 @@ permalink: /docs/configurations.html
summary: This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels.
toc_min_heading_level: 2
toc_max_heading_level: 4
-last_modified_at: 2024-06-06T12:59:56.026
+last_modified_at: 2024-07-01T15:09:57.588
---
@@ -54,36 +54,37 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t
[**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
-| Config Name | Default | Description [...]
-| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
-| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
-| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
-| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
-| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
-| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
-| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...]
-| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
-| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
-| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
-| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
-| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
-| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
-| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
-| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
-| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
-| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
-| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
-| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
-| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
-| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...]
-| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
-| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...]
-| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...]
-| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
+| Config Name | Default | Description [...]
+| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
+| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query.If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
+| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
+| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br />`Config Param: NAME` [...]
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
+| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
+| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap a existing Hoodie Avro Record. Useful to create a HoodieRecord over existing Gener [...]
+| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which has the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
+| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
+| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image [...]
+| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br />`Config Param: TIMELINE_TIMEZONE` [...]
+| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br />`Config Param: TYPE` [...]
+| [hoodie.table.version](#hoodietableversion) | ZERO | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
[**Advanced Configs**](#Hudi-Table-Basic-Configs-advanced-configs)
@@ -181,58 +182,59 @@ Options useful for writing tables via `write.format.option(...)`
[**Advanced Configs**](#Write-Options-advanced-configs)
-| Config Name
| Default
| Description
[...]
-|
------------------------------------------------------------------------------------------------------------------------------------------------
| ------------------------------------------------------------ |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-|
[hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties)
| (N/A)
| Serde properties to hive table.<br
/>`Config Param: HIVE_TABLE_SERDE_PROPERTIES`
[...]
-|
[hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties)
| (N/A)
| Additional properties to store with
table.<br />`Config Param: HIVE_TABLE_PROPERTIES`
[...]
-| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode)
| (N/A)
| Controls whether overwrite
use dynamic or static mode, if not configured, respect
spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br
/>`Since Version: 0.14.0`
[...]
-|
[hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete)
| (N/A)
| Comma separated list of partitions to
delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE`
[...]
-| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename)
| (N/A)
| Table name for the
datasource write. Also used to register the table into meta stores.<br
/>`Config Param: TABLE_NAME`
[...]
-|
[hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable)
| true
| Controls whether async
compaction should be turned on for MOR table writing.<br />`Config Param:
ASYNC_COMPACT_ENABLE`
[...]
-|
[hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning)
| false
| Assume partitioning is yyyy/MM/dd<br />`Config Param:
HIVE_ASSUME_DATE_PARTITION`
[...]
-|
[hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database)
| true
| Auto create hive database if does not exists<br
/>`Config Param: HIVE_AUTO_CREATE_DATABASE`
[...]
-|
[hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format)
| PARQUET
| Base file format for the sync.<br
/>`Config Param: HIVE_BASE_FILE_FORMAT`
[...]
-| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num)
| 1000
| The number of partitions
one batch when synchronous partitions to hive.<br />`Config Param:
HIVE_BATCH_SYNC_PARTITION_NUM`
[...]
-|
[hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync)
| false
| Whether sync hive metastore
bucket specification when using bucket index.The specification is 'CLUSTERED BY
(trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param:
HIVE_SYNC_BUCKET_SYNC`
[...]
-|
[hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table)
| false
| Whether to sync the table as managed table.<br
/>`Config Param: HIVE_CREATE_MANAGED_TABLE`
[...]
-| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase)
| default
| The name of the
destination database that we should sync the hudi table to.<br />`Config Param:
HIVE_DATABASE`
[...]
-|
[hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions)
| false
| Ignore exceptions when syncing with
Hive.<br />`Config Param: HIVE_IGNORE_EXCEPTIONS`
[...]
-|
[hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class)
|
org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which
implements PartitionValueExtractor to extract the partition values, default
'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param:
HIVE_PARTITION_EXTRACTOR_CLASS`
[...]
-|
[hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields)
|
| Field in the table to use for
determining hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS`
[...]
-| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword)
| hive
| hive password to use<br
/>`Config Param: HIVE_PASS`
[...]
-|
[hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix)
| false
| Skip the _ro suffix for Read
optimized table, when registering<br />`Config Param:
HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE`
[...]
-|
[hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp)
| false
| ‘INT64’ with original type
TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for
backward compatibility. NOTE: On Spark entrypoints, this is defaulted to
TRUE<br />`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE`
[...]
-|
[hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource)
| true
| <br />`Config Param:
HIVE_SYNC_AS_DATA_SOURCE_TABLE`
[...]
-|
[hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment)
| false
| Whether to sync the table
column comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT`
[...]
-| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable)
| unknown
| The name of the
destination table that we should sync the hudi table to.<br />`Config Param:
HIVE_TABLE`
[...]
-| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc)
| true
| Use JDBC when hive
synchronization is enabled<br />`Config Param: HIVE_USE_JDBC`
[...]
-|
[hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format)
| false
| Flag to choose InputFormat under com.uber.hoodie package
instead of org.apache.hudi package. Use this when you are in the process of
migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you
migrated the table definition to org.apache.hudi input format<br />`Co [...]
-| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername)
| hive
| hive user name to use<br
/>`Config Param: HIVE_USER`
[...]
-| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy)
| none
| **Note** This is only
applicable to Spark SQL writing.<br />When operation type is set to
"insert", users can optionally enforce a dedup policy. This policy will be
employed when records being ingested already exists in storage. Default policy
is none and no action will be [...]
-|
[hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync)
| false
| If true, only sync on conditions
like schema change or partition change.<br />`Config Param:
HIVE_CONDITIONAL_SYNC`
[...]
-|
[hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix)
| _
| Option keys beginning with this prefix,
are automatically added to the commit/deltacommit metadata. This is useful to
store checkpointing information, in a consistent way with the hudi timeline<br
/>`Config Param: COMMIT_METADATA_KEYPREFIX`
[...]
-|
[hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns)
| false
| When set to true, will not write the
partition columns into hudi. By default, false.<br />`Config Param:
DROP_PARTITION_COLUMNS`
[...]
-|
[hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates)
| false
| If set to true, records from the incoming
dataframe will not overwrite existing records with the same key during the
write operation. <br /> **Note** Just for Insert operation in Spark SQL
writing since 0.14.0, users can switch to the config
`hoodie.datasource.insert.dup.po [...]
-|
[hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass)
|
org.apache.hudi.keygen.SimpleKeyGenerator | Key generator
class, that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config
Param: KEYGENERATOR_CLASS_NAME`
[...]
-|
[hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled)
| false | When set to
true, consistent value will be generated for a logical timestamp type column,
like timestamp-millis and timestamp-micros, irrespective of whether row-writer
is enabled. Disabled by default so as not to break the pipeline that deploy
either fully row-writer path or non [...]
-|
[hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode)
| false
| Should we url encode the partition path
value, before creating the folder structure.<br />`Config Param:
URL_ENCODE_PARTITIONING`
[...]
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass)
|
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class
used. Override this, if you like to roll your own merge logic, when
upserting/inserting. This will render any value set for
PRECOMBINE_FIELD_OPT_VAL in-effective<br />`Config Param: PAYLOAD_CLASS_NAME`
[...]
-|
[hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema)
| false
| This config controls how
writer's schema will be selected based on the incoming batch's schema as well
as existing table's one. When schema reconciliation is DISABLED, incoming
batch's schema will be picked as a writer-schema (therefore updating table's
schema). When schema recon [...]
-|
[hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls)
|
org.apache.hudi.common.model.HoodieAvroRecordMerger | List of
HoodieMerger implementations constituting Hudi's merging strategy -- based on
the engine used. These merger impls will filter by
hoodie.datasource.write.record.merger.strategy Hudi will pick most efficient
implementation to perform merging/combining of the records (during [...]
-|
[hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy)
|
eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of merger
strategy. Hudi will pick HoodieRecordMerger implementations in
hoodie.datasource.write.record.merger.impls which has the same merger strategy
id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0`
[...]
-|
[hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable)
| true
| When set to true, will perform
write operations directly using the spark native `Row` representation, avoiding
any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER`
[...]
-|
[hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier)
| default_single_writer
| A stream identifier used for HUDI to fetch the right
checkpoint(`batch id` to be more specific) corresponding this writer. Please
note that keep the identifier an unique value for different writer if under
multi-writer scenario. If the value is not set, will only keep the checkpo [...]
-|
[hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction)
| false
| By default for MOR table, async compaction is enabled
with spark streaming sink. By setting this config to true, we can disable it
and the expectation is that, users will schedule and execute compaction in a
different process/job altogether. Some users may wish to run it separate [...]
-|
[hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch)
| false
| Config to indicate whether to ignore any non exception
error (e.g. writestatus error) within a streaming microbatch. Turning this on,
could hide the write status errors while the spark checkpoint moves ahead.So,
would recommend users to use this with caution.<br />`Config Param: [...]
-|
[hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount)
| 3
| Config to indicate how many times
streaming job should retry for a failed micro batch.<br />`Config Param:
STREAMING_RETRY_CNT`
[...]
-|
[hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms)
| 2000
| Config to indicate how long (by millisecond)
before a retry should issued for failed microbatch<br />`Config Param:
STREAMING_RETRY_INTERVAL_MS`
[...]
-| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass)
|
org.apache.hudi.hive.HiveSyncTool | Sync tool class
name used to sync to metastore. Defaults to Hive.<br />`Config Param:
META_SYNC_CLIENT_TOOL_CLASS_NAME`
[...]
-| [hoodie.spark.sql.insert.into.operation](#hoodiesparksqlinsertintooperation)
| insert
| Sql write operation to use
with INSERT_INTO spark sql command. This comes with 3 possible values,
bulk_insert, insert and upsert. bulk_insert is generally meant for initial
loads and is known to be performant compared to insert. But bulk_insert may not
do small file management. I [...]
-|
[hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable)
| true
| Controls whether spark sql
prepped update, delete, and merge are enabled.<br />`Config Param:
SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0`
[...]
-| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable)
| false
| When set to true, the sql
insert statement will use bulk insert. This config is deprecated as of 0.14.0.
Please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param:
SQL_ENABLE_BULK_INSERT`
[...]
-| [hoodie.sql.insert.mode](#hoodiesqlinsertmode)
| upsert
| Insert mode when insert
data to pk-table. The optional modes are: upsert, strict and non-strict.For
upsert mode, insert statement do the upsert operation for the pk-table which
will update the duplicate record.For strict mode, insert statement will keep
the primary key uniqueness [...]
-|
[hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass)
|
io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is
used by kafka client to deserialize the records<br />`Config Param:
KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0`
[...]
-|
[hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns)
| false
| When a nullable column is
missing from incoming batch during a write operation, the write operation will
fail schema compatibility check. Set this option to true will make the missing
column be filled with null values to successfully complete the write
operation.<br />`Config P [...]
+| Config Name
| Default
| Description
[...]
+|
------------------------------------------------------------------------------------------------------------------------------------------------
| -------------------------------------------------------- |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+|
[hoodie.datasource.hive_sync.serde_properties](#hoodiedatasourcehive_syncserde_properties)
| (N/A)
| Serde properties to hive table.<br
/>`Config Param: HIVE_TABLE_SERDE_PROPERTIES`
[...]
+|
[hoodie.datasource.hive_sync.table_properties](#hoodiedatasourcehive_synctable_properties)
| (N/A)
| Additional properties to store with
table.<br />`Config Param: HIVE_TABLE_PROPERTIES`
[...]
+| [hoodie.datasource.overwrite.mode](#hoodiedatasourceoverwritemode)
| (N/A)
| Controls whether overwrite use
dynamic or static mode, if not configured, respect
spark.sql.sources.partitionOverwriteMode<br />`Config Param: OVERWRITE_MODE`<br
/>`Since Version: 0.14.0`
[...]
+|
[hoodie.datasource.write.partitions.to.delete](#hoodiedatasourcewritepartitionstodelete)
| (N/A)
| Comma separated list of partitions to
delete. Allows use of wildcard *<br />`Config Param: PARTITIONS_TO_DELETE`
[...]
+| [hoodie.datasource.write.table.name](#hoodiedatasourcewritetablename)
| (N/A)
| Table name for the datasource
write. Also used to register the table into meta stores.<br />`Config Param:
TABLE_NAME`
[...]
+|
[hoodie.datasource.compaction.async.enable](#hoodiedatasourcecompactionasyncenable)
| true
| Controls whether async compaction
should be turned on for MOR table writing.<br />`Config Param:
ASYNC_COMPACT_ENABLE`
[...]
+|
[hoodie.datasource.hive_sync.assume_date_partitioning](#hoodiedatasourcehive_syncassume_date_partitioning)
| false
| Assume partitioning is yyyy/MM/dd<br />`Config Param:
HIVE_ASSUME_DATE_PARTITION`
[...]
+|
[hoodie.datasource.hive_sync.auto_create_database](#hoodiedatasourcehive_syncauto_create_database)
| true
| Auto create hive database if does not exists<br
/>`Config Param: HIVE_AUTO_CREATE_DATABASE`
[...]
+|
[hoodie.datasource.hive_sync.base_file_format](#hoodiedatasourcehive_syncbase_file_format)
| PARQUET
| Base file format for the sync.<br />`Config
Param: HIVE_BASE_FILE_FORMAT`
[...]
+| [hoodie.datasource.hive_sync.batch_num](#hoodiedatasourcehive_syncbatch_num)
| 1000
| The number of partitions one
batch when synchronous partitions to hive.<br />`Config Param:
HIVE_BATCH_SYNC_PARTITION_NUM`
[...]
+|
[hoodie.datasource.hive_sync.bucket_sync](#hoodiedatasourcehive_syncbucket_sync)
| false
| Whether sync hive metastore
bucket specification when using bucket index.The specification is 'CLUSTERED BY
(trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'<br />`Config Param:
HIVE_SYNC_BUCKET_SYNC`
[...]
+|
[hoodie.datasource.hive_sync.create_managed_table](#hoodiedatasourcehive_synccreate_managed_table)
| false
| Whether to sync the table as managed table.<br
/>`Config Param: HIVE_CREATE_MANAGED_TABLE`
[...]
+| [hoodie.datasource.hive_sync.database](#hoodiedatasourcehive_syncdatabase)
| default
| The name of the destination
database that we should sync the hudi table to.<br />`Config Param:
HIVE_DATABASE`
[...]
+|
[hoodie.datasource.hive_sync.ignore_exceptions](#hoodiedatasourcehive_syncignore_exceptions)
| false
| Ignore exceptions when syncing with Hive.<br
/>`Config Param: HIVE_IGNORE_EXCEPTIONS`
[...]
+|
[hoodie.datasource.hive_sync.partition_extractor_class](#hoodiedatasourcehive_syncpartition_extractor_class)
|
org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which
implements PartitionValueExtractor to extract the partition values, default
'org.apache.hudi.hive.MultiPartKeysValueExtractor'.<br />`Config Param:
HIVE_PARTITION_EXTRACTOR_CLASS`
[...]
+|
[hoodie.datasource.hive_sync.partition_fields](#hoodiedatasourcehive_syncpartition_fields)
|
| Field in the table to use for determining
hive partition columns.<br />`Config Param: HIVE_PARTITION_FIELDS`
[...]
+| [hoodie.datasource.hive_sync.password](#hoodiedatasourcehive_syncpassword)
| hive
| hive password to use<br
/>`Config Param: HIVE_PASS`
[...]
+|
[hoodie.datasource.hive_sync.skip_ro_suffix](#hoodiedatasourcehive_syncskip_ro_suffix)
| false
| Skip the _ro suffix for Read optimized
table, when registering<br />`Config Param:
HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE`
[...]
+|
[hoodie.datasource.hive_sync.support_timestamp](#hoodiedatasourcehive_syncsupport_timestamp)
| false
| ‘INT64’ with original type TIMESTAMP_MICROS
is converted to hive ‘timestamp’ type. Disabled by default for backward
compatibility. NOTE: On Spark entrypoints, this is defaulted to TRUE<br
/>`Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE`
[...]
+|
[hoodie.datasource.hive_sync.sync_as_datasource](#hoodiedatasourcehive_syncsync_as_datasource)
| true
| <br />`Config Param:
HIVE_SYNC_AS_DATA_SOURCE_TABLE`
[...]
+|
[hoodie.datasource.hive_sync.sync_comment](#hoodiedatasourcehive_syncsync_comment)
| false
| Whether to sync the table column
comments while syncing the table.<br />`Config Param: HIVE_SYNC_COMMENT`
[...]
+| [hoodie.datasource.hive_sync.table](#hoodiedatasourcehive_synctable)
| unknown
| The name of the destination
table that we should sync the hudi table to.<br />`Config Param: HIVE_TABLE`
[...]
+| [hoodie.datasource.hive_sync.use_jdbc](#hoodiedatasourcehive_syncuse_jdbc)
| true
| Use JDBC when hive
synchronization is enabled<br />`Config Param: HIVE_USE_JDBC`
[...]
+|
[hoodie.datasource.hive_sync.use_pre_apache_input_format](#hoodiedatasourcehive_syncuse_pre_apache_input_format)
| false
| Flag to choose InputFormat under com.uber.hoodie package instead
of org.apache.hudi package. Use this when you are in the process of migrating
from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the
table definition to org.apache.hudi input format<br />`Config [...]
+| [hoodie.datasource.hive_sync.username](#hoodiedatasourcehive_syncusername)
| hive
| hive user name to use<br
/>`Config Param: HIVE_USER`
[...]
+| [hoodie.datasource.insert.dup.policy](#hoodiedatasourceinsertduppolicy)
| none
| **Note** This is only
applicable to Spark SQL writing.<br />When operation type is set to
"insert", users can optionally enforce a dedup policy. This policy will be
employed when records being ingested already exists in storage. Default policy
is none and no action will be tak [...]
+|
[hoodie.datasource.meta_sync.condition.sync](#hoodiedatasourcemeta_syncconditionsync)
| false
| If true, only sync on conditions like
schema change or partition change.<br />`Config Param: HIVE_CONDITIONAL_SYNC`
[...]
+|
[hoodie.datasource.write.commitmeta.key.prefix](#hoodiedatasourcewritecommitmetakeyprefix)
| _
| Option keys beginning with this prefix, are
automatically added to the commit/deltacommit metadata. This is useful to store
checkpointing information, in a consistent way with the hudi timeline<br
/>`Config Param: COMMIT_METADATA_KEYPREFIX`
[...]
+|
[hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns)
| false
| When set to true, will not write the
partition columns into hudi. By default, false.<br />`Config Param:
DROP_PARTITION_COLUMNS`
[...]
+|
[hoodie.datasource.write.insert.drop.duplicates](#hoodiedatasourcewriteinsertdropduplicates)
| false
| If set to true, records from the incoming
dataframe will not overwrite existing records with the same key during the
write operation. <br /> **Note** Just for Insert operation in Spark SQL
writing since 0.14.0, users can switch to the config
`hoodie.datasource.insert.dup.policy [...]
+|
[hoodie.datasource.write.keygenerator.class](#hoodiedatasourcewritekeygeneratorclass)
|
org.apache.hudi.keygen.SimpleKeyGenerator | Key generator class,
that implements `org.apache.hudi.keygen.KeyGenerator`<br />`Config Param:
KEYGENERATOR_CLASS_NAME`
[...]
+| [hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled](#hoodiedatasourcewritekeygeneratorconsistentlogicaltimestampenabled) | false | When set to true, a consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether the row-writer is enabled. Disabled by default so as not to break pipelines that deploy either the fully row-writer path or non row [...]
+| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we URL-encode the partition path value before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` [...]
+| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.payload.type](#hoodiedatasourcewritepayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records. AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap an existing Hoodie Avro Record. Useful to cr [...]
+| [hoodie.datasource.write.reconcile.schema](#hoodiedatasourcewritereconcileschema) | false | This config controls how the writer's schema will be selected based on the incoming batch's schema as well as the existing table's. When schema reconciliation is DISABLED, the incoming batch's schema will be picked as the writer schema (therefore updating the table's schema). When schema reconcili [...]
+| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy, based on the engine used. These merger impls will be filtered by hoodie.datasource.write.record.merger.strategy; Hudi will pick the most efficient implementation to perform merging/combining of the records (during upd [...]
+| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.row.writer.enable](#hoodiedatasourcewriterowwriterenable) | true | When set to true, will perform write operations directly using the spark native `Row` representation, avoiding any additional conversion costs.<br />`Config Param: ENABLE_ROW_WRITER` [...]
+| [hoodie.datasource.write.streaming.checkpoint.identifier](#hoodiedatasourcewritestreamingcheckpointidentifier) | default_single_writer | A stream identifier used by Hudi to fetch the right checkpoint (the `batch id`, to be more specific) corresponding to this writer. Please keep the identifier unique for each writer in a multi-writer scenario. If the value is not set, will only keep the checkpoint [...]
+| [hoodie.datasource.write.streaming.disable.compaction](#hoodiedatasourcewritestreamingdisablecompaction) | false | By default for a MOR table, async compaction is enabled with the spark streaming sink. By setting this config to true, we can disable it; the expectation is that users will schedule and execute compaction in a different process/job altogether. Some users may wish to run it separately t [...]
+| [hoodie.datasource.write.streaming.ignore.failed.batch](#hoodiedatasourcewritestreamingignorefailedbatch) | false | Config to indicate whether to ignore any non-exception error (e.g. writestatus error) within a streaming microbatch. Turning this on could hide write status errors while the spark checkpoint moves ahead, so we recommend users use this with caution.<br />`Config Param: STRE [...]
+| [hoodie.datasource.write.streaming.retry.count](#hoodiedatasourcewritestreamingretrycount) | 3 | Config to indicate how many times the streaming job should retry a failed micro batch.<br />`Config Param: STREAMING_RETRY_CNT` [...]
+| [hoodie.datasource.write.streaming.retry.interval.ms](#hoodiedatasourcewritestreamingretryintervalms) | 2000 | Config to indicate how long (in milliseconds) to wait before a retry is issued for a failed microbatch<br />`Config Param: STREAMING_RETRY_INTERVAL_MS` [...]
+| [hoodie.meta.sync.client.tool.class](#hoodiemetasyncclienttoolclass) | org.apache.hudi.hive.HiveSyncTool | Sync tool class name used to sync to metastore. Defaults to Hive.<br />`Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME` [...]
+| [hoodie.spark.sql.insert.into.operation](#hoodiesparksqlinsertintooperation) | insert | SQL write operation to use with the INSERT_INTO Spark SQL command. This comes with 3 possible values: bulk_insert, insert and upsert. bulk_insert is generally meant for initial loads and is known to be performant compared to insert, but bulk_insert may not do small file management. If yo [...]
+| [hoodie.spark.sql.optimized.writes.enable](#hoodiesparksqloptimizedwritesenable) | true | Controls whether Spark SQL prepped update, delete, and merge are enabled.<br />`Config Param: SPARK_SQL_OPTIMIZED_WRITES`<br />`Since Version: 0.14.0` [...]
+| [hoodie.sql.bulk.insert.enable](#hoodiesqlbulkinsertenable) | false | When set to true, the SQL insert statement will use bulk insert. This config is deprecated as of 0.14.0; please use hoodie.spark.sql.insert.into.operation instead.<br />`Config Param: SQL_ENABLE_BULK_INSERT` [...]
+| [hoodie.sql.insert.mode](#hoodiesqlinsertmode) | upsert | Insert mode when inserting data into a pk-table. The optional modes are: upsert, strict and non-strict. For upsert mode, the insert statement does the upsert operation for the pk-table, which will update the duplicate record. For strict mode, the insert statement will keep the primary key uniqueness con [...]
+| [hoodie.streamer.source.kafka.value.deserializer.class](#hoodiestreamersourcekafkavaluedeserializerclass) | io.confluent.kafka.serializers.KafkaAvroDeserializer | This class is used by the Kafka client to deserialize the records<br />`Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS`<br />`Since Version: 0.9.0` [...]
+| [hoodie.write.set.null.for.missing.columns](#hoodiewritesetnullformissingcolumns) | false | When a nullable column is missing from the incoming batch during a write operation, the write operation will fail the schema compatibility check. Setting this option to true will fill the missing column with null values so that the write operation can complete successfully.<br />`Config Param [...]
---
@@ -936,7 +938,8 @@ Configurations that control write behavior on Hudi tables. These can be directly
| [hoodie.consistency.check.max_checks](#hoodieconsistencycheckmax_checks) | 7 | Maximum number of checks for consistency of written data.<br />`Config Param: MAX_CONSISTENCY_CHECKS` [...]
| [hoodie.consistency.check.max_interval_ms](#hoodieconsistencycheckmax_interval_ms) | 300000 | Max time to wait between successive attempts at performing consistency checks<br />`Config Param: MAX_CONSISTENCY_CHECK_INTERVAL_MS` [...]
| [hoodie.datasource.write.keygenerator.type](#hoodiedatasourcewritekeygeneratortype) | SIMPLE | **Note** This is being actively worked on. Please use `hoodie.datasource.write.keygenerator.class` instead. org.apache.hudi.keygen.constant.KeyGeneratorType: Key generator type, indicating the key generator class to use, that implements `org.apache.hudi.keygen.KeyGenerator`. SIMPLE(default) [...]
-| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective<br />`Config Param: WRITE_PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.payload.class](#hoodiedatasourcewritepayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective<br />`Config Param: WRITE_PAYLOAD_CLASS_NAME` [...]
+| [hoodie.datasource.write.payload.type](#hoodiedatasourcewritepayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records. AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap an existing Hoodie Avro Record. Useful to create a HoodieRe [...]
| [hoodie.datasource.write.record.merger.impls](#hoodiedatasourcewriterecordmergerimpls) | org.apache.hudi.common.model.HoodieAvroRecordMerger | List of HoodieMerger implementations constituting Hudi's merging strategy, based on the engine used. These merger impls will be filtered by hoodie.datasource.write.record.merger.strategy; Hudi will pick the most efficient implementation to perform merging/combining of the records (during update, readin [...]
| [hoodie.datasource.write.record.merger.strategy](#hoodiedatasourcewriterecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
| [hoodie.datasource.write.schema.allow.auto.evolution.column.drop](#hoodiedatasourcewriteschemaallowautoevolutioncolumndrop) | false | Controls whether the table's schema is allowed to automatically evolve when the incoming batch's schema can have any of the columns dropped. By default, Hudi will not allow this kind of (auto) schema evolution. Set this config to true to allow the table's schema to be updated automatically when columns are [...]
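The rows above capture the doc change this commit makes: the documented default payload class moves from OverwriteWithLatestAvroPayload to DefaultHoodieRecordPayload (payload type HOODIE_AVRO_DEFAULT). As a hedged sketch, a job that depends on the previous latest-write-wins merge behavior could pin the payload class explicitly instead of relying on the default (again assuming a DataFrame `df`; table name and path are illustrative):

```scala
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "trips"). // hypothetical table name
  // Pin the pre-change payload class explicitly; without this, the
  // documented default (DefaultHoodieRecordPayload) applies.
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload").
  mode(SaveMode.Append).
  save("/tmp/hudi/trips") // hypothetical base path
```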
@@ -1691,7 +1694,7 @@ Payload related configs, that can be leveraged to control merges based on specif
| Config Name | Default | Description |
| ---------------------------------------------------------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload | This needs to be the same as the class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.<br />`Config Param: PAYLOAD_CLASS_NAME` |
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | This needs to be the same as the class used during insert/upserts. Just like writing, compaction also uses the record payload class to merge records in the log against each other, merge again with the base file and produce the final record to be written after compaction.<br />`Config Param: PAYLOAD_CLASS_NAME` |
| [hoodie.payload.event.time.field](#hoodiepayloadeventtimefield) | ts | Table column/field name to derive the timestamp associated with the records. This can be useful, e.g., for determining the freshness of the table.<br />`Config Param: EVENT_TIME_FIELD` |
| [hoodie.payload.ordering.field](#hoodiepayloadorderingfield) | ts | Table column/field name to order records that have the same key, before merging and writing to storage.<br />`Config Param: ORDERING_FIELD` |
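Since compaction reuses the record payload class from the write path, the compaction payload class and the ordering/event-time fields should line up with the write-side settings. A minimal sketch under those assumptions, with `ts` as an illustrative ordering column and a hypothetical table name and path:

```scala
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "trips").
  // Order same-key records by the illustrative `ts` column before merging.
  option("hoodie.payload.ordering.field", "ts").
  // Derive record freshness from the same column.
  option("hoodie.payload.event.time.field", "ts").
  // Must match the payload class used during insert/upserts.
  option("hoodie.compaction.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload").
  mode(SaveMode.Append).
  save("/tmp/hudi/trips")
```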
---