This is an automated email from the ASF dual-hosted git repository.
danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e3894931de4 [DOCS] Added configurations of Hudi table, file-based SQL
source, Hudi error table, and timestamp key generator to configuration listing
(#11058)
e3894931de4 is described below
commit e3894931de489f222730972d76f783ffd67cccac
Author: Geser Dugarov <[email protected]>
AuthorDate: Sat Apr 20 07:44:31 2024 +0700
[DOCS] Added configurations of Hudi table, file-based SQL source, Hudi
error table, and timestamp key generator to configuration listing (#11058)
---
website/docs/basic_configurations.md | 91 ++++++++++++++++++++++++-
website/docs/configurations.md | 125 ++++++++++++++++++++++++++++++++++-
2 files changed, 214 insertions(+), 2 deletions(-)
diff --git a/website/docs/basic_configurations.md
b/website/docs/basic_configurations.md
index 2f18ad3e885..1fc301521e1 100644
--- a/website/docs/basic_configurations.md
+++ b/website/docs/basic_configurations.md
@@ -1,12 +1,13 @@
---
title: Basic Configurations
summary: This page covers the basic configurations you may use to write/read
Hudi tables. This page only features a subset of the most frequently used
configurations. For a full list of all configs, please visit the [All
Configurations](/docs/configurations) page.
-last_modified_at: 2024-04-15T09:56:05.413
+last_modified_at: 2024-04-19T18:21:42.88
---
This page covers the basic configurations you may use to write/read Hudi
tables. This page only features a subset of the most frequently used
configurations. For a full list of all configs, please visit the [All
Configurations](/docs/configurations) page.
- [**Hudi Table Config**](#TABLE_CONFIG): Basic Hudi Table configuration parameters.
- [**Spark Datasource Configs**](#SPARK_DATASOURCE): These configs control the
Hudi Spark Datasource, providing ability to define keys/partitioning, pick out
the write operation, specify how to merge records or choosing query type to
read.
- [**Flink Sql Configs**](#FLINK_SQL): These configs control the Hudi Flink
SQL source/sink connectors, providing ability to define record keys, pick out
the write operation, specify how to merge records, enable/disable asynchronous
compaction or choosing query type to read.
- [**Write Client Configs**](#WRITE_CLIENT): Internally, the Hudi datasource
uses a RDD based HoodieWriteClient API to actually perform writes to storage.
These configs provide deep control over lower level aspects like file sizing,
compression, parallelism, compaction, write schema, cleaning etc. Although Hudi
provides sane defaults, from time to time these configs may need to be tweaked to
optimize for specific workloads.
@@ -20,6 +21,56 @@ This page covers the basic configurations you may use to
write/read Hudi tables.
In the tables below **(N/A)** means there is no default value set
:::
+## Hudi Table Config {#TABLE_CONFIG}
+Basic Hudi Table configuration parameters.
+
+
+### Hudi Table Basic Configs {#Hudi-Table-Basic-Configs}
+Configurations of the Hudi table, such as type of ingestion, storage formats, Hive table name, etc. These configurations are loaded from hoodie.properties; they are usually set when initializing a path as a Hudi base path and never change during the lifetime of the table.
+
+
+
+
+[**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
+
+
+| Config Name | Default | Description [...]
+| ----------- | ------- | ----------- [...]
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
+| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
+| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
+| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Absolute path where the index definitions are stored<br />`Config Param: INDEX_DEFINITION_PATH`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
+| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key generator type to determine the key generator class<br />`Config Param: KEY_GENERATOR_TYPE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br />`Config Param: NAME` [...]
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
+| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field, determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
+| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder to store archived timeline instants.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex | Implementation to use for mapping base files to bootstrap base files that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined. Default: true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
+| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use for mapping base files to bootstrap base files that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_TYPE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e. merge delta logs with the current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records. AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap an existing Hoodie Avro Record. Useful to create a HoodieRecord over existing GenericReco [...]
+| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of the merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base files for this dataset (e.g. Parquet / ORC). If false (default), partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append-only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
+| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, which can then be queried in CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
+| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and af [...]
+| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
+| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.<br />`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set the hoodie commit timeline timezone, such as UTC, LOCAL and so on. LOCAL is the default<br />`Config Param: TIMELINE_TIMEZONE` [...]
+| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can't change between writes.<br />`Config Param: TYPE` [...]
+| [hoodie.table.version](#hoodietableversion) | ZERO | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
+---
+
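These table-level configs are persisted in hoodie.properties under the table's base path as Java-properties-style key=value lines. As a minimal sketch (not Hudi code; the sample keys and values below are illustrative, not taken from a real table), loading them could look like:

```python
def load_table_config(text: str) -> dict:
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg

# Illustrative hoodie.properties content (values hypothetical)
sample = """\
#Properties saved during table init
hoodie.table.name=trips
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
"""
config = load_table_config(sample)
```

In real deployments Hudi itself reads and validates this file (including the `hoodie.table.checksum` entry); the sketch only shows the on-disk shape of the configs listed above.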
## Spark Datasource Configs {#SPARK_DATASOURCE}
These configs control the Hudi Spark Datasource, providing ability to define
keys/partitioning, pick out the write operation, specify how to merge records
or choosing query type to read.
@@ -270,6 +321,29 @@ Configurations that control compaction (merging of log
files onto a new base fil
---
+### Error table Configs {#Error-table-Configs}
+Configurations required for the error table.
+
+
+
+
+[**Basic Configs**](#Error-table-Configs-basic-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.errortable.base.path](#hoodieerrortablebasepath) | (N/A) | Base path for the error table, under which all error records would be stored.<br />`Config Param: ERROR_TABLE_BASE_PATH` |
+| [hoodie.errortable.target.table.name](#hoodieerrortabletargettablename) | (N/A) | Table name to be used for the error table<br />`Config Param: ERROR_TARGET_TABLE` |
+| [hoodie.errortable.write.class](#hoodieerrortablewriteclass) | (N/A) | Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this config<br />`Config Param: ERROR_TABLE_WRITE_CLASS` |
+| [hoodie.errortable.enable](#hoodieerrortableenable) | false | Config to enable the error table. If the config is enabled, all the records with processing errors in DeltaStreamer are transferred to the error table.<br />`Config Param: ERROR_TABLE_ENABLED` |
+| [hoodie.errortable.insert.shuffle.parallelism](#hoodieerrortableinsertshuffleparallelism) | 200 | Config to set insert shuffle parallelism. The config is similar to the hoodie.insert.shuffle.parallelism config but applies to the error table.<br />`Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE` |
+| [hoodie.errortable.upsert.shuffle.parallelism](#hoodieerrortableupsertshuffleparallelism) | 200 | Config to set upsert shuffle parallelism. The config is similar to the hoodie.upsert.shuffle.parallelism config but applies to the error table.<br />`Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE` |
+| [hoodie.errortable.validate.recordcreation.enable](#hoodieerrortablevalidaterecordcreationenable) | true | Records that fail to be created due to key generation failure or other issues will be sent to the error table<br />`Config Param: ERROR_ENABLE_VALIDATE_RECORD_CREATION`<br />`Since Version: 0.14.2` |
+| [hoodie.errortable.validate.targetschema.enable](#hoodieerrortablevalidatetargetschemaenable) | false | Records with a schema mismatch with the target schema are sent to the error table.<br />`Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA` |
+| [hoodie.errortable.write.failure.strategy](#hoodieerrortablewritefailurestrategy) | ROLLBACK_COMMIT | The config specifies the failure strategy if the error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered), LOG_ERROR (Error is logged but the base table write succeeds)]<br />`Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY` |
+---
+
+
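Putting the configs above together, a sketch of the properties a Hudi Streamer job might pass to enable the error table (the bucket path and table name here are hypothetical, not defaults):

```python
# Hypothetical property set enabling the error table for a streamer job.
error_table_props = {
    "hoodie.errortable.enable": "true",
    # Hypothetical base path where error records would land
    "hoodie.errortable.base.path": "s3://my-bucket/hudi/error_table",
    # Hypothetical error-table name
    "hoodie.errortable.target.table.name": "trips_errors",
    # LOG_ERROR keeps the base table commit even if the error-table write
    # fails; the default ROLLBACK_COMMIT rolls the base commit back.
    "hoodie.errortable.write.failure.strategy": "LOG_ERROR",
}
```

With `LOG_ERROR`, a failed error-table write is logged but does not block ingestion, which is a common trade-off when error capture is best-effort.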
### Write Configurations {#Write-Configurations}
Configurations that control write behavior on Hudi tables. These can be
directly passed down from even higher level frameworks (e.g Spark datasources,
Flink sink) and utilities (e.g Hudi Streamer).
@@ -623,6 +697,21 @@ Configurations controlling the behavior of S3 source in
Hudi Streamer.
---
+#### File-based SQL Source Configs {#File-based-SQL-Source-Configs}
+Configurations controlling the behavior of File-based SQL Source in Hudi Streamer.
+
+
+
+
+[**Basic Configs**](#File-based-SQL-Source-Configs-basic-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.streamer.source.sql.file](#hoodiestreamersourcesqlfile) | (N/A) | SQL file path containing the SQL query to read source data.<br />`Config Param: SOURCE_SQL_FILE`<br />`Since Version: 0.14.0` |
+---
+
+
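A sketch of wiring this source: the file named by `hoodie.streamer.source.sql.file` holds the query that produces the source rows (the query and table names below are hypothetical):

```python
import tempfile

# Hypothetical source query; staging_db.trips_raw is an assumed table name.
query = "SELECT id, ts, rider FROM staging_db.trips_raw"

# Write the query to a .sql file and point the streamer config at it.
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
    f.write(query)
    sql_path = f.name

streamer_props = {"hoodie.streamer.source.sql.file": sql_path}

# The source would read the file back and execute the query it contains.
with open(sql_path) as f:
    loaded = f.read()
```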
#### SQL Source Configs {#SQL-Source-Configs}
Configurations controlling the behavior of SQL source in Hudi Streamer.
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 1271fb9822c..728a1e61409 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -5,12 +5,13 @@ permalink: /docs/configurations.html
summary: This page covers the different ways of configuring your job to
write/read Hudi tables. At a high level, you can control behaviour at few
levels.
toc_min_heading_level: 2
toc_max_heading_level: 4
-last_modified_at: 2024-04-15T09:56:05.395
+last_modified_at: 2024-04-19T18:21:42.86
---
This page covers the different ways of configuring your job to write/read Hudi
tables. At a high level, you can control behaviour at few levels.
+- [**Hudi Table Config**](#TABLE_CONFIG): Basic Hudi Table configuration parameters.
- [**Environment Config**](#ENVIRONMENT_CONFIG): Hudi supports passing
configurations via a configuration file `hudi-default.conf` in which each line
consists of a key and a value separated by whitespace or = sign. For example:
```
hoodie.datasource.hive_sync.mode jdbc
@@ -42,6 +43,63 @@ file `hudi-default.conf`. By default, Hudi would load the
configuration file und
specify a different configuration directory location by setting the
`HUDI_CONF_DIR` environment variable. This can be
useful for uniformly enforcing repeated configs (like Hive sync or write/index
tuning), across your entire data lake.
+## Hudi Table Config {#TABLE_CONFIG}
+Basic Hudi Table configuration parameters.
+
+
+### Hudi Table Basic Configs {#Hudi-Table-Basic-Configs}
+Configurations of the Hudi table, such as type of ingestion, storage formats, Hive table name, etc. These configurations are loaded from hoodie.properties; they are usually set when initializing a path as a Hudi base path and never change during the lifetime of the table.
+
+
+
+[**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
+
+
+| Config Name | Default | Description [...]
+| ----------- | ------- | ----------- [...]
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br />`Config Param: BOOTSTRAP_BASE_PATH` [...]
+| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name that will be used for incremental query. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br />`Config Param: DATABASE_NAME` [...]
+| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br />`Config Param: TABLE_CHECKSUM`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table, for the first time.<br />`Config Param: CREATE_SCHEMA` [...]
+| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Absolute path where the index definitions are stored<br />`Config Param: INDEX_DEFINITION_PATH`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key generator class property for the hoodie table<br />`Config Param: KEY_GENERATOR_CLASS_NAME` [...]
+| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key generator type to determine the key generator class<br />`Config Param: KEY_GENERATOR_TYPE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by the readers<br />`Config Param: TABLE_METADATA_PARTITIONS`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br />`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br />`Since Version: 0.11.0` [...]
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br />`Config Param: NAME` [...]
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()<br />`Config Param: PARTITION_FIELDS` [...]
+| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field, determined by Object.compareTo(..), is picked.<br />`Config Param: PRECOMBINE_FIELD` [...]
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br />`Config Param: RECORDKEY_FIELDS` [...]
+| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br />`Config Param: SECONDARY_INDEXES_METADATA`<br />`Since Version: 0.13.0` [...]
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br />`Config Param: TIMELINE_LAYOUT_VERSION` [...]
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder to store archived timeline instants.<br />`Config Param: ARCHIVELOG_FOLDER` [...]
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex | Implementation to use for mapping base files to bootstrap base files that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` [...]
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined. Default: true.<br />`Config Param: BOOTSTRAP_INDEX_ENABLE` [...]
+| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use for mapping base files to bootstrap base files that contain actual data.<br />`Config Param: BOOTSTRAP_INDEX_TYPE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | org.apache.hudi.common.model.DefaultHoodieRecordPayload | Payload class to use for performing compactions, i.e. merge delta logs with the current base file and then produce a new base file.<br />`Config Param: PAYLOAD_CLASS_NAME` [...]
+| [hoodie.compaction.payload.type](#hoodiecompactionpayloadtype) | HOODIE_AVRO_DEFAULT | org.apache.hudi.common.model.RecordPayloadType: Payload to use for merging records. AWS_DMS_AVRO: Provides support for seamlessly applying changes captured via Amazon Database Migration Service onto S3. HOODIE_AVRO: A payload to wrap an existing Hoodie Avro Record. Useful to create a HoodieRecord over existing GenericReco [...]
+| [hoodie.compaction.record.merger.strategy](#hoodiecompactionrecordmergerstrategy) | eeb8d96f-b1e4-49fd-bbf8-28ac514178e5 | Id of the merger strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.datasource.write.record.merger.impls which have the same merger strategy id<br />`Config Param: RECORD_MERGER_STRATEGY`<br />`Since Version: 0.13.0` [...]
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br />`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` [...]
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base files for this dataset (e.g. Parquet / ORC). If false (default), partition metafiles are saved as properties files.<br />`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` [...]
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append-only/immutable data for batch processing<br />`Config Param: POPULATE_META_FIELDS` [...]
+| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br />`Config Param: BASE_FILE_FORMAT` [...]
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, which can then be queried in CDC query mode.<br />`Config Param: CDC_ENABLED`<br />`Since Version: 0.13.0` [...]
+| [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and af [...]
+| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br />`Config Param: LOG_FILE_FORMAT` [...]
+| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.<br />`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`<br />`Since Version: 1.0.0` [...]
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set the hoodie commit timeline timezone, such as UTC, LOCAL and so on. LOCAL is the default<br />`Config Param: TIMELINE_TIMEZONE` [...]
+| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can't change between writes.<br />`Config Param: TYPE` [...]
+| [hoodie.table.version](#hoodietableversion) | ZERO | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br />`Config Param: VERSION` [...]
+
+[**Advanced Configs**](#Hudi-Table-Basic-Configs-advanced-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.datasource.write.drop.partition.columns](#hoodiedatasourcewritedroppartitioncolumns) | false | When set to true, will not write the partition columns into Hudi. By default, false.<br />`Config Param: DROP_PARTITION_COLUMNS` |
+| [hoodie.datasource.write.partitionpath.urlencode](#hoodiedatasourcewritepartitionpathurlencode) | false | Should we URL-encode the partition path value before creating the folder structure.<br />`Config Param: URL_ENCODE_PARTITIONING` |
+---
+
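To illustrate what `hoodie.datasource.write.partitionpath.urlencode` guards against: a partition value containing `/` would otherwise create extra directory levels in the partition path. The encoding below mirrors the idea, not Hudi's exact implementation:

```python
from urllib.parse import quote

# A date-like partition value containing "/" (illustrative).
raw_value = "2024/04/19"

# URL-encode so the slashes don't become nested folders.
encoded = quote(raw_value, safe="")
```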
## Spark Datasource Configs {#SPARK_DATASOURCE}
These configs control the Hudi Spark Datasource, providing ability to define
keys/partitioning, pick out the write operation, specify how to merge records
or choosing query type to read.
@@ -764,6 +822,28 @@ Configurations that control compaction (merging of log
files onto a new base fil
---
+### Error table Configs {#Error-table-Configs}
+Configurations required for the error table.
+
+
+
+[**Basic Configs**](#Error-table-Configs-basic-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.errortable.base.path](#hoodieerrortablebasepath) | (N/A) | Base path for the error table, under which all error records would be stored.<br />`Config Param: ERROR_TABLE_BASE_PATH` |
+| [hoodie.errortable.target.table.name](#hoodieerrortabletargettablename) | (N/A) | Table name to be used for the error table<br />`Config Param: ERROR_TARGET_TABLE` |
+| [hoodie.errortable.write.class](#hoodieerrortablewriteclass) | (N/A) | Class which handles the error table writes. This config is used to configure a custom implementation for Error Table Writer. Specify the full class name of the custom error table writer as a value for this config<br />`Config Param: ERROR_TABLE_WRITE_CLASS` |
+| [hoodie.errortable.enable](#hoodieerrortableenable) | false | Config to enable the error table. If the config is enabled, all the records with processing errors in DeltaStreamer are transferred to the error table.<br />`Config Param: ERROR_TABLE_ENABLED` |
+| [hoodie.errortable.insert.shuffle.parallelism](#hoodieerrortableinsertshuffleparallelism) | 200 | Config to set insert shuffle parallelism. The config is similar to the hoodie.insert.shuffle.parallelism config but applies to the error table.<br />`Config Param: ERROR_TABLE_INSERT_PARALLELISM_VALUE` |
+| [hoodie.errortable.upsert.shuffle.parallelism](#hoodieerrortableupsertshuffleparallelism) | 200 | Config to set upsert shuffle parallelism. The config is similar to the hoodie.upsert.shuffle.parallelism config but applies to the error table.<br />`Config Param: ERROR_TABLE_UPSERT_PARALLELISM_VALUE` |
+| [hoodie.errortable.validate.recordcreation.enable](#hoodieerrortablevalidaterecordcreationenable) | true | Records that fail to be created due to key generation failure or other issues will be sent to the error table<br />`Config Param: ERROR_ENABLE_VALIDATE_RECORD_CREATION`<br />`Since Version: 0.14.2` |
+| [hoodie.errortable.validate.targetschema.enable](#hoodieerrortablevalidatetargetschemaenable) | false | Records with a schema mismatch with the target schema are sent to the error table.<br />`Config Param: ERROR_ENABLE_VALIDATE_TARGET_SCHEMA` |
+| [hoodie.errortable.write.failure.strategy](#hoodieerrortablewritefailurestrategy) | ROLLBACK_COMMIT | The config specifies the failure strategy if the error table write fails. Use one of - [ROLLBACK_COMMIT (Rollback the corresponding base table write commit for which the error events were triggered), LOG_ERROR (Error is logged but the base table write succeeds)]<br />`Config Param: ERROR_TABLE_WRITE_FAILURE_STRATEGY` |
+---
+
+
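Taken together, a minimal error-table setup in a Hudi Streamer properties file might look like the sketch below. The base path, table name, and chosen strategy are hypothetical values for illustration, not defaults:

```properties
# Illustrative sketch: route records that fail processing to an error table.
# The path and table name below are made-up example values.
hoodie.errortable.enable=true
hoodie.errortable.base.path=s3://my-bucket/hudi/error_tables
hoodie.errortable.target.table.name=orders_errors
# Log write failures instead of rolling back the base table commit.
hoodie.errortable.write.failure.strategy=LOG_ERROR
```
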
### Layout Configs {#Layout-Configs}
Configurations that control storage layout and data distribution, which define how the files are organized within a table.
@@ -1059,6 +1139,28 @@ Hudi maintains keys (record key + partition path) for uniquely identifying a par
---
+#### Timestamp-based key generator configs {#Timestamp-based-key-generator-configs}
+Configs used for TimestampBasedKeyGenerator, which relies on timestamps for the partition field. The field values are interpreted as timestamps rather than simply converted to strings when generating the partition path value for records. The record key is chosen by field name, as before.
+
+
+
+[**Advanced Configs**](#Timestamp-based-key-generator-configs-advanced-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.keygen.timebased.timestamp.type](#hoodiekeygentimebasedtimestamptype) | (N/A) | Timestamp type of the field, which should be one of the supported timestamp types: `UNIX_TIMESTAMP`, `DATE_STRING`, `MIXED`, `EPOCHMILLISECONDS`, `SCALAR`.<br />`Config Param: TIMESTAMP_TYPE_FIELD` |
+| [hoodie.keygen.datetime.parser.class](#hoodiekeygendatetimeparserclass) | org.apache.hudi.keygen.parser.HoodieDateTimeParser | Date time parser class name.<br />`Config Param: DATE_TIME_PARSER` |
+| [hoodie.keygen.timebased.input.dateformat](#hoodiekeygentimebasedinputdateformat) | | Input date format such as `yyyy-MM-dd'T'HH:mm:ss.SSSZ`.<br />`Config Param: TIMESTAMP_INPUT_DATE_FORMAT` |
+| [hoodie.keygen.timebased.input.dateformat.list.delimiter.regex](#hoodiekeygentimebasedinputdateformatlistdelimiterregex) | , | The delimiter for the allowed input date format list, usually `,`.<br />`Config Param: TIMESTAMP_INPUT_DATE_FORMAT_LIST_DELIMITER_REGEX` |
+| [hoodie.keygen.timebased.input.timezone](#hoodiekeygentimebasedinputtimezone) | UTC | Timezone of the input timestamp, such as `UTC`.<br />`Config Param: TIMESTAMP_INPUT_TIMEZONE_FORMAT` |
+| [hoodie.keygen.timebased.output.dateformat](#hoodiekeygentimebasedoutputdateformat) | | Output date format such as `yyyy-MM-dd'T'HH:mm:ss.SSSZ`.<br />`Config Param: TIMESTAMP_OUTPUT_DATE_FORMAT` |
+| [hoodie.keygen.timebased.output.timezone](#hoodiekeygentimebasedoutputtimezone) | UTC | Timezone of the output timestamp, such as `UTC`.<br />`Config Param: TIMESTAMP_OUTPUT_TIMEZONE_FORMAT` |
+| [hoodie.keygen.timebased.timestamp.scalar.time.unit](#hoodiekeygentimebasedtimestampscalartimeunit) | SECONDS | When timestamp type `SCALAR` is used, this specifies the time unit, with allowed units given by the `TimeUnit` enum (`NANOSECONDS`, `MICROSECONDS`, `MILLISECONDS`, `SECONDS`, `MINUTES`, `HOURS`, `DAYS`).<br />`Config Param: INPUT_TIME_UNIT` |
+| [hoodie.keygen.timebased.timezone](#hoodiekeygentimebasedtimezone) | UTC | Timezone of both input and output timestamps if they are the same, such as `UTC`. Use `hoodie.keygen.timebased.input.timezone` and `hoodie.keygen.timebased.output.timezone` instead if the input and output timezones differ.<br />`Config Param: TIMESTAMP_TIMEZONE_FORMAT` |
+---
+
+
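As an example of how these options combine, the sketch below partitions a table by day from an epoch-millisecond field. The key generator class config and the chosen output format are illustrative values under that assumption, not defaults from the table above:

```properties
# Illustrative sketch: day-grained partitions from an epoch-millis field.
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
# e.g. 1713571200000 -> partition path 2024/04/20
hoodie.keygen.timebased.output.dateformat=yyyy/MM/dd
hoodie.keygen.timebased.timezone=UTC
```
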
### Index Configs {#INDEX}
Configurations that control indexing behavior, which tags incoming records as either inserts or updates to older records.
@@ -1977,6 +2079,27 @@ Configurations controlling the behavior of S3 source in Hudi Streamer.
---
+#### File-based SQL Source Configs {#File-based-SQL-Source-Configs}
+Configurations controlling the behavior of the File-based SQL Source in Hudi Streamer.
+
+
+
+[**Basic Configs**](#File-based-SQL-Source-Configs-basic-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.streamer.source.sql.file](#hoodiestreamersourcesqlfile) | (N/A) | SQL file path containing the SQL query to read source data.<br />`Config Param: SOURCE_SQL_FILE`<br />`Since Version: 0.14.0` |
+
+[**Advanced Configs**](#File-based-SQL-Source-Configs-advanced-configs)
+
+
+| Config Name | Default | Description |
+| ----------- | ------- | ----------- |
+| [hoodie.streamer.source.sql.checkpoint.emit](#hoodiestreamersourcesqlcheckpointemit) | false | Whether to emit the current epoch as the streamer checkpoint.<br />`Config Param: EMIT_EPOCH_CHECKPOINT`<br />`Since Version: 0.14.0` |
+---
+
+
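A minimal sketch of wiring up this source in a Hudi Streamer properties file follows; the file path is a made-up example, and the SQL file it points to would hold a query of your choosing (e.g. `SELECT * FROM some_db.some_table`):

```properties
# Illustrative sketch: read source data from a query stored in a file.
# The path below is hypothetical.
hoodie.streamer.source.sql.file=file:///tmp/source_query.sql
# Optionally emit the current epoch as the streamer checkpoint.
hoodie.streamer.source.sql.checkpoint.emit=false
```
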
#### SQL Source Configs {#SQL-Source-Configs}
Configurations controlling the behavior of SQL source in Hudi Streamer.