pratyakshsharma commented on code in PR #4927:
URL: https://github.com/apache/hudi/pull/4927#discussion_r867343006
##########
website/docs/basic_configurations.md:
##########
@@ -0,0 +1,750 @@
+---
+title: Basic Configurations
+toc: true
+---
+
+This page covers the basic configurations you may use to write/read Hudi tables. It features only a subset of the
+most frequently used configurations. For a full list of all configs, please visit the [All Configurations](/docs/configurations) page.
+
+- [**Spark Datasource Configs**](#SPARK_DATASOURCE): These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how to merge records, or choose the query type to read.
+- [**Flink Sql Configs**](#FLINK_SQL): These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
+- [**Write Client Configs**](#WRITE_CLIENT): Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
+- [**Metrics Configs**](#METRICS): This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.
+- [**Record Payload Config**](#RECORD_PAYLOAD): This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert, based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels.
+
+## Spark Datasource Configs {#SPARK_DATASOURCE}
+These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick out the write operation, specify how to merge records, or choose the query type to read.
+
+### Read Options {#Read-Options}
+
+Options useful for reading tables via `read.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+> #### hoodie.datasource.query.type
+> Whether data needs to be read in incremental mode (new data since an instantTime), read-optimized mode (obtain the latest view, based on base files only), or snapshot mode (obtain the latest view, by merging base and, if any, log files)<br></br>
+> **Default Value**: snapshot (Optional)<br></br>
+> `Config Param: QUERY_TYPE`<br></br>
+
+---
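+
+As an example, here is a minimal sketch of an incremental read using this option. The base path and the begin instant time below are hypothetical placeholders; `hoodie.datasource.read.begin.instanttime` is the usual companion read option for incremental queries.
+
+```java
+// Hedged sketch: read only data committed after a given instant time.
+// "spark" is an existing SparkSession; path and instant are placeholders.
+Dataset<Row> incrementalDF = spark.read()
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.query.type", "incremental")
+    .option("hoodie.datasource.read.begin.instanttime", "20220501000000")
+    .load("/tmp/hudi_table");
+```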
+
+### Write Options {#Write-Options}
+
+You can pass down any of the WriteClient level configs directly using
`options()` or `option(k,v)` methods.
+
+```java
+inputDF.write()
+    .format("org.apache.hudi")
+    .options(clientOpts) // any of the Hudi client opts can be passed in as well
+    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```
+
+Options useful for writing tables via `write.format.option(...)`
+
+
+`Config Class`: org.apache.hudi.DataSourceOptions.scala<br></br>
+
+> #### hoodie.datasource.write.operation
+> Whether to do upsert, insert or bulkinsert for the write operation. Use bulkinsert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path to scale to large inputs without the need to cache them.<br></br>
+> **Default Value**: upsert (Optional)<br></br>
+> `Config Param: OPERATION`<br></br>
+
+---
+
+> #### hoodie.datasource.write.table.type
+> The table type for the underlying data, for this write. This can’t change
between writes.<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
+> #### hoodie.datasource.write.table.name
+> Table name for the datasource write. Also used to register the table into
meta stores.<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
+> #### hoodie.datasource.write.recordkey.field
+> Record key field. Value to be used as the `recordKey` component of
`HoodieKey`.
+Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using
+the dot notation, e.g. `a.b.c`<br></br>
+> **Default Value**: uuid (Optional)<br></br>
+> `Config Param: RECORDKEY_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.partitionpath.field
+> Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: PARTITIONPATH_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.keygenerator.class
+> Key generator class, that implements
`org.apache.hudi.keygen.KeyGenerator`<br></br>
+> **Default Value**: org.apache.hudi.keygen.SimpleKeyGenerator
(Optional)<br></br>
+> `Config Param: KEYGENERATOR_CLASS_NAME`<br></br>
+
+---
+
+> #### hoodie.datasource.write.precombine.field
+> Field used in preCombining before actual write. When two records have the
same key value, we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)<br></br>
+> **Default Value**: ts (Optional)<br></br>
+> `Config Param: PRECOMBINE_FIELD`<br></br>
+
+---
+
+> #### hoodie.datasource.write.payload.class
+> Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective<br></br>
+> **Default Value**:
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)<br></br>
+> `Config Param: PAYLOAD_CLASS_NAME`<br></br>
+
+---
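+
+If you do roll your own merge logic, the sketch below shows how it would be wired in. The class name `com.example.MyMergePayload` is hypothetical; it stands for any class extending `HoodieRecordPayload` that is on the writer classpath.
+
+```java
+// Hedged sketch: plugging a custom merge implementation into the write.
+// com.example.MyMergePayload is a placeholder, not a class Hudi ships.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.write.payload.class",
+        "com.example.MyMergePayload")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```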
+
+> #### hoodie.datasource.write.partitionpath.urlencode
+> Should we URL-encode the partition path value before creating the folder structure.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: URL_ENCODE_PARTITIONING`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.enable
+> When set to true, register/sync the table to Apache Hive metastore<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_SYNC_ENABLED`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.mode
+> Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: HIVE_SYNC_MODE`<br></br>
+
+---
+
+> #### hoodie.datasource.write.hive_style_partitioning
+> Flag to indicate whether to use Hive style partitioning.
+If set to true, the names of partition folders follow <partition_column_name>=<partition_value> format.
+By default false (the names of partition folders are only partition values)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_STYLE_PARTITIONING`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.partition_fields
+> Field in the table to use for determining hive partition columns.<br></br>
+> **Default Value**: (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_FIELDS`<br></br>
+
+---
+
+> #### hoodie.datasource.hive_sync.partition_extractor_class
+> Class which implements PartitionValueExtractor to extract the partition
values, default 'SlashEncodedDayPartitionValueExtractor'.<br></br>
+> **Default Value**:
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)<br></br>
+> `Config Param: HIVE_PARTITION_EXTRACTOR_CLASS`<br></br>
+
+---
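+
+These sync options are typically passed together with a regular write. A minimal sketch, with illustrative values (the partition field and sync mode shown here are examples, not defaults):
+
+```java
+// Hedged sketch: enable Hive metastore sync alongside the write.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.hive_sync.enable", "true")
+    .option("hoodie.datasource.hive_sync.mode", "hms")
+    .option("hoodie.datasource.hive_sync.partition_fields", "partition")
+    .option("hoodie.datasource.write.hive_style_partitioning", "true")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```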
+
+## Flink Sql Configs {#FLINK_SQL}
+These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick out the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
+
+### Flink Options {#Flink-Options}
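+
+These options are supplied in the `WITH` clause of a Flink SQL `CREATE TABLE` statement. Below is a minimal sketch using a few of the options described in this section; the schema, path and option values are illustrative only.
+
+```java
+// Hedged sketch: declare a Hudi table from the Flink Table API.
+// The schema and the path /tmp/hudi_table are placeholders.
+TableEnvironment tableEnv =
+    TableEnvironment.create(EnvironmentSettings.inStreamingMode());
+tableEnv.executeSql(
+    "CREATE TABLE hudi_table (" +
+    "  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED," +
+    "  name VARCHAR(10)," +
+    "  ts TIMESTAMP(3)," +
+    "  `partition` VARCHAR(20)" +
+    ") PARTITIONED BY (`partition`) WITH (" +
+    "  'connector' = 'hudi'," +
+    "  'path' = '/tmp/hudi_table'," +
+    "  'table.type' = 'MERGE_ON_READ'," +
+    "  'write.tasks' = '4'" +
+    ")");
+```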
+
+> #### path
+> Base path for the target hoodie table.
+The path will be created if it does not exist;
+otherwise, a Hoodie table is expected to have been successfully initialized there<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: PATH`<br></br>
+
+---
+
+> #### hoodie.table.name
+> Table name to register to Hive metastore<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: TABLE_NAME`<br></br>
+
+---
+
+
+> #### table.type
+> Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ<br></br>
+> **Default Value**: COPY_ON_WRITE (Optional)<br></br>
+> `Config Param: TABLE_TYPE`<br></br>
+
+---
+
+> #### write.operation
+> The write operation that this write should do<br></br>
+> **Default Value**: upsert (Optional)<br></br>
+> `Config Param: OPERATION`<br></br>
+
+---
+
+> #### write.tasks
+> Parallelism of tasks that do actual write, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: WRITE_TASKS`<br></br>
+
+---
+
+> #### write.bucket_assign.tasks
+> Parallelism of tasks that do bucket assign, default is the parallelism of
the execution environment<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: BUCKET_ASSIGN_TASKS`<br></br>
+
+---
+
+> #### write.precombine
+> Flag to indicate whether to drop duplicates before insert/upsert.
+By default, these cases accept duplicates to gain extra performance:
+1) insert operation;
+2) upsert for a MOR table (the MOR table deduplicates on reading)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: PRE_COMBINE`<br></br>
+
+---
+
+> #### read.tasks
+> Parallelism of tasks that do actual read, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: READ_TASKS`<br></br>
+
+---
+
+> #### read.start-commit
+> Start commit instant for reading; the commit time format should be 'yyyyMMddHHmmss'. By default, streaming reads start from the latest instant<br></br>
+> **Default Value**: N/A (Required)<br></br>
+> `Config Param: READ_START_COMMIT`<br></br>
+
+---
+
+> #### read.streaming.enabled
+> Whether to read as streaming source, default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: READ_AS_STREAMING`<br></br>
+
+---
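+
+Combining `read.streaming.enabled` with `read.start-commit` gives a continuous incremental pull. A hedged sketch, reusing the `tableEnv` from the earlier example; the instant time is a placeholder:
+
+```java
+// Hedged sketch: a streaming source over an existing Hudi table.
+tableEnv.executeSql(
+    "CREATE TABLE hudi_source (" +
+    "  uuid VARCHAR(20)," +
+    "  name VARCHAR(10)," +
+    "  ts TIMESTAMP(3)," +
+    "  `partition` VARCHAR(20)" +
+    ") WITH (" +
+    "  'connector' = 'hudi'," +
+    "  'path' = '/tmp/hudi_table'," +
+    "  'read.streaming.enabled' = 'true'," +
+    "  'read.start-commit' = '20220501000000'" +
+    ")");
+```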
+
+> #### compaction.tasks
+> Parallelism of tasks that do actual compaction, default is 4<br></br>
+> **Default Value**: 4 (Optional)<br></br>
+> `Config Param: COMPACTION_TASKS`<br></br>
+
+---
+
+> #### hoodie.datasource.write.hive_style_partitioning
+> Whether to use Hive style partitioning.
+If set to true, the names of partition folders follow <partition_column_name>=<partition_value> format.
+By default false (the names of partition folders are only partition values)<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_STYLE_PARTITIONING`<br></br>
+
+---
+
+> #### hive_sync.enable
+> Asynchronously sync Hive meta to HMS, default false<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: HIVE_SYNC_ENABLED`<br></br>
+
+---
+
+> #### hive_sync.mode
+> Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default
'jdbc'<br></br>
+> **Default Value**: jdbc (Optional)<br></br>
+> `Config Param: HIVE_SYNC_MODE`<br></br>
+
+---
+
+> #### hive_sync.table
+> Table name for hive sync, default 'unknown'<br></br>
+> **Default Value**: unknown (Optional)<br></br>
+> `Config Param: HIVE_SYNC_TABLE`<br></br>
+
+---
+
+> #### hive_sync.db
+> Database name for hive sync, default 'default'<br></br>
+> **Default Value**: default (Optional)<br></br>
+> `Config Param: HIVE_SYNC_DB`<br></br>
+
+---
+
+> #### hive_sync.partition_extractor_class
+> Tool to extract the partition value from HDFS path, default
'SlashEncodedDayPartitionValueExtractor'<br></br>
+> **Default Value**:
org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)<br></br>
+> `Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME`<br></br>
+
+---
+
+> #### hive_sync.metastore.uris
+> Metastore uris for hive sync, default ''<br></br>
+> **Default Value**: (Optional)<br></br>
+> `Config Param: HIVE_SYNC_METASTORE_URIS`<br></br>
+
+---
+
+
+## Write Client Configs {#WRITE_CLIENT}
+Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
+
+### Storage Configs
+
+Configurations that control aspects around writing, sizing, reading base and
log files.
+
+`Config Class`: org.apache.hudi.config.HoodieStorageConfig<br></br>
+
+> #### write.parquet.block.size
+> Parquet RowGroup size (in MB). It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.<br></br>
+> **Default Value**: 120 (Optional)<br></br>
+> `Config Param: WRITE_PARQUET_BLOCK_SIZE`<br></br>
+
+---
+
+> #### write.parquet.max.file.size
+> Target size (in MB) for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.<br></br>
+> **Default Value**: 120 (Optional)<br></br>
+> `Config Param: WRITE_PARQUET_MAX_FILE_SIZE`<br></br>
+
+---
+
+### Metadata Configs
+
+Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g. file listings) to avoid the overhead of accessing cloud storage during queries.
+
+`Config Class`: org.apache.hudi.common.config.HoodieMetadataConfig<br></br>
+
+> #### hoodie.metadata.enable
+> Enable the internal metadata table, which serves table metadata like file listings<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: ENABLE`<br></br>
+> `Since Version: 0.7.0`<br></br>
+
+---
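+
+Like any write client config, this flag can be set per write. A hedged sketch (the value shown is simply the documented default, made explicit):
+
+```java
+// Hedged sketch: set the metadata table flag explicitly on a write.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.metadata.enable", "true")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```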
+
+### Write Configurations
+
+Configurations that control write behavior on Hudi tables. These can be directly passed down from even higher level frameworks (e.g. Spark datasources, Flink sink) and utilities (e.g. DeltaStreamer).
+
+`Config Class`: org.apache.hudi.config.HoodieWriteConfig<br></br>
+
+> #### hoodie.combine.before.upsert
+> When upserted records share the same key, controls whether they should be first combined (i.e. de-duplicated) before writing to storage. This should be turned off only if you are absolutely certain that there are no duplicates incoming, otherwise it can lead to duplicate keys and violate the uniqueness guarantees.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: COMBINE_BEFORE_UPSERT`<br></br>
+
+---
+
+> #### hoodie.write.markers.type
+> Marker type to use. Two modes are supported: DIRECT: an individual marker file corresponding to each data file is directly created by the writer. TIMELINE_SERVER_BASED: marker operations are all handled at the timeline service, which serves as a proxy; new marker entries are batch processed and stored in a limited number of underlying files for efficiency. If HDFS is used or the timeline server is disabled, DIRECT markers are used as fallback even if this is configured. For Spark structured streaming, this configuration does not take effect, i.e., DIRECT markers are always used.<br></br>
+> **Default Value**: TIMELINE_SERVER_BASED (Optional)<br></br>
+> `Config Param: MARKERS_TYPE`<br></br>
+> `Since Version: 0.9.0`<br></br>
+
+---
+
+> #### hoodie.insert.shuffle.parallelism
+> Parallelism for inserting records into the table. Inserts can shuffle data
before writing to tune file sizes and optimize the storage layout.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: INSERT_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.rollback.parallelism
+> Parallelism for rollback of commits. Rollbacks perform delete of files or
logging delete blocks to file groups on storage in parallel.<br></br>
+> **Default Value**: 100 (Optional)<br></br>
+> `Config Param: ROLLBACK_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.combine.before.delete
+> During delete operations, controls whether we should combine deletes (and
potentially also upserts) before writing to storage.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: COMBINE_BEFORE_DELETE`<br></br>
+
+---
+
+> #### hoodie.combine.before.insert
+> When inserted records share the same key, controls whether they should be first combined (i.e. de-duplicated) before writing to storage.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: COMBINE_BEFORE_INSERT`<br></br>
+
+---
+
+> #### hoodie.bulkinsert.shuffle.parallelism
+> For large initial imports using bulk_insert operation, controls the parallelism to use for sort modes or custom partitioning done before writing records to the table.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: BULKINSERT_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.delete.shuffle.parallelism
+> Parallelism used for the "delete" operation. Delete operations also perform shuffles, similar to the upsert operation.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: DELETE_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.bulkinsert.sort.mode
+> Sorting modes to use for sorting records for bulk insert. This is used when hoodie.bulkinsert.user.defined.partitioner.class is not configured. Available values are: GLOBAL_SORT: this ensures best file sizes, with lowest memory overhead, at the cost of sorting. PARTITION_SORT: strikes a balance by only sorting within a partition, still keeping the memory overhead of writing lowest and best-effort file sizing. NONE: no sorting; fastest and matches `spark.write.parquet()` in terms of number of files and overheads<br></br>
+> **Default Value**: GLOBAL_SORT (Optional)<br></br>
+> `Config Param: BULK_INSERT_SORT_MODE`<br></br>
+
+---
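+
+A hedged sketch of an initial bulk load that swaps the default global sort for a per-partition sort; the parallelism and save mode here are illustrative, not recommendations:
+
+```java
+// Hedged sketch: bulk load with partition-level sorting.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.write.operation", "bulk_insert")
+    .option("hoodie.bulkinsert.sort.mode", "PARTITION_SORT")
+    .option("hoodie.bulkinsert.shuffle.parallelism", "400")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Overwrite)
+    .save(basePath);
+```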
+
+> #### hoodie.embed.timeline.server
+> When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics), running on each writer's driver process, accepting requests during the write from executors.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: EMBEDDED_TIMELINE_SERVER_ENABLE`<br></br>
+
+---
+
+> #### hoodie.upsert.shuffle.parallelism
+> Parallelism to use for upsert operation on the table. Upserts can shuffle data to perform index lookups, file sizing, and bin-packing records optimally into file groups.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: UPSERT_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.rollback.using.markers
+> Enables a more efficient mechanism for rollbacks based on the marker files
generated during the writes. Turned on by default.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: ROLLBACK_USING_MARKERS_ENABLE`<br></br>
+
+---
+
+> #### hoodie.finalize.write.parallelism
+> Parallelism for the write finalization internal operation, which involves removing any partially written files from lake storage, before committing the write. Reduce this value, if the high number of tasks incurs delays for smaller tables or low latency writes.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: FINALIZE_WRITE_PARALLELISM_VALUE`<br></br>
+
+---
+
+### Compaction Configs {#Compaction-Configs}
+
+Configurations that control compaction (merging of log files onto new base files) as well as cleaning (reclamation of older/unused file groups/slices).
+
+`Config Class`: org.apache.hudi.config.HoodieCompactionConfig<br></br>
+
+> #### hoodie.cleaner.policy
+> Cleaning policy to be used. The cleaner service deletes older file slices to reclaim space. By default, the cleaner spares the file slices written by the last N commits, determined by hoodie.cleaner.commits.retained. Long-running query plans may often refer to older file slices and will break if those are cleaned before the query has had a chance to run. So, it is good to make sure that the data is retained for more than the maximum query execution time<br></br>
+> **Default Value**: KEEP_LATEST_COMMITS (Optional)<br></br>
+> `Config Param: CLEANER_POLICY`<br></br>
+
+---
+
+> #### hoodie.copyonwrite.record.size.estimate
+> The average record size. If not explicitly specified, Hudi will compute the record size estimate dynamically based on commit metadata. This is critical in computing the insert parallelism and bin-packing inserts into small files.<br></br>
+> **Default Value**: 1024 (Optional)<br></br>
+> `Config Param: COPY_ON_WRITE_RECORD_SIZE_ESTIMATE`<br></br>
+
+---
+
+> #### hoodie.compact.inline.max.delta.seconds
+> Number of elapsed seconds after the last compaction, before scheduling a new
one.<br></br>
+> **Default Value**: 3600 (Optional)<br></br>
+> `Config Param: INLINE_COMPACT_TIME_DELTA_SECONDS`<br></br>
+
+---
+
+> #### hoodie.cleaner.commits.retained
+> Number of commits to retain, without cleaning. Data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much data retention the table supports for incremental queries.<br></br>
+> **Default Value**: 10 (Optional)<br></br>
+> `Config Param: CLEANER_COMMITS_RETAINED`<br></br>
+
+---
+
+> #### hoodie.clean.async
+> Only applies when hoodie.clean.automatic is turned on. When turned on, runs the cleaner asynchronously with writing, which can speed up overall write performance.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: ASYNC_CLEAN`<br></br>
+
+---
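+
+Putting the cleaning knobs together, a hedged sketch that retains more history and moves cleaning off the write path; the retained-commit count is illustrative:
+
+```java
+// Hedged sketch: retain 20 commits and clean asynchronously with writing.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
+    .option("hoodie.cleaner.commits.retained", "20")
+    .option("hoodie.clean.async", "true")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```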
+
+> #### hoodie.clean.automatic
+> When enabled, the cleaner table service is invoked immediately after each
commit, to delete older file slices. It's recommended to enable this, to ensure
metadata and data storage growth is bounded.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: AUTO_CLEAN`<br></br>
+
+---
+
+> #### hoodie.commits.archival.batch
+> Archiving of instants is batched in a best-effort manner, to pack more instants into a single archive log. This config controls such archival batch size.<br></br>
+> **Default Value**: 10 (Optional)<br></br>
+> `Config Param: COMMITS_ARCHIVAL_BATCH_SIZE`<br></br>
+
+---
+
+> #### hoodie.compact.inline
+> When set to true, the compaction service is triggered after each write. While simpler operationally, this adds extra latency on the write path.<br></br>
+> **Default Value**: false (Optional)<br></br>
+> `Config Param: INLINE_COMPACT`<br></br>
+
+---
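+
+A hedged sketch of inline compaction on a MERGE_ON_READ table, scheduled after every 3 delta commits via `hoodie.compact.inline.max.delta.commits` (described further below); the values are illustrative:
+
+```java
+// Hedged sketch: compact inline after every 3 delta commits.
+inputDF.write()
+    .format("org.apache.hudi")
+    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
+    .option("hoodie.compact.inline", "true")
+    .option("hoodie.compact.inline.max.delta.commits", "3")
+    .option(HoodieWriteConfig.TABLE_NAME, tableName)
+    .mode(SaveMode.Append)
+    .save(basePath);
+```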
+
+> #### hoodie.parquet.small.file.limit
+> During upsert operation, we opportunistically expand existing small files on storage, instead of writing new files, to keep the number of files at an optimum. This config sets the file size limit below which a file on storage becomes a candidate to be selected as such a `small file`. By default, treat any file <= 100MB as a small file.<br></br>
+> **Default Value**: 104857600 (Optional)<br></br>
+> `Config Param: PARQUET_SMALL_FILE_LIMIT`<br></br>
+
+---
+
+> #### hoodie.compaction.strategy
+> Compaction strategy decides which file groups are picked up for compaction during each compaction run. By default, Hudi picks the log file with the most accumulated unmerged data<br></br>
+> **Default Value**:
org.apache.hudi.table.action.compact.strategy.LogFileSizeBasedCompactionStrategy
(Optional)<br></br>
+> `Config Param: COMPACTION_STRATEGY`<br></br>
+
+---
+
+> #### hoodie.archive.automatic
+> When enabled, the archival table service is invoked immediately after each
commit, to archive commits if we cross a maximum value of commits. It's
recommended to enable this, to ensure number of active commits is
bounded.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: AUTO_ARCHIVE`<br></br>
+
+---
+
+> #### hoodie.copyonwrite.insert.auto.split
+> Config to control whether insert split sizes are determined automatically based on average record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely cumbersome.<br></br>
+> **Default Value**: true (Optional)<br></br>
+> `Config Param: COPY_ON_WRITE_AUTO_SPLIT_INSERTS`<br></br>
+
+---
+
+> #### hoodie.compact.inline.max.delta.commits
+> Number of delta commits after the last compaction, before scheduling of a
new compaction is attempted.<br></br>
+> **Default Value**: 5 (Optional)<br></br>
+> `Config Param: INLINE_COMPACT_NUM_DELTA_COMMITS`<br></br>
+
+---
+
+> #### hoodie.keep.min.commits
+> Similar to hoodie.keep.max.commits, but controls the minimum number of instants to retain in the active timeline.<br></br>
+> **Default Value**: 20 (Optional)<br></br>
+> `Config Param: MIN_COMMITS_TO_KEEP`<br></br>
+
+---
+
+> #### hoodie.cleaner.parallelism
+> Parallelism for the cleaning operation. Increase this if cleaning becomes
slow.<br></br>
+> **Default Value**: 200 (Optional)<br></br>
+> `Config Param: CLEANER_PARALLELISM_VALUE`<br></br>
+
+---
+
+> #### hoodie.record.size.estimation.threshold
+> We use the previous commits' metadata to calculate the estimated record size and use it to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)<br></br>
Review Comment:
Two more places in this statement have extra spaces.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]