This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 57372eb Fixing some of the configs (#2947)
57372eb is described below
commit 57372ebd00a68e2ba4faae700f9417e31324c238
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu May 13 14:38:10 2021 -0400
Fixing some of the configs (#2947)
---
docs/_docs/0.6.0/2_4_configurations.md | 7 ++++++-
docs/_docs/0.7.0/2_4_configurations.md | 15 ++++++++++-----
docs/_docs/0.8.0/2_4_configurations.md | 12 +++++++++++-
docs/_docs/2_4_configurations.md | 22 ++++++++++++++++------
4 files changed, 43 insertions(+), 13 deletions(-)
diff --git a/docs/_docs/0.6.0/2_4_configurations.md
b/docs/_docs/0.6.0/2_4_configurations.md
index b97c867..e4ef4b1 100644
--- a/docs/_docs/0.6.0/2_4_configurations.md
+++ b/docs/_docs/0.6.0/2_4_configurations.md
@@ -93,7 +93,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
diff --git a/docs/_docs/0.7.0/2_4_configurations.md
b/docs/_docs/0.7.0/2_4_configurations.md
index d243105..a0f8467 100644
--- a/docs/_docs/0.7.0/2_4_configurations.md
+++ b/docs/_docs/0.7.0/2_4_configurations.md
@@ -93,7 +93,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
@@ -243,10 +248,6 @@ Property: `hoodie.write.status.storage.level`<br/>
Property: `hoodie.auto.commit`<br/>
<span style="color:grey">Should HoodieWriteClient autoCommit after insert and
upsert. The client can choose to turn off auto-commit and commit on a "defined
success condition"</span>
-#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
-Property: `hoodie.assume.date.partitioning`<br/>
-<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e three levels from base path. This is a stop-gap to
support tables created by versions < 0.3.1. Will be removed eventually </span>
-
#### withConsistencyCheckEnabled(enabled = false)
{#withConsistencyCheckEnabled}
Property: `hoodie.consistency.check.enabled`<br/>
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files' are listable on the underlying filesystem/storage. Set
this to true, to workaround S3's eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span>
@@ -559,6 +560,10 @@ Property: `hoodie.metadata.compact.max.delta.commits` <br/>
Property: `hoodie.metadata.keep.min.commits`,
`hoodie.metadata.keep.max.commits` <br/>
<span style="color:grey"> Controls the archival of the metadata table's
timeline </span>
+#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
+Property: `hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e. three levels from the base path. This is a stop-gap
to support tables created by versions < 0.3.1. It will be removed eventually.</span>
+
### Clustering Configs
Controls clustering operations in hudi. Each clustering has to be configured
for its strategy, and config params. This config drives the same.
diff --git a/docs/_docs/0.8.0/2_4_configurations.md
b/docs/_docs/0.8.0/2_4_configurations.md
index 207bf80..e93d304 100644
--- a/docs/_docs/0.8.0/2_4_configurations.md
+++ b/docs/_docs/0.8.0/2_4_configurations.md
@@ -94,7 +94,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
@@ -288,6 +293,11 @@ Property: `hoodie.combine.before.insert`,
`hoodie.combine.before.upsert`<br/>
Property: `hoodie.combine.before.delete`<br/>
<span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before deleting in DFS</span>
+#### withMergeAllowDuplicateOnInserts(mergeAllowDuplicateOnInserts = false)
{#withMergeAllowDuplicateOnInserts}
+Property: `hoodie.merge.allow.duplicate.on.inserts` <br/>
+<span style="color:grey"> When enabled, routes new records as inserts and does
not merge them with existing records.
+The result may contain duplicate entries. </span>
+
#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER)
{#withWriteStatusStorageLevel}
Property: `hoodie.write.status.storage.level`<br/>
<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert
returns a persisted RDD[WriteStatus], this is because the Client can choose to
inspect the WriteStatus and choose and commit or not based on the failures.
This is a configuration for the storage level for this RDD </span>
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index d8cad48..db93226 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -93,11 +93,16 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
-
+
#### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
Property: `hoodie.datasource.hive_sync.database`, Default: `default` <br/>
<span style="color:grey">database to sync to</span>
@@ -327,6 +332,11 @@ Property: `hoodie.combine.before.insert`,
`hoodie.combine.before.upsert`<br/>
Property: `hoodie.combine.before.delete`<br/>
<span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before deleting in DFS</span>
+#### withMergeAllowDuplicateOnInserts(mergeAllowDuplicateOnInserts = false)
{#withMergeAllowDuplicateOnInserts}
+Property: `hoodie.merge.allow.duplicate.on.inserts` <br/>
+<span style="color:grey"> When enabled, routes new records as inserts and does
not merge them with existing records.
+The result may contain duplicate entries. </span>
+
#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER)
{#withWriteStatusStorageLevel}
Property: `hoodie.write.status.storage.level`<br/>
<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert
returns a persisted RDD[WriteStatus], this is because the Client can choose to
inspect the WriteStatus and choose and commit or not based on the failures.
This is a configuration for the storage level for this RDD </span>
@@ -335,10 +345,6 @@ Property: `hoodie.write.status.storage.level`<br/>
Property: `hoodie.auto.commit`<br/>
<span style="color:grey">Should HoodieWriteClient autoCommit after insert and
upsert. The client can choose to turn off auto-commit and commit on a "defined
success condition"</span>
-#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
-Property: `hoodie.assume.date.partitioning`<br/>
-<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e three levels from base path. This is a stop-gap to
support tables created by versions < 0.3.1. Will be removed eventually </span>
-
#### withConsistencyCheckEnabled(enabled = false)
{#withConsistencyCheckEnabled}
Property: `hoodie.consistency.check.enabled`<br/>
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files' are listable on the underlying filesystem/storage. Set
this to true, to workaround S3's eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span>
@@ -655,6 +661,10 @@ Property: `hoodie.metadata.compact.max.delta.commits` <br/>
Property: `hoodie.metadata.keep.min.commits`,
`hoodie.metadata.keep.max.commits` <br/>
<span style="color:grey"> Controls the archival of the metadata table's
timeline </span>
+#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
+Property: `hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e. three levels from the base path. This is a stop-gap
to support tables created by versions < 0.3.1. It will be removed eventually.</span>
+
### Clustering Configs
Controls clustering operations in hudi. Each clustering has to be configured
for its strategy, and config params. This config drives the same.
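Taken together, the datasource configs touched by this patch would be passed as options on a Spark write. A minimal sketch of such an option map, assuming PySpark with the Hudi bundle on the classpath (the table path and the write call itself are placeholder assumptions, so the write is left commented out):

```python
# Sketch: Hudi write-option map built from the configs documented in this patch.
# Keys and defaults come from the docs above; values here are illustrative.
hudi_options = {
    "hoodie.datasource.write.insert.drop.duplicates": "false",  # default: keep duplicates on insert
    "hoodie.datasource.write.row.writer.enable": "true",        # Row-based bulk_insert path
    "hoodie.merge.allow.duplicate.on.inserts": "true",          # route new records as inserts
    "hoodie.datasource.hive_sync.enable": "false",              # default: no Hive metastore sync
}

# Assumed usage with a DataFrame `df` and a placeholder path:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi_table")

for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```

Note that `hoodie.datasource.write.row.writer.enable` only affects the bulk_insert operation, while `hoodie.merge.allow.duplicate.on.inserts` trades merge cost for possible duplicates on insert.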