This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 57372eb Fixing some of the configs (#2947)
57372eb is described below
commit 57372ebd00a68e2ba4faae700f9417e31324c238
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu May 13 14:38:10 2021 -0400
Fixing some of the configs (#2947)
---
docs/_docs/0.6.0/2_4_configurations.md | 7 ++++++-
docs/_docs/0.7.0/2_4_configurations.md | 15 ++++++++++-----
docs/_docs/0.8.0/2_4_configurations.md | 12 +++++++++++-
docs/_docs/2_4_configurations.md | 22 ++++++++++++++++------
4 files changed, 43 insertions(+), 13 deletions(-)
diff --git a/docs/_docs/0.6.0/2_4_configurations.md
b/docs/_docs/0.6.0/2_4_configurations.md
index b97c867..e4ef4b1 100644
--- a/docs/_docs/0.6.0/2_4_configurations.md
+++ b/docs/_docs/0.6.0/2_4_configurations.md
@@ -93,7 +93,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
diff --git a/docs/_docs/0.7.0/2_4_configurations.md
b/docs/_docs/0.7.0/2_4_configurations.md
index d243105..a0f8467 100644
--- a/docs/_docs/0.7.0/2_4_configurations.md
+++ b/docs/_docs/0.7.0/2_4_configurations.md
@@ -93,7 +93,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
@@ -243,10 +248,6 @@ Property: `hoodie.write.status.storage.level`<br/>
Property: `hoodie.auto.commit`<br/>
<span style="color:grey">Should HoodieWriteClient autoCommit after insert and
upsert. The client can choose to turn off auto-commit and commit on a "defined
success condition"</span>
-#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
-Property: `hoodie.assume.date.partitioning`<br/>
-<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e three levels from base path. This is a stop-gap to
support tables created by versions < 0.3.1. Will be removed eventually </span>
-
#### withConsistencyCheckEnabled(enabled = false)
{#withConsistencyCheckEnabled}
Property: `hoodie.consistency.check.enabled`<br/>
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files' are listable on the underlying filesystem/storage. Set
this to true, to workaround S3's eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span>
@@ -559,6 +560,10 @@ Property: `hoodie.metadata.compact.max.delta.commits` <br/>
Property: `hoodie.metadata.keep.min.commits`,
`hoodie.metadata.keep.max.commits` <br/>
<span style="color:grey"> Controls the archival of the metadata table's
timeline </span>
+#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
+Property: `hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e. three levels from the base path. This is a stop-gap
to support tables created by versions < 0.3.1. It will be removed eventually.</span>
+
### Clustering Configs
Controls clustering operations in hudi. Each clustering has to be configured
for its strategy, and config params. This config drives the same.
diff --git a/docs/_docs/0.8.0/2_4_configurations.md
b/docs/_docs/0.8.0/2_4_configurations.md
index 207bf80..e93d304 100644
--- a/docs/_docs/0.8.0/2_4_configurations.md
+++ b/docs/_docs/0.8.0/2_4_configurations.md
@@ -94,7 +94,12 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
@@ -288,6 +293,11 @@ Property: `hoodie.combine.before.insert`,
`hoodie.combine.before.upsert`<br/>
Property: `hoodie.combine.before.delete`<br/>
<span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before deleting in DFS</span>
+#### withMergeAllowDuplicateOnInserts(mergeAllowDuplicateOnInserts = false)
{#withMergeAllowDuplicateOnInserts}
+Property: `hoodie.merge.allow.duplicate.on.inserts` <br/>
+<span style="color:grey"> When enabled, routes new records as inserts and does
not merge them with existing records.
+The result may contain duplicate entries. </span>
+
#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER)
{#withWriteStatusStorageLevel}
Property: `hoodie.write.status.storage.level`<br/>
<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert
returns a persisted RDD[WriteStatus], this is because the Client can choose to
inspect the WriteStatus and choose and commit or not based on the failures.
This is a configuration for the storage level for this RDD </span>
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index d8cad48..db93226 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -93,11 +93,16 @@ This is useful to store checkpointing information, in a
consistent way with the
#### INSERT_DROP_DUPS_OPT_KEY {#INSERT_DROP_DUPS_OPT_KEY}
Property: `hoodie.datasource.write.insert.drop.duplicates`, Default: `false`
<br/>
<span style="color:grey">If set to true, filters out all duplicate records
from incoming dataframe, during insert operations. </span>
-
+
+#### ENABLE_ROW_WRITER_OPT_KEY {#ENABLE_ROW_WRITER_OPT_KEY}
+Property: `hoodie.datasource.write.row.writer.enable`, Default: `false` <br/>
+<span style="color:grey">When set to true, performs write operations directly
using the Spark-native `Row`
+representation. This is expected to be 20 to 30% faster than regular
bulk_insert.</span>
+
#### HIVE_SYNC_ENABLED_OPT_KEY {#HIVE_SYNC_ENABLED_OPT_KEY}
Property: `hoodie.datasource.hive_sync.enable`, Default: `false` <br/>
<span style="color:grey">When set to true, register/sync the table to Apache
Hive metastore</span>
-
+
#### HIVE_DATABASE_OPT_KEY {#HIVE_DATABASE_OPT_KEY}
Property: `hoodie.datasource.hive_sync.database`, Default: `default` <br/>
<span style="color:grey">database to sync to</span>
@@ -327,6 +332,11 @@ Property: `hoodie.combine.before.insert`,
`hoodie.combine.before.upsert`<br/>
Property: `hoodie.combine.before.delete`<br/>
<span style="color:grey">Flag which first combines the input RDD and merges
multiple partial records into a single record before deleting in DFS</span>
+#### withMergeAllowDuplicateOnInserts(mergeAllowDuplicateOnInserts = false)
{#withMergeAllowDuplicateOnInserts}
+Property: `hoodie.merge.allow.duplicate.on.inserts` <br/>
+<span style="color:grey"> When enabled, routes new records as inserts and does
not merge them with existing records.
+The result may contain duplicate entries. </span>
+
#### withWriteStatusStorageLevel(level = MEMORY_AND_DISK_SER)
{#withWriteStatusStorageLevel}
Property: `hoodie.write.status.storage.level`<br/>
<span style="color:grey">HoodieWriteClient.insert and HoodieWriteClient.upsert
returns a persisted RDD[WriteStatus], this is because the Client can choose to
inspect the WriteStatus and choose and commit or not based on the failures.
This is a configuration for the storage level for this RDD </span>
@@ -335,10 +345,6 @@ Property: `hoodie.write.status.storage.level`<br/>
Property: `hoodie.auto.commit`<br/>
<span style="color:grey">Should HoodieWriteClient autoCommit after insert and
upsert. The client can choose to turn off auto-commit and commit on a "defined
success condition"</span>
-#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
-Property: `hoodie.assume.date.partitioning`<br/>
-<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e three levels from base path. This is a stop-gap to
support tables created by versions < 0.3.1. Will be removed eventually </span>
-
#### withConsistencyCheckEnabled(enabled = false)
{#withConsistencyCheckEnabled}
Property: `hoodie.consistency.check.enabled`<br/>
<span style="color:grey">Should HoodieWriteClient perform additional checks to
ensure written files' are listable on the underlying filesystem/storage. Set
this to true, to workaround S3's eventual consistency model and ensure all data
written as a part of a commit is faithfully available for queries. </span>
@@ -655,6 +661,10 @@ Property: `hoodie.metadata.compact.max.delta.commits` <br/>
Property: `hoodie.metadata.keep.min.commits`,
`hoodie.metadata.keep.max.commits` <br/>
<span style="color:grey"> Controls the archival of the metadata table's
timeline </span>
+#### withAssumeDatePartitioning(assumeDatePartitioning = false)
{#withAssumeDatePartitioning}
+Property: `hoodie.assume.date.partitioning`<br/>
+<span style="color:grey">Should HoodieWriteClient assume the data is
partitioned by dates, i.e. three levels from the base path. This is a stop-gap
to support tables created by versions < 0.3.1. It will be removed eventually.</span>
+
### Clustering Configs
Controls clustering operations in hudi. Each clustering has to be configured
for its strategy, and config params. This config drives the same.
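Taken together, the datasource configs touched by this patch would be passed as options on a Spark write. A minimal sketch of such an option map, assuming PySpark with the Hudi bundle on the classpath (the table path and the write call itself are placeholder assumptions, so the write is left commented out):

```python
# Sketch: Hudi write-option map built from the configs documented in this patch.
# Keys and defaults come from the docs above; values here are illustrative.
hudi_options = {
    "hoodie.datasource.write.insert.drop.duplicates": "false",  # default: keep duplicates on insert
    "hoodie.datasource.write.row.writer.enable": "true",        # Row-based bulk_insert path
    "hoodie.merge.allow.duplicate.on.inserts": "true",          # route new records as inserts
    "hoodie.datasource.hive_sync.enable": "false",              # default: no Hive metastore sync
}

# Assumed usage with a DataFrame `df` and a placeholder path:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi_table")

for key, value in sorted(hudi_options.items()):
    print(f"{key}={value}")
```

Note that `hoodie.datasource.write.row.writer.enable` only affects the bulk_insert operation, while `hoodie.merge.allow.duplicate.on.inserts` trades merge cost for possible duplicates on insert.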