This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3dd51f7 Configurations update for releasing 0.4.6
3dd51f7 is described below
commit 3dd51f7c121a1b008f52a3917d5cc701b5f35aeb
Author: Balaji Varadarajan <[email protected]>
AuthorDate: Mon May 13 22:20:25 2019 -0700
Configurations update for releasing 0.4.6
---
docs/configurations.md | 36 +++++++++++++++++++++++++++++++-----
docs/writing_data.md | 17 +++++++++++++++--
2 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/docs/configurations.md b/docs/configurations.md
index e8cab52..9580aa3 100644
--- a/docs/configurations.md
+++ b/docs/configurations.md
@@ -244,21 +244,37 @@ Property: `hoodie.bloom.index.prune.by.ranges` <br/>
Property: `hoodie.bloom.index.use.caching` <br/>
<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
the input RDD will be cached to speed up index lookup by reducing IO for computing
parallelism or affected partitions</span>
+##### bloomIndexTreebasedFilter(useTreeFilter = true) {#bloomIndexTreebasedFilter}
+Property: `hoodie.bloom.index.use.treebased.filter` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
+interval tree based file pruning optimization is enabled. This mode speeds up
+file pruning based on key ranges when compared with the brute-force mode</span>
+
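The range-pruning idea above can be sketched in a few lines of Python (an illustrative toy, not Hudi's interval-tree implementation; the file ids and key ranges are made up):

```python
# Sketch of the optimization behind hoodie.bloom.index.use.treebased.filter:
# before consulting any bloom filters, drop files whose [min_key, max_key]
# range cannot contain the key. Hudi builds an interval tree over the ranges;
# a linear scan shows the same pruning effect.

def prune_files(files, key):
    """Return only the file ids whose key range could contain `key`.

    `files` maps file id -> (min_key, max_key).
    """
    return [fid for fid, (lo, hi) in files.items() if lo <= key <= hi]

files = {
    "f1": ("000", "099"),
    "f2": ("100", "199"),
    "f3": ("150", "299"),
}
# f1 is pruned without ever touching its bloom filter
print(prune_files(files, "170"))  # ['f2', 'f3']
```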
+##### bloomIndexBucketizedChecking(bucketizedChecking = true) {#bloomIndexBucketizedChecking}
+Property: `hoodie.bloom.index.bucketized.checking` <br/>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
+bucketized bloom filtering is enabled. This reduces skew seen in sort-based
+bloom index lookup</span>
+
+##### bloomIndexKeysPerBucket(keysPerBucket = 10000000) {#bloomIndexKeysPerBucket}
+Property: `hoodie.bloom.index.keys.per.bucket` <br/>
+<span style="color:grey">Only applies if bloomIndexBucketizedChecking is
+enabled and index type is BLOOM. <br/> This configuration controls the "bucket"
+size, which tracks the number of record-key checks made against a single file
+and is the unit of work allocated to each partition performing bloom filter
+lookup. A higher value would amortize the fixed cost of reading a bloom filter
+into memory.</span>
+
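The bucketing arithmetic can be sketched as follows (illustrative only; `num_buckets` is a hypothetical helper, not a Hudi API):

```python
# Sketch of hoodie.bloom.index.keys.per.bucket: the record-key checks
# destined for one file are split into fixed-size "buckets", each a unit
# of work, so one hot file cannot skew a single lookup task.
import math

def num_buckets(key_checks_for_file, keys_per_bucket=10_000_000):
    """Number of work units for one file's record-key checks."""
    return max(1, math.ceil(key_checks_for_file / keys_per_bucket))

# a file receiving 25M key checks becomes 3 units of work instead of one
print(num_buckets(25_000_000))  # 3
```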
##### bloomIndexParallelism(0) {#bloomIndexParallelism}
Property: `hoodie.bloom.index.parallelism` <br/>
<span style="color:grey">Only applies if index type is BLOOM. <br/> This is
the amount of parallelism for index lookup, which involves a Spark Shuffle. By
default, this is auto computed based on input workload characteristics</span>
##### hbaseZkQuorum(zkString) [Required] {#hbaseZkQuorum}
Property: `hoodie.index.hbase.zkquorum` <br/>
-<span style="color:grey">Only application if index type is HBASE. HBase ZK
-Quorum url to connect to.</span>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum
+url to connect to.</span>
##### hbaseZkPort(port) [Required] {#hbaseZkPort}
Property: `hoodie.index.hbase.zkport` <br/>
-<span style="color:grey">Only application if index type is HBASE. HBase ZK
-Quorum port to connect to.</span>
+<span style="color:grey">Only applies if index type is HBASE. HBase ZK Quorum
+port to connect to.</span>
+
+##### hbaseZkZnodeParent(zkZnodeParent) [Required] {#hbaseZkZnodeParent}
+Property: `hoodie.index.hbase.zknode.path` <br/>
+<span style="color:grey">Only applies if index type is HBASE. This is the root
+znode that will contain all the znodes created/used by HBase.</span>
##### hbaseTableName(tableName) [Required] {#hbaseTableName}
Property: `hoodie.index.hbase.table` <br/>
-<span style="color:grey">Only application if index type is HBASE. HBase Table
-name to use as the index. Hudi stores the row_key and [partition_path, fileID,
-commitTime] mapping in the table.</span>
+<span style="color:grey">Only applies if index type is HBASE. HBase Table name
+to use as the index. Hudi stores the row_key and [partition_path, fileID,
+commitTime] mapping in the table.</span>
#### Storage configs
@@ -282,7 +298,7 @@ Property: `hoodie.parquet.page.size` <br/>
Property: `hoodie.parquet.compression.ratio` <br/>
<span style="color:grey">Expected compression of parquet data used by Hudi,
when it tries to size new parquet files. Increase this value if bulk_insert is
producing smaller than expected sized files</span>
-##### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec}
+##### parquetCompressionCodec(parquetCompressionCodec = gzip) {#parquetCompressionCodec}
Property: `hoodie.parquet.compression.codec` <br/>
<span style="color:grey">Parquet compression codec name. Default is gzip.
Possible options are [gzip | snappy | uncompressed | lzo]</span>
@@ -298,7 +314,10 @@ Property: `hoodie.logfile.data.block.max.size` <br/>
Property: `hoodie.logfile.to.parquet.compression.ratio` <br/>
<span style="color:grey">Expected additional compression as records move from
log files to parquet. Used for merge_on_read storage to send inserts into log
files & control the size of compacted parquet file.</span>
#### Compaction configs
Configs that control compaction (merging of log files onto a new parquet base
file), cleaning (reclamation of older/unused file groups).
[withCompactionConfig](#withCompactionConfig) (HoodieCompactionConfig) <br/>
@@ -315,6 +334,10 @@ Property: `hoodie.cleaner.commits.retained` <br/>
Property: `hoodie.keep.min.commits`, `hoodie.keep.max.commits` <br/>
<span style="color:grey">Each commit is a small file in the `.hoodie`
directory. Since DFS typically does not favor lots of small files, Hudi
archives older commits into a sequential log. A commit is published atomically
by a rename of the commit file.</span>
+##### withCommitsArchivalBatchSize(batch = 10) {#withCommitsArchivalBatchSize}
+Property: `hoodie.commits.archival.batch` <br/>
+<span style="color:grey">This controls the number of commit instants read in
+memory as a batch and archived together.</span>
+
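The batching described above amounts to simple chunking; a minimal sketch (names are illustrative, not Hudi internals):

```python
# Sketch of hoodie.commits.archival.batch: archived commit instants are
# read and written in fixed-size batches rather than all at once.

def batches(instants, batch_size=10):
    """Yield successive fixed-size batches of commit instants."""
    for i in range(0, len(instants), batch_size):
        yield instants[i:i + batch_size]

commits = [f"commit_{i:03d}" for i in range(23)]
print([len(b) for b in batches(commits)])  # [10, 10, 3]
```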
##### compactionSmallFileSize(size = 0) {#compactionSmallFileSize}
Property: `hoodie.parquet.small.file.limit` <br/>
<span style="color:grey">This should be less than maxFileSize; setting it to
0 turns off this feature. Small files can always happen because of the number
of insert records in a partition in a batch. Hudi has an option to auto-resolve
small files by masking inserts into this partition as updates to existing small
files. The size here is the minimum file size considered as a "small file
size".</span>
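A rough sketch of the small-file selection described above (the function and parameter names are hypothetical, not Hudi's internals):

```python
# Sketch of hoodie.parquet.small.file.limit: a file below the limit is a
# candidate to receive new inserts (masked as updates) until it approaches
# the max file size; a limit of 0 disables the feature.

def small_file_candidates(file_sizes, small_file_limit, max_file_size):
    """Map each "small" file to the room it has left for new inserts.

    `file_sizes` maps file id -> current size in bytes.
    """
    if small_file_limit <= 0:
        return {}  # feature turned off, mirroring the config above
    return {
        fid: max_file_size - size  # capacity left before hitting max size
        for fid, size in file_sizes.items()
        if size < small_file_limit
    }

sizes = {"f1": 10 * 1024 ** 2, "f2": 120 * 1024 ** 2}
# only f1 is under the 100MB small-file limit
print(small_file_candidates(sizes, small_file_limit=100 * 1024 ** 2,
                            max_file_size=128 * 1024 ** 2))
```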
@@ -407,3 +430,6 @@ Property: `hoodie.memory.merge.fraction` <br/>
Property: `hoodie.memory.compaction.fraction` <br/>
<span style="color:grey">HoodieCompactedLogScanner reads logblocks, converts
records to HoodieRecords and then merges these log blocks and records. At any
point, the number of entries in a log block can be less than or equal to the
number of entries in the corresponding parquet file. This can lead to OOM in
the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use
this config to set the max allowable inMemory footprint of the spillable
map.</span>
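The spillable-map idea can be illustrated with a toy Python version (this assumes nothing about Hudi's actual implementation beyond the behaviour described above; the byte accounting via `sys.getsizeof` is only a rough estimate):

```python
# Toy spillable map: keep entries in an in-memory dict up to a byte budget,
# then spill further entries to disk so the scanner cannot OOM.
import os
import shelve
import sys
import tempfile

class SpillableMap:
    def __init__(self, max_in_memory_bytes):
        self.budget = max_in_memory_bytes
        self.used = 0
        self.mem = {}
        # overflow entries are persisted via shelve (keys must be strings)
        self.disk = shelve.open(os.path.join(tempfile.mkdtemp(), "spill"))

    def put(self, key, value):
        entry_size = sys.getsizeof(key) + sys.getsizeof(value)  # rough estimate
        if self.used + entry_size <= self.budget:
            self.mem[key] = value
            self.used += entry_size
        else:
            self.disk[key] = value  # memory budget exhausted: spill to disk

    def get(self, key):
        if key in self.mem:
            return self.mem[key]
        return self.disk.get(key)
```

Lookups check memory first and fall back to disk, trading IO for a bounded in-memory footprint, which is what the fraction configs above are sizing.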
+##### withWriteStatusFailureFraction(failureFraction = 0.1) {#withWriteStatusFailureFraction}
+Property: `hoodie.memory.writestatus.failure.fraction` <br/>
+<span style="color:grey">This property controls what fraction of failed
+records' exceptions are reported back to the driver</span>
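A deterministic sketch of keeping only a fraction of failures (Hudi's actual selection strategy may differ; the helper name is hypothetical):

```python
# Sketch of hoodie.memory.writestatus.failure.fraction: shipping every
# failed record's exception back to the driver can overwhelm it, so only
# a fraction of them are retained per write status.

def sample_failures(failed_records, failure_fraction=0.1):
    """Keep only a fraction of failed-record exceptions for the driver."""
    if not failed_records:
        return []
    keep = max(1, int(len(failed_records) * failure_fraction))
    return failed_records[:keep]  # deterministic sketch; real selection may differ

print(len(sample_failures(list(range(1000)))))  # 100
```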
diff --git a/docs/writing_data.md b/docs/writing_data.md
index c060134..54a3801 100644
--- a/docs/writing_data.md
+++ b/docs/writing_data.md
@@ -26,8 +26,22 @@ Command line options describe capabilities in more detail
[hoodie]$ spark-submit --class
com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer `ls
hoodie-utilities/target/hoodie-utilities-*-SNAPSHOT.jar` --help
Usage: <main class> [options]
Options:
+ --commit-on-errors
+ Commit even when some records failed to be written
+ Default: false
+ --enable-hive-sync
+ Enable syncing to hive
+ Default: false
+ --filter-dupes
+ Should duplicate records from source be dropped/filtered out before
+ insert/bulk-insert
+ Default: false
--help, -h
-
+ --hoodie-conf
+ Any configuration that can be set in the properties file (using the CLI
+ parameter "--propsFilePath") can also be passed on the command line using
+ this parameter
+ Default: []
--key-generator-class
Subclass of com.uber.hoodie.KeyGenerator to generate a HoodieKey from
the given avro record. Built in: SimpleKeyGenerator (uses provided field
@@ -84,7 +98,6 @@ Usage: <main class> [options]
schema) before writing. Default: Not set. E.g. -
com.uber.hoodie.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
-
```
The tool takes a hierarchically composed property file and has pluggable
interfaces for extracting data, key generation and providing schema. Sample
configs for ingesting from kafka and dfs are
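Since any of these properties can go into the DeltaStreamer property file (or be passed with the `--hoodie-conf` option above), a hypothetical fragment composed only of configs documented on this page might look like:

```properties
# Hypothetical property-file fragment; values are examples, not recommendations.
hoodie.bloom.index.use.treebased.filter=true
hoodie.bloom.index.bucketized.checking=true
hoodie.bloom.index.keys.per.bucket=10000000
hoodie.parquet.compression.codec=gzip
hoodie.commits.archival.batch=10
hoodie.memory.writestatus.failure.fraction=0.1
```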