This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 67e4d74 Fixing docs wrt hive_sync path, record key description and
precombine description (#2511)
67e4d74 is described below
commit 67e4d747f2f221722dd67a7e3a3977ed59ee842c
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu Feb 4 14:09:10 2021 -0500
Fixing docs wrt hive_sync path, record key description and precombine
description (#2511)
---
docs/_docs/0_4_docker_demo.md | 4 ++--
docs/_docs/2_2_writing_data.md | 6 +++---
docs/_docs/2_4_configurations.md | 14 +++++++-------
3 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 6c108eb..d5ec38b 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -208,7 +208,7 @@ in order to run Hive queries against those tables.
docker exec -it adhoc-2 /bin/bash
# This command takes in the HiveServer URL and the COW Hudi table location in HDFS and syncs the HDFS state to Hive
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
@@ -221,7 +221,7 @@ docker exec -it adhoc-2 /bin/bash
.....
# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
table type)
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 5c0a76b..07575f8 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -223,13 +223,13 @@ The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table.
**`DataSourceWriteOptions`**:
-**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields
can be specified using the dot notation eg: `a.b.c`. When using multiple
columns as primary key use comma separated notation, eg:
`"col1,col2,col3,etc"`. Single or multiple columns as primary key specified by
`KEYGENERATOR_CLASS_OPT_KEY` property.<br>
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Record keys uniquely identify a record/row within each partition. If global uniqueness is required, there are two options: you can either make the dataset non-partitioned, or leverage global indexes to ensure record keys are unique irrespective of the partition path. Record keys can either be a single column or refer to multiple columns. The `KEYGENERATOR_CLASS_OPT_KEY` property should be set accordingly based o [...]
Default value: `"uuid"`<br>
-**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for
partitioning the table. To prevent partitioning, provide empty string as value
eg: `""`. Specify partitioning/no partitioning using
`KEYGENERATOR_CLASS_OPT_KEY`. If synchronizing to hive, also specify using
`HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY.`<br>
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `KEYGENERATOR_CLASS_OPT_KEY`. If the partition path needs to be URL encoded, you can set `URL_ENCODE_PARTITIONING_OPT_KEY`. If synchronizing to Hive, also specify using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`.<br>
Default value: `"partitionpath"`<br>
-**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key
value, the record with the largest value from the field specified will be
choosen.<br>
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records within the same batch have the same key value, the record with the largest value for the specified field will be chosen. If you are using the default `OverwriteWithLatestAvroPayload` payload for `HoodieRecordPayload` (`WRITE_PAYLOAD_CLASS`), an incoming record will always take precedence over the one in storage, ignoring this `PRECOMBINE_FIELD_OPT_KEY`.<br>
Default value: `"ts"`<br>
**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use.<br>
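Taken together, the options in this hunk map onto a single Spark datasource write. The following is a minimal sketch, assuming the standard `hoodie.datasource.write.*` property names that back these `DataSourceWriteOptions` constants; the values are illustrative and not part of this patch:

```python
# Illustrative Hudi write options (assumed property names for the
# DataSourceWriteOptions constants described above; values are examples).
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",               # RECORDKEY_FIELD_OPT_KEY
    "hoodie.datasource.write.partitionpath.field": "partitionpath",  # PARTITIONPATH_FIELD_OPT_KEY
    "hoodie.datasource.write.precombine.field": "ts",                # PRECOMBINE_FIELD_OPT_KEY
    "hoodie.datasource.write.operation": "upsert",                   # OPERATION_OPT_KEY
}

# In a live Spark session this dict would be handed to the writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

With the default payload noted above, the precombine field only arbitrates between records inside the same incoming batch.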
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index ad33e66..ff25998 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -278,13 +278,17 @@ Property: `hoodie.index.type` <br/>
#### Bloom Index configs
+#### bloomIndexFilterType(filterType = BloomFilterTypeCode.SIMPLE) {#bloomIndexFilterType}
+Property: `hoodie.bloom.index.filter.type` <br/>
+<span style="color:grey">Filter type used. Default is BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE, BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto-size themselves based on the number of keys.</span>
+
#### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
Property: `hoodie.index.bloom.num_entries` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the
number of entries to be stored in the bloom filter. We assume the
maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx
a total of 130K records in a file. The default (60000) is roughly half of this
approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks
computing this dynamically. Warning: Setting this very low, will generate a lot
of false positives and index l [...]
+<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and the averageRecordSize is 1024B, and hence approximate a total of 130K records in a file. The default (60000) is roughly half of this approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks computing this dynamically. Warning: Setting this very low will generate a lot of false positives and index l [...]
#### bloomFilterFPP(fpp = 0.000000001) {#bloomFilterFPP}
Property: `hoodie.index.bloom.fpp` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate
allowed given the number of entries. This is used to calculate how many bits
should be assigned for the bloom filter and the number of hash functions. This
is usually set very low (default: 0.000000001), we like to tradeoff disk space
for lower false positives</span>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001); we prefer to trade off disk space for fewer false positives. If the number of entries added to the bloom filter exceeds the configured value (`hoodie.index.bloom.num_entries`), then this fpp may not be honored.</span>
#### bloomIndexParallelism(0) {#bloomIndexParallelism}
Property: `hoodie.bloom.index.parallelism` <br/>
@@ -292,7 +296,7 @@ Property: `hoodie.bloom.index.parallelism` <br/>
#### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
Property: `hoodie.bloom.index.prune.by.ranges` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
range information from files to leveraged speed up index lookups. Particularly
helpful, if the key has a monotonously increasing prefix, such as
timestamp.</span>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files is leveraged to speed up index lookups. Particularly helpful if the key has a monotonically increasing prefix, such as a timestamp. If the record key is completely random, it is better to turn this off.</span>
#### bloomIndexUseCaching(useCaching = true) {#bloomIndexUseCaching}
Property: `hoodie.bloom.index.use.caching` <br/>
@@ -306,10 +310,6 @@ Property: `hoodie.bloom.index.use.treebased.filter` <br/>
Property: `hoodie.bloom.index.bucketized.checking` <br/>
<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
bucketized bloom filtering is enabled. This reduces skew seen in sort based
bloom index lookup</span>
-#### bloomIndexFilterType(bucketizedChecking = BloomFilterTypeCode.SIMPLE)
{#bloomIndexFilterType}
-Property: `hoodie.bloom.index.filter.type` <br/>
-<span style="color:grey">Filter type used. Default is
BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE ,
BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto size themselves
based on number of keys</span>
-
#### bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries = 100000)
{#bloomIndexFilterDynamicMaxEntries}
Property: `hoodie.bloom.index.filter.dynamic.max.entries` <br/>
<span style="color:grey">The threshold for the maximum number of keys to
record in a dynamic Bloom filter row. Only applies if filter type is
BloomFilterTypeCode.DYNAMIC_V0.</span>
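As a rough sketch of how the bloom index settings touched by this patch fit together, here is an illustrative configuration map. The property names are the ones quoted in the section above; the values are examples only, assuming these are supplied as plain string write options:

```python
# Illustrative bloom index configuration (property names quoted in the
# configurations section above; values are examples, not recommendations).
bloom_index_configs = {
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",            # auto-sizes with key count
    "hoodie.bloom.index.filter.dynamic.max.entries": "100000", # cap per dynamic filter row
    "hoodie.index.bloom.num_entries": "60000",                 # ~half of the ~130K records/file estimate
    "hoodie.index.bloom.fpp": "0.000000001",                   # trade disk space for fewer false positives
    "hoodie.bloom.index.prune.by.ranges": "true",              # disable for fully random keys
}
```

Note that with `DYNAMIC_V0` the filter grows up to the dynamic max-entries cap, so the fpp setting is honored only while the entry count stays within the configured bound.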