This is an automated email from the ASF dual-hosted git repository.
nagarwal pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 67e4d74 Fixing docs wrt hive_sync path, record key description and
precombine description (#2511)
67e4d74 is described below
commit 67e4d747f2f221722dd67a7e3a3977ed59ee842c
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Thu Feb 4 14:09:10 2021 -0500
Fixing docs wrt hive_sync path, record key description and precombine
description (#2511)
---
docs/_docs/0_4_docker_demo.md | 4 ++--
docs/_docs/2_2_writing_data.md | 6 +++---
docs/_docs/2_4_configurations.md | 14 +++++++-------
3 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 6c108eb..d5ec38b 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -208,7 +208,7 @@ in order to run Hive queries against those tables.
docker exec -it adhoc-2 /bin/bash
# This command takes in the HiveServer URL and the COW Hudi table location in HDFS and syncs the HDFS state to Hive
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
@@ -221,7 +221,7 @@ docker exec -it adhoc-2 /bin/bash
.....
# Now run hive-sync for the second data-set in HDFS using Merge-On-Read (MOR
table type)
-/var/hoodie/ws/hudi-hive-sync/run_sync_tool.sh \
+/var/hoodie/ws/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://hiveserver:10000 \
--user hive \
--pass hive \
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 5c0a76b..07575f8 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -223,13 +223,13 @@ The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table.
**`DataSourceWriteOptions`**:
-**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields
can be specified using the dot notation eg: `a.b.c`. When using multiple
columns as primary key use comma separated notation, eg:
`"col1,col2,col3,etc"`. Single or multiple columns as primary key specified by
`KEYGENERATOR_CLASS_OPT_KEY` property.<br>
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Record keys uniquely identify a record/row within each partition. If global uniqueness is required, there are two options: you can either make the dataset non-partitioned, or leverage global indexes to ensure record keys are unique irrespective of the partition path. Record keys can either be a single column or refer to multiple columns. The `KEYGENERATOR_CLASS_OPT_KEY` property should be set accordingly based o [...]
Default value: `"uuid"`<br>
-**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for
partitioning the table. To prevent partitioning, provide empty string as value
eg: `""`. Specify partitioning/no partitioning using
`KEYGENERATOR_CLASS_OPT_KEY`. If synchronizing to hive, also specify using
`HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY.`<br>
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `KEYGENERATOR_CLASS_OPT_KEY`. If the partition path needs to be URL encoded, you can set `URL_ENCODE_PARTITIONING_OPT_KEY`. If synchronizing to Hive, also specify using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`.<br>
Default value: `"partitionpath"`<br>
-**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key
value, the record with the largest value from the field specified will be
choosen.<br>
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records within the same batch have the same key value, the record with the largest value for the specified field will be chosen. If you are using the default `OverwriteWithLatestAvroPayload` payload for `HoodieRecordPayload` (`WRITE_PAYLOAD_CLASS`), an incoming record will always take precedence over the one in storage, ignoring this `PRECOMBINE_FIELD_OPT_KEY`.<br>
Default value: `"ts"`<br>
**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use.<br>
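Taken together, the options in this hunk map onto a single Spark datasource write. The following is a minimal sketch, assuming the standard `hoodie.datasource.write.*` property names that back these `DataSourceWriteOptions` constants; the values are illustrative and not part of this patch:

```python
# Illustrative Hudi write options (assumed property names for the
# DataSourceWriteOptions constants described above; values are examples).
hudi_options = {
    "hoodie.datasource.write.recordkey.field": "uuid",               # RECORDKEY_FIELD_OPT_KEY
    "hoodie.datasource.write.partitionpath.field": "partitionpath",  # PARTITIONPATH_FIELD_OPT_KEY
    "hoodie.datasource.write.precombine.field": "ts",                # PRECOMBINE_FIELD_OPT_KEY
    "hoodie.datasource.write.operation": "upsert",                   # OPERATION_OPT_KEY
}

# In a live Spark session this dict would be handed to the writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

With the default payload noted above, the precombine field only arbitrates between records inside the same incoming batch.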
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index ad33e66..ff25998 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -278,13 +278,17 @@ Property: `hoodie.index.type` <br/>
#### Bloom Index configs
+#### bloomIndexFilterType(filterType = BloomFilterTypeCode.SIMPLE) {#bloomIndexFilterType}
+Property: `hoodie.bloom.index.filter.type` <br/>
+<span style="color:grey">Filter type used. Default is BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE, BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto-size themselves based on the number of keys.</span>
+
#### bloomFilterNumEntries(numEntries = 60000) {#bloomFilterNumEntries}
Property: `hoodie.index.bloom.num_entries` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the
number of entries to be stored in the bloom filter. We assume the
maxParquetFileSize is 128MB and averageRecordSize is 1024B and hence we approx
a total of 130K records in a file. The default (60000) is roughly half of this
approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks
computing this dynamically. Warning: Setting this very low, will generate a lot
of false positives and index l [...]
+<span style="color:grey">Only applies if index type is BLOOM. <br/>This is the number of entries to be stored in the bloom filter. We assume the maxParquetFileSize is 128MB and the averageRecordSize is 1024B, and hence approximate a total of 130K records in a file. The default (60000) is roughly half of this approximation. [HUDI-56](https://issues.apache.org/jira/browse/HUDI-56) tracks computing this dynamically. Warning: Setting this very low will generate a lot of false positives and index l [...]
#### bloomFilterFPP(fpp = 0.000000001) {#bloomFilterFPP}
Property: `hoodie.index.bloom.fpp` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate
allowed given the number of entries. This is used to calculate how many bits
should be assigned for the bloom filter and the number of hash functions. This
is usually set very low (default: 0.000000001), we like to tradeoff disk space
for lower false positives</span>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> Error rate allowed given the number of entries. This is used to calculate how many bits should be assigned for the bloom filter and the number of hash functions. This is usually set very low (default: 0.000000001); we prefer to trade off disk space for fewer false positives. If the number of entries added to the bloom filter exceeds the configured value (`hoodie.index.bloom.num_entries`), then this fpp may not be honored.</span>
#### bloomIndexParallelism(0) {#bloomIndexParallelism}
Property: `hoodie.bloom.index.parallelism` <br/>
@@ -292,7 +296,7 @@ Property: `hoodie.bloom.index.parallelism` <br/>
#### bloomIndexPruneByRanges(pruneRanges = true) {#bloomIndexPruneByRanges}
Property: `hoodie.bloom.index.prune.by.ranges` <br/>
-<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
range information from files to leveraged speed up index lookups. Particularly
helpful, if the key has a monotonously increasing prefix, such as
timestamp.</span>
+<span style="color:grey">Only applies if index type is BLOOM. <br/> When true, range information from files is leveraged to speed up index lookups. Particularly helpful if the key has a monotonically increasing prefix, such as a timestamp. If the record key is completely random, it is better to turn this off.</span>
#### bloomIndexUseCaching(useCaching = true) {#bloomIndexUseCaching}
Property: `hoodie.bloom.index.use.caching` <br/>
@@ -306,10 +310,6 @@ Property: `hoodie.bloom.index.use.treebased.filter` <br/>
Property: `hoodie.bloom.index.bucketized.checking` <br/>
<span style="color:grey">Only applies if index type is BLOOM. <br/> When true,
bucketized bloom filtering is enabled. This reduces skew seen in sort based
bloom index lookup</span>
-#### bloomIndexFilterType(bucketizedChecking = BloomFilterTypeCode.SIMPLE)
{#bloomIndexFilterType}
-Property: `hoodie.bloom.index.filter.type` <br/>
-<span style="color:grey">Filter type used. Default is
BloomFilterTypeCode.SIMPLE. Available values are [BloomFilterTypeCode.SIMPLE ,
BloomFilterTypeCode.DYNAMIC_V0]. Dynamic bloom filters auto size themselves
based on number of keys</span>
-
#### bloomIndexFilterDynamicMaxEntries(maxNumberOfEntries = 100000)
{#bloomIndexFilterDynamicMaxEntries}
Property: `hoodie.bloom.index.filter.dynamic.max.entries` <br/>
<span style="color:grey">The threshold for the maximum number of keys to
record in a dynamic Bloom filter row. Only applies if filter type is
BloomFilterTypeCode.DYNAMIC_V0.</span>
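As a rough sketch of how the bloom index settings touched by this patch fit together, here is an illustrative configuration map. The property names are the ones quoted in the section above; the values are examples only, assuming these are supplied as plain string write options:

```python
# Illustrative bloom index configuration (property names quoted in the
# configurations section above; values are examples, not recommendations).
bloom_index_configs = {
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",            # auto-sizes with key count
    "hoodie.bloom.index.filter.dynamic.max.entries": "100000", # cap per dynamic filter row
    "hoodie.index.bloom.num_entries": "60000",                 # ~half of the ~130K records/file estimate
    "hoodie.index.bloom.fpp": "0.000000001",                   # trade disk space for fewer false positives
    "hoodie.bloom.index.prune.by.ranges": "true",              # disable for fully random keys
}
```

Note that with `DYNAMIC_V0` the filter grows up to the dynamic max-entries cap, so the fpp setting is honored only while the entry count stays within the configured bound.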