This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new eb9b9ca  [HUDI-1782] Add more options for HUDI Flink (#2786)
eb9b9ca is described below

commit eb9b9ca286b638312e6af0294a4b072c62ab0ddf
Author: Danny Chan <yuzhao....@gmail.com>
AuthorDate: Mon Apr 12 14:25:28 2021 +0800

    [HUDI-1782] Add more options for HUDI Flink (#2786)
---
 docs/_docs/1_6_flink_quick_start_guide.md | 14 +++++++-------
 docs/_docs/2_3_querying_data.md           |  4 ++--
 docs/_docs/2_4_configurations.md          | 30 ++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/docs/_docs/1_6_flink_quick_start_guide.md b/docs/_docs/1_6_flink_quick_start_guide.md
index c16028a..134b95b 100644
--- a/docs/_docs/1_6_flink_quick_start_guide.md
+++ b/docs/_docs/1_6_flink_quick_start_guide.md
@@ -16,8 +16,8 @@ We use the [Flink Sql Client](https://ci.apache.org/projects/flink/flink-docs-st
 quick start tool for SQL users.
 
 ### Step.1 download flink jar
-Hudi works with Flink-1.11.x version. You can follow instructions [here](https://flink.apache.org/downloads.html) for setting up flink.
-The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to use flink 1.11 bundled with scala 2.11.
+Hudi works with Flink-1.12.x version. You can follow instructions [here](https://flink.apache.org/downloads.html) for setting up flink.
+The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to use flink 1.12.x bundled with scala 2.11.
 
 ### Step.2 start flink cluster
 Start a standalone flink cluster within hadoop environment.
@@ -70,7 +70,7 @@ Creates a flink hudi table first and insert data into the Hudi table using SQL `
 set execution.result-mode=tableau;
 
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -79,7 +79,7 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'schema://base-path',
+  'path' = 'table_base_path',
   'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, by default is COPY_ON_WRITE
 );
@@ -129,7 +129,7 @@ We do not need to specify endTime, if we want all changes after the given commit
 
 ```sql
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -138,10 +138,10 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'oss://vvr-daily/hudi/t1',
+  'path' = 'table_base_path',
   'table.type' = 'MERGE_ON_READ',
   'read.streaming.enabled' = 'true', -- this option enable the streaming read
-  'read.streaming.start-commit' = '20210316134557' -- specifies the start commit instant time
+  'read.streaming.start-commit' = '20210316134557', -- specifies the start commit instant time
   'read.streaming.check-interval' = '4' -- specifies the check interval for finding new source commits, default 60s.
 );
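The `PRIMARY KEY NOT ENFORCED` syntax referenced by the new column comments is standard Flink DDL. A minimal sketch (not part of the commit) of declaring the record key that way, reusing the `t1` schema and the placeholder path from the hunks above:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20),
  PRIMARY KEY (uuid) NOT ENFORCED  -- marks uuid as the record key field
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',      -- placeholder, as in the docs above
  'table.type' = 'MERGE_ON_READ'
);
```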
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 112e497..799adfa 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -194,7 +194,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
 ```sql
 -- this defines a COPY_ON_WRITE table named 't1'
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to specify the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -203,7 +203,7 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'schema://base-path'
+  'path' = 'table_base_path'
 );
 
 -- query the data
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index d8f0c90..9dfe76d 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -190,6 +190,7 @@ The actual datasource level configs are listed below.
 | `write.ignore.failed` | N | true | <span style="color:grey"> Flag to indicate whether to ignore any non exception error (e.g. writestatus error) within a checkpoint batch. By default true (in favor of streaming progressing over data integrity) </span> |
 | `hoodie.datasource.write.recordkey.field` | N | uuid | <span style="color:grey"> Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` </span> |
 | `hoodie.datasource.write.keygenerator.class` | N | SimpleAvroKeyGenerator.class | <span style="color:grey"> Key generator class, that implements will extract the key out of incoming record </span> |
+| `write.partition.url_encode` | N | false | Whether to url-encode the partition path, default false |
 | `write.tasks` | N | 4 | <span style="color:grey"> Parallelism of tasks that do actual write, default is 4 </span> |
 | `write.batch.size.MB` | N | 128 | <span style="color:grey"> Batch buffer size in MB to flush data into the underneath filesystem </span> |
 
@@ -201,6 +202,9 @@ If the table type is MERGE_ON_READ, you can also specify the asynchronous compac
 | `compaction.trigger.strategy` | N | num_commits | <span style="color:grey"> Strategy to trigger compaction, options are 'num_commits': trigger compaction when reach N delta commits; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits' </span> |
 | `compaction.delta_commits` | N | 5 | <span style="color:grey"> Max delta commits needed to trigger compaction, default 5 commits </span> |
 | `compaction.delta_seconds` | N | 3600 | <span style="color:grey"> Max delta seconds time needed to trigger compaction, default 1 hour </span> |
+| `compaction.max_memory` | N | 100 | Max memory in MB for the compaction spillable map, default 100MB |
+| `clean.async.enabled` | N | true | Whether to clean up the old commits immediately on new commits, enabled by default |
+| `clean.retain_commits` | N | 10 | Number of commits to retain, so data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 10 |
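For illustration, a sketch (not part of the commit) of how the compaction and cleaning options added above might be combined in a MERGE_ON_READ table definition; the path is a placeholder and the values shown are the documented defaults:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3)
)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',                     -- placeholder base path
  'table.type' = 'MERGE_ON_READ',
  'compaction.trigger.strategy' = 'num_commits',  -- default trigger strategy
  'compaction.delta_commits' = '5',               -- compact after 5 delta commits
  'compaction.max_memory' = '100',                -- spillable map budget in MB
  'clean.async.enabled' = 'true',                 -- clean old commits on new commits
  'clean.retain_commits' = '10'                   -- bounds incremental pull history
);
```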
 
 ### Read Options
 
@@ -224,6 +228,32 @@ If the table type is MERGE_ON_READ, streaming read is supported through options:
 | `read.streaming.check-interval` | N | 60 | <span style="color:grey"> Check interval for streaming read of SECOND, default 1 minute </span> |
 | `read.streaming.start-commit` | N | N/A | <span style="color:grey"> Start commit instant for streaming read, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant </span> |
 
+### Index sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `index.bootstrap.enabled` | N | false | Whether to bootstrap the index state from the existing hoodie table, default false |
+
+### Hive sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `hive_sync.enable` | N | false | Asynchronously sync Hive meta to HMS, default false |
+| `hive_sync.db` | N | default | Database name for hive sync, default 'default' |
+| `hive_sync.table` | N | unknown | Table name for hive sync, default 'unknown' |
+| `hive_sync.file_format` | N | PARQUET | File format for hive sync, default 'PARQUET' |
+| `hive_sync.username` | N | hive | Username for hive sync, default 'hive' |
+| `hive_sync.password` | N | hive | Password for hive sync, default 'hive' |
+| `hive_sync.jdbc_url` | N | jdbc:hive2://localhost:10000 | Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000' |
+| `hive_sync.partition_fields` | N | '' | Partition fields for hive sync, default '' |
+| `hive_sync.partition_extractor_class` | N | SlashEncodedDayPartitionValueExtractor.class | Tool to extract the partition value from HDFS path, default 'SlashEncodedDayPartitionValueExtractor' |
+| `hive_sync.assume_date_partitioning` | N | false | Assume partitioning is yyyy/mm/dd, default false |
+| `hive_sync.use_jdbc` | N | true | Use JDBC when hive synchronization is enabled, default true |
+| `hive_sync.auto_create_db` | N | true | Auto create hive database if it does not exist, default true |
+| `hive_sync.ignore_exceptions` | N | false | Ignore exceptions during hive synchronization, default false |
+| `hive_sync.skip_ro_suffix` | N | false | Skip the _ro suffix for Read optimized table when registering, default false |
+| `hive_sync.support_timestamp` | N | false | INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility. |
+
 ## WriteClient Configs {#writeclient-configs}
 
 Jobs programming directly against the RDD level apis can build a `HoodieWriteConfig` object and pass it in to the `HoodieWriteClient` constructor.
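For illustration, a sketch (not part of the commit) showing several of the new hive sync options together; the path, database name, table name, and JDBC URL are placeholders, and the other values are the documented defaults:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',                             -- placeholder base path
  'hive_sync.enable' = 'true',                            -- off by default
  'hive_sync.db' = 'default',                             -- placeholder database
  'hive_sync.table' = 't1',                               -- placeholder table name
  'hive_sync.jdbc_url' = 'jdbc:hive2://localhost:10000',  -- placeholder HMS endpoint
  'hive_sync.partition_fields' = 'partition'              -- fields synced as hive partitions
);
```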