This is an automated email from the ASF dual-hosted git repository.
vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new eb9b9ca [HUDI-1782] Add more options for HUDI Flink (#2786)
eb9b9ca is described below
commit eb9b9ca286b638312e6af0294a4b072c62ab0ddf
Author: Danny Chan <[email protected]>
AuthorDate: Mon Apr 12 14:25:28 2021 +0800
[HUDI-1782] Add more options for HUDI Flink (#2786)
---
docs/_docs/1_6_flink_quick_start_guide.md | 14 +++++++-------
docs/_docs/2_3_querying_data.md | 4 ++--
docs/_docs/2_4_configurations.md | 30 ++++++++++++++++++++++++++++++
3 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/docs/_docs/1_6_flink_quick_start_guide.md
b/docs/_docs/1_6_flink_quick_start_guide.md
index c16028a..134b95b 100644
--- a/docs/_docs/1_6_flink_quick_start_guide.md
+++ b/docs/_docs/1_6_flink_quick_start_guide.md
@@ -16,8 +16,8 @@ We use the [Flink Sql
Client](https://ci.apache.org/projects/flink/flink-docs-st
quick start tool for SQL users.
### Step.1 download flink jar
-Hudi works with Flink-1.11.x version. You can follow instructions
[here](https://flink.apache.org/downloads.html) for setting up flink.
-The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to
use flink 1.11 bundled with scala 2.11.
+Hudi works with Flink-1.12.x version. You can follow instructions
[here](https://flink.apache.org/downloads.html) for setting up flink.
+The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to
use flink 1.12.x bundled with scala 2.11.
### Step.2 start flink cluster
Start a standalone flink cluster within hadoop environment.
@@ -70,7 +70,7 @@ Creates a flink hudi table first and insert data into the
Hudi table using SQL `
set execution.result-mode=tableau;
CREATE TABLE t1(
- uuid VARCHAR(20),
+ uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark
the field as record key
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
@@ -79,7 +79,7 @@ CREATE TABLE t1(
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
- 'path' = 'schema://base-path',
+ 'path' = 'table_base_path',
'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table; the default is COPY_ON_WRITE
);
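The comment added in this hunk points at Flink's `PRIMARY KEY ... NOT ENFORCED` syntax without showing it. As a hedged sketch (reusing the doc's own placeholder table `t1` and `table_base_path`), the record key could equally be declared as a table constraint:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20),
  PRIMARY KEY(uuid) NOT ENFORCED  -- marks uuid as the record key
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',
  'table.type' = 'MERGE_ON_READ'
);
```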
@@ -129,7 +129,7 @@ We do not need to specify endTime, if we want all changes
after the given commit
```sql
CREATE TABLE t1(
- uuid VARCHAR(20),
+ uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark
the field as record key
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
@@ -138,10 +138,10 @@ CREATE TABLE t1(
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
- 'path' = 'oss://vvr-daily/hudi/t1',
+ 'path' = 'table_base_path',
'table.type' = 'MERGE_ON_READ',
'read.streaming.enabled' = 'true', -- this option enables the streaming read
- 'read.streaming.start-commit' = '20210316134557' -- specifies the start
commit instant time
+ 'read.streaming.start-commit' = '20210316134557', -- specifies the start
commit instant time
'read.streaming.check-interval' = '4' -- specifies the check interval for
finding new source commits, default 60s.
);
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 112e497..799adfa 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -194,7 +194,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
```sql
-- this defines a COPY_ON_WRITE table named 't1'
CREATE TABLE t1(
- uuid VARCHAR(20),
+ uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to
specify the field as record key
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
@@ -203,7 +203,7 @@ CREATE TABLE t1(
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
- 'path' = 'schema://base-path'
+ 'path' = 'table_base_path'
);
-- query the data
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index d8f0c90..9dfe76d 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -190,6 +190,7 @@ The actual datasource level configs are listed below.
| `write.ignore.failed` | N | true | <span style="color:grey"> Flag to indicate whether to ignore any non-exception error (e.g. writestatus error) within a checkpoint batch. By default true (in favor of streaming progressing over data integrity) </span> |
| `hoodie.datasource.write.recordkey.field` | N | uuid | <span
style="color:grey"> Record key field. Value to be used as the `recordKey`
component of `HoodieKey`. Actual value will be obtained by invoking .toString()
on the field value. Nested fields can be specified using the dot notation eg:
`a.b.c` </span> |
| `hoodie.datasource.write.keygenerator.class` | N | SimpleAvroKeyGenerator.class | <span style="color:grey"> Key generator class that will extract the key out of the incoming record </span> |
+| `write.partition.url_encode` | N | false | <span style="color:grey"> Whether to URL-encode the partition path, default false </span> |
| `write.tasks` | N | 4 | <span style="color:grey"> Parallelism of tasks that
do actual write, default is 4 </span> |
| `write.batch.size.MB` | N | 128 | <span style="color:grey"> Batch buffer size in MB to flush data into the underlying filesystem </span> |
@@ -201,6 +202,9 @@ If the table type is MERGE_ON_READ, you can also specify
the asynchronous compac
| `compaction.trigger.strategy` | N | num_commits | <span style="color:grey">
Strategy to trigger compaction, options are 'num_commits': trigger compaction
when reach N delta commits; 'time_elapsed': trigger compaction when time
elapsed > N seconds since last compaction; 'num_and_time': trigger compaction
when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger
compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is
'num_commits' </span> |
| `compaction.delta_commits` | N | 5 | <span style="color:grey"> Max delta
commits needed to trigger compaction, default 5 commits </span> |
| `compaction.delta_seconds` | N | 3600 | <span style="color:grey"> Max delta
seconds time needed to trigger compaction, default 1 hour </span> |
+| `compaction.max_memory` | N | 100 | <span style="color:grey"> Max memory in MB for the compaction spillable map, default 100MB </span> |
+| `clean.async.enabled` | N | true | <span style="color:grey"> Whether to clean up the old commits immediately on new commits, enabled by default </span> |
+| `clean.retain_commits` | N | 10 | <span style="color:grey"> Number of commits to retain, so data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 10 </span> |
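The compaction and cleaning options added above are plain table options, so they go into the `WITH` clause of the DDL alongside the connector settings. A sketch with illustrative values (the doc's placeholder path; all values are the documented defaults except the trigger strategy):

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',
  'table.type' = 'MERGE_ON_READ',
  'compaction.trigger.strategy' = 'num_or_time', -- compact when either threshold is hit
  'compaction.delta_commits' = '5',
  'compaction.delta_seconds' = '3600',
  'compaction.max_memory' = '100',               -- MB for the compaction spillable map
  'clean.async.enabled' = 'true',
  'clean.retain_commits' = '10'                  -- also bounds the incremental pull window
);
```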
### Read Options
@@ -224,6 +228,32 @@ If the table type is MERGE_ON_READ, streaming read is
supported through options:
| `read.streaming.check-interval` | N | 60 | <span style="color:grey"> Check interval in seconds for streaming read, default 1 minute </span> |
| `read.streaming.start-commit` | N | N/A | <span style="color:grey"> Start
commit instant for streaming read, the commit time format should be
'yyyyMMddHHmmss', by default reading from the latest instant </span> |
+### Index sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `index.bootstrap.enabled` | N | false | Whether to bootstrap the index state
from existing hoodie table, default false |
+
+### Hive sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `hive_sync.enable` | N | false | Asynchronously sync Hive meta to HMS,
default false |
+| `hive_sync.db` | N | default | Database name for hive sync, default
'default' |
+| `hive_sync.table` | N | unknown | Table name for hive sync, default
'unknown' |
+| `hive_sync.file_format` | N | PARQUET | File format for hive sync, default
'PARQUET' |
+| `hive_sync.username` | N | hive | Username for hive sync, default 'hive' |
+| `hive_sync.password` | N | hive | Password for hive sync, default 'hive' |
+| `hive_sync.jdbc_url` | N | jdbc:hive2://localhost:10000 | Jdbc URL for hive
sync, default 'jdbc:hive2://localhost:10000' |
+| `hive_sync.partition_fields` | N | '' | Partition fields for hive sync,
default '' |
+| `hive_sync.partition_extractor_class` | N |
SlashEncodedDayPartitionValueExtractor.class | Tool to extract the partition
value from HDFS path, default 'SlashEncodedDayPartitionValueExtractor' |
+| `hive_sync.assume_date_partitioning` | N | false | Assume partitioning is
yyyy/mm/dd, default false |
+| `hive_sync.use_jdbc` | N | true | Use JDBC when hive synchronization is
enabled, default true |
+| `hive_sync.auto_create_db` | N | true | Auto create hive database if it does not exist, default true |
+| `hive_sync.ignore_exceptions` | N | false | Ignore exceptions during hive
synchronization, default false |
+| `hive_sync.skip_ro_suffix` | N | false | Skip the _ro suffix for Read
optimized table when registering, default false |
+| `hive_sync.support_timestamp` | N | false | INT64 with original type
TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for
backward compatibility. |
+
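The hive sync options above likewise go into the `WITH` clause. A minimal sketch with the documented defaults spelled out (the JDBC URL and credentials are the table's placeholder defaults, not a recommendation):

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',
  'hive_sync.enable' = 'true',            -- turn on the async Hive meta sync
  'hive_sync.db' = 'default',
  'hive_sync.table' = 't1',
  'hive_sync.jdbc_url' = 'jdbc:hive2://localhost:10000',
  'hive_sync.username' = 'hive',
  'hive_sync.password' = 'hive',
  'hive_sync.partition_fields' = 'partition'
);
```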
## WriteClient Configs {#writeclient-configs}
Jobs programming directly against the RDD level apis can build a
`HoodieWriteConfig` object and pass it in to the `HoodieWriteClient`
constructor.