This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new eb9b9ca  [HUDI-1782] Add more options for HUDI Flink (#2786)
eb9b9ca is described below

commit eb9b9ca286b638312e6af0294a4b072c62ab0ddf
Author: Danny Chan <yuzhao....@gmail.com>
AuthorDate: Mon Apr 12 14:25:28 2021 +0800

    [HUDI-1782] Add more options for HUDI Flink (#2786)
---
 docs/_docs/1_6_flink_quick_start_guide.md | 14 +++++++-------
 docs/_docs/2_3_querying_data.md           |  4 ++--
 docs/_docs/2_4_configurations.md          | 30 ++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/docs/_docs/1_6_flink_quick_start_guide.md b/docs/_docs/1_6_flink_quick_start_guide.md
index c16028a..134b95b 100644
--- a/docs/_docs/1_6_flink_quick_start_guide.md
+++ b/docs/_docs/1_6_flink_quick_start_guide.md
@@ -16,8 +16,8 @@ We use the [Flink Sql Client](https://ci.apache.org/projects/flink/flink-docs-st
 quick start tool for SQL users.
 
 ### Step.1 download flink jar
-Hudi works with Flink-1.11.x version. You can follow instructions [here](https://flink.apache.org/downloads.html) for setting up flink.
-The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to use flink 1.11 bundled with scala 2.11.
+Hudi works with Flink-1.12.x version. You can follow instructions [here](https://flink.apache.org/downloads.html) for setting up flink.
+The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to use flink 1.12.x bundled with scala 2.11.
 
 ### Step.2 start flink cluster
 Start a standalone flink cluster within hadoop environment.
@@ -70,7 +70,7 @@ Creates a flink hudi table first and insert data into the Hudi table using SQL `
 set execution.result-mode=tableau;
 
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -79,7 +79,7 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'schema://base-path',
+  'path' = 'table_base_path',
   'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, by default is COPY_ON_WRITE
 );
@@ -129,7 +129,7 @@ We do not need to specify endTime, if we want all changes after the given commit
 
 ```sql
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -138,10 +138,10 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'oss://vvr-daily/hudi/t1',
+  'path' = 'table_base_path',
   'table.type' = 'MERGE_ON_READ',
   'read.streaming.enabled' = 'true', -- this option enable the streaming read
-  'read.streaming.start-commit' = '20210316134557' -- specifies the start commit instant time
+  'read.streaming.start-commit' = '20210316134557', -- specifies the start commit instant time
   'read.streaming.check-interval' = '4' -- specifies the check interval for finding new source commits, default 60s.
 );
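The `PRIMARY KEY NOT ENFORCED` syntax referenced by the new column comments is standard Flink DDL. A minimal sketch (not part of the commit) of declaring the record key that way, reusing the `t1` schema and the placeholder path from the hunks above:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20),
  PRIMARY KEY (uuid) NOT ENFORCED  -- marks uuid as the record key field
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',      -- placeholder, as in the docs above
  'table.type' = 'MERGE_ON_READ'
);
```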
diff --git a/docs/_docs/2_3_querying_data.md b/docs/_docs/2_3_querying_data.md
index 112e497..799adfa 100644
--- a/docs/_docs/2_3_querying_data.md
+++ b/docs/_docs/2_3_querying_data.md
@@ -194,7 +194,7 @@ export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
 ```sql
 -- this defines a COPY_ON_WRITE table named 't1'
 CREATE TABLE t1(
-  uuid VARCHAR(20),
+  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to specify the field as record key
   name VARCHAR(10),
   age INT,
   ts TIMESTAMP(3),
@@ -203,7 +203,7 @@ CREATE TABLE t1(
 PARTITIONED BY (`partition`)
 WITH (
   'connector' = 'hudi',
-  'path' = 'schema://base-path'
+  'path' = 'table_base_path'
 );
 
 -- query the data
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index d8f0c90..9dfe76d 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -190,6 +190,7 @@ The actual datasource level configs are listed below.
 | `write.ignore.failed` | N | true | <span style="color:grey"> Flag to indicate whether to ignore any non exception error (e.g. writestatus error) within a checkpoint batch. By default true (in favor of streaming progressing over data integrity) </span> |
 | `hoodie.datasource.write.recordkey.field` | N | uuid | <span style="color:grey"> Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c` </span> |
 | `hoodie.datasource.write.keygenerator.class` | N | SimpleAvroKeyGenerator.class | <span style="color:grey"> Key generator class, that implements will extract the key out of incoming record </span> |
+| `write.partition.url_encode` | N | false | Whether to url-encode the partition path, default false |
 | `write.tasks` | N | 4 | <span style="color:grey"> Parallelism of tasks that do actual write, default is 4 </span> |
 | `write.batch.size.MB` | N | 128 | <span style="color:grey"> Batch buffer size in MB to flush data into the underneath filesystem </span> |
 
@@ -201,6 +202,9 @@ If the table type is MERGE_ON_READ, you can also specify the asynchronous compac
 | `compaction.trigger.strategy` | N | num_commits | <span style="color:grey"> Strategy to trigger compaction, options are 'num_commits': trigger compaction when reach N delta commits; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits' </span> |
 | `compaction.delta_commits` | N | 5 | <span style="color:grey"> Max delta commits needed to trigger compaction, default 5 commits </span> |
 | `compaction.delta_seconds` | N | 3600 | <span style="color:grey"> Max delta seconds time needed to trigger compaction, default 1 hour </span> |
+| `compaction.max_memory` | N | 100 | Max memory in MB for the compaction spillable map, default 100MB |
+| `clean.async.enabled` | N | true | Whether to clean up the old commits immediately on new commits, enabled by default |
+| `clean.retain_commits` | N | 10 | Number of commits to retain, so data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 10 |
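For illustration, a sketch (not part of the commit) of how the compaction and cleaning options added above might be combined in a MERGE_ON_READ table definition; the path is a placeholder and the values shown are the documented defaults:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3)
)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',                     -- placeholder base path
  'table.type' = 'MERGE_ON_READ',
  'compaction.trigger.strategy' = 'num_commits',  -- default trigger strategy
  'compaction.delta_commits' = '5',               -- compact after 5 delta commits
  'compaction.max_memory' = '100',                -- spillable map budget in MB
  'clean.async.enabled' = 'true',                 -- clean old commits on new commits
  'clean.retain_commits' = '10'                   -- bounds incremental pull history
);
```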
 
 ### Read Options
 
@@ -224,6 +228,32 @@ If the table type is MERGE_ON_READ, streaming read is supported through options:
 | `read.streaming.check-interval` | N | 60 | <span style="color:grey"> Check interval for streaming read of SECOND, default 1 minute </span> |
 | `read.streaming.start-commit` | N | N/A | <span style="color:grey"> Start commit instant for streaming read, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant </span> |
 
+### Index sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `index.bootstrap.enabled` | N | false | Whether to bootstrap the index state from the existing hoodie table, default false |
+
+### Hive sync options
+
+| Option Name | Required | Default | Remarks |
+| ----------- | ------- | ------- | ------- |
+| `hive_sync.enable` | N | false | Asynchronously sync Hive meta to HMS, default false |
+| `hive_sync.db` | N | default | Database name for hive sync, default 'default' |
+| `hive_sync.table` | N | unknown | Table name for hive sync, default 'unknown' |
+| `hive_sync.file_format` | N | PARQUET | File format for hive sync, default 'PARQUET' |
+| `hive_sync.username` | N | hive | Username for hive sync, default 'hive' |
+| `hive_sync.password` | N | hive | Password for hive sync, default 'hive' |
+| `hive_sync.jdbc_url` | N | jdbc:hive2://localhost:10000 | Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000' |
+| `hive_sync.partition_fields` | N | '' | Partition fields for hive sync, default '' |
+| `hive_sync.partition_extractor_class` | N | SlashEncodedDayPartitionValueExtractor.class | Tool to extract the partition value from HDFS path, default 'SlashEncodedDayPartitionValueExtractor' |
+| `hive_sync.assume_date_partitioning` | N | false | Assume partitioning is yyyy/mm/dd, default false |
+| `hive_sync.use_jdbc` | N | true | Use JDBC when hive synchronization is enabled, default true |
+| `hive_sync.auto_create_db` | N | true | Auto create hive database if it does not exist, default true |
+| `hive_sync.ignore_exceptions` | N | false | Ignore exceptions during hive synchronization, default false |
+| `hive_sync.skip_ro_suffix` | N | false | Skip the _ro suffix for Read optimized table when registering, default false |
+| `hive_sync.support_timestamp` | N | false | INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility. |
+
 ## WriteClient Configs {#writeclient-configs}
 
 Jobs programming directly against the RDD level apis can build a `HoodieWriteConfig` object and pass it in to the `HoodieWriteClient` constructor.
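For illustration, a sketch (not part of the commit) showing several of the new hive sync options together; the path, database name, table name, and JDBC URL are placeholders, and the other values are the documented defaults:

```sql
CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'table_base_path',                             -- placeholder base path
  'hive_sync.enable' = 'true',                            -- off by default
  'hive_sync.db' = 'default',                             -- placeholder database
  'hive_sync.table' = 't1',                               -- placeholder table name
  'hive_sync.jdbc_url' = 'jdbc:hive2://localhost:10000',  -- placeholder HMS endpoint
  'hive_sync.partition_fields' = 'partition'              -- fields synced as hive partitions
);
```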