This is an automated email from the ASF dual-hosted git repository.
liugddx pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 659a68a0be [Doc][Iceberg] Improved iceberg documentation (#5335)
659a68a0be is described below
commit 659a68a0be295350891a844d8d903bdbbc3c5b4f
Author: Carl-Zhou-CN <[email protected]>
AuthorDate: Wed Aug 23 22:55:08 2023 +0800
[Doc][Iceberg] Improved iceberg documentation (#5335)
* [Doc][Iceberg] Improved iceberg documentation
* [Doc][Iceberg] Improved iceberg documentation
---------
Co-authored-by: zhouyao <[email protected]>
---
docs/en/connector-v2/source/Iceberg.md | 222 ++++++++++-----------
.../seatunnel/iceberg/config/CommonConfig.java | 7 -
2 files changed, 104 insertions(+), 125 deletions(-)
diff --git a/docs/en/connector-v2/source/Iceberg.md b/docs/en/connector-v2/source/Iceberg.md
index 6a42ee0ddd..b6d3924b95 100644
--- a/docs/en/connector-v2/source/Iceberg.md
+++ b/docs/en/connector-v2/source/Iceberg.md
@@ -2,9 +2,15 @@
> Apache Iceberg source connector
-## Description
+## Support Iceberg Version
-Source connector for Apache Iceberg. It can support batch and stream mode.
+- 0.14.0
+
+## Support Those Engines
+
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
## Key features
@@ -22,126 +28,120 @@ Source connector for Apache Iceberg. It can support batch and stream mode.
- [x] hadoop(2.7.1 , 2.7.5 , 3.1.3)
- [x] hive(2.3.9 , 3.1.2)
-## Options
-
-| name | type | required | default value |
-|--------------------------|---------|----------|----------------------|
-| catalog_name | string | yes | - |
-| catalog_type | string | yes | - |
-| uri | string | no | - |
-| warehouse | string | yes | - |
-| namespace | string | yes | - |
-| table | string | yes | - |
-| schema | config | no | - |
-| case_sensitive | boolean | no | false |
-| start_snapshot_timestamp | long | no | - |
-| start_snapshot_id | long | no | - |
-| end_snapshot_id | long | no | - |
-| use_snapshot_id | long | no | - |
-| use_snapshot_timestamp | long | no | - |
-| stream_scan_strategy | enum | no | FROM_LATEST_SNAPSHOT |
-| common-options | | no | - |
-
-### catalog_name [string]
-
-User-specified catalog name.
-
-### catalog_type [string]
-
-The optional values are:
-- hive: The hive metastore catalog.
-- hadoop: The hadoop catalog.
-
-### uri [string]
-
-The Hive metastore’s thrift URI.
-
-### warehouse [string]
-
-The location to store metadata files and data files.
-
-### namespace [string]
-
-The iceberg database name in the backend catalog.
-
-### table [string]
-
-The iceberg table name in the backend catalog.
-
-### case_sensitive [boolean]
+## Description
-If data columns where selected via schema [config], controls whether the match to the schema will be done with case sensitivity.
+Source connector for Apache Iceberg. It can support batch and stream mode.
-### schema [config]
+## Supported DataSource Info
-#### fields [Config]
+| Datasource | Dependent           | Maven                                                                     |
+|------------|---------------------|---------------------------------------------------------------------------|
+| Iceberg    | flink-shaded-hadoop | [Download](https://mvnrepository.com/search?q=flink-shaded-hadoop-)       |
+| Iceberg    | hive-exec           | [Download](https://mvnrepository.com/artifact/org.apache.hive/hive-exec)  |
+| Iceberg    | libfb303            | [Download](https://mvnrepository.com/artifact/org.apache.thrift/libfb303) |
-Use projection to select data columns and columns order.
+## Database Dependency
-e.g.
+> In order to be compatible with different versions of Hadoop and Hive, the scopes of hive-exec and flink-shaded-hadoop-2 in the project pom file are set to provided. So if you use the Flink engine, you may first need to add the following Jar packages to the <FLINK_HOME>/lib directory; if you are using the Spark engine integrated with Hadoop, you do not need to add them.
```
-schema {
- fields {
- f2 = "boolean"
- f1 = "bigint"
- f3 = "int"
- f4 = "bigint"
- }
-}
+flink-shaded-hadoop-x-xxx.jar
+hive-exec-xxx.jar
+libfb303-xxx.jar
```
-### start_snapshot_id [long]
-
-Instructs this scan to look for changes starting from a particular snapshot (exclusive).
-
-### start_snapshot_timestamp [long]
-
-Instructs this scan to look for changes starting from the most recent snapshot for the table as of the timestamp. timestamp – the timestamp in millis since the Unix epoch
-
-### end_snapshot_id [long]
-
-Instructs this scan to look for changes up to a particular snapshot (inclusive).
-
-### use_snapshot_id [long]
-
-Instructs this scan to look for use the given snapshot ID.
-
-### use_snapshot_timestamp [long]
-
-Instructs this scan to look for use the most recent snapshot as of the given time in milliseconds. timestamp – the timestamp in millis since the Unix epoch
-
-### stream_scan_strategy [enum]
-
-Starting strategy for stream mode execution, Default to use `FROM_LATEST_SNAPSHOT` if don’t specify any value.
-The optional values are:
-- TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.
-- FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.
-- FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.
-- FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive.
-- FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive.
-
-### common options
-
-Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
-
-## Example
-
-simple
+> Some versions of the hive-exec package do not have libfb303-xxx.jar, so you also need to manually import the Jar package.
+
+## Data Type Mapping
+
+| Iceberg Data type | SeaTunnel Data type |
+|-------------------|---------------------|
+| BOOLEAN | BOOLEAN |
+| INTEGER | INT |
+| LONG | BIGINT |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DATE | DATE |
+| TIME | TIME |
+| TIMESTAMP | TIMESTAMP |
+| STRING | STRING |
+| FIXED<br/>BINARY | BYTES |
+| DECIMAL | DECIMAL |
+| STRUCT | ROW |
+| LIST | ARRAY |
+| MAP | MAP |
+
+## Source Options
+
+| Name                     | Type    | Required | Default              | Description [...]
+|--------------------------|---------|----------|----------------------|--- [...]
+| catalog_name             | string  | yes      | -                    | User-specified catalog name. [...]
+| catalog_type             | string  | yes      | -                    | The optional values are: hive(The hive metastore catalog), hadoop(The hadoop catalog) [...]
+| uri                      | string  | no       | -                    | The Hive metastore’s thrift URI. [...]
+| warehouse                | string  | yes      | -                    | The location to store metadata files and data files. [...]
+| namespace                | string  | yes      | -                    | The iceberg database name in the backend catalog. [...]
+| table                    | string  | yes      | -                    | The iceberg table name in the backend catalog. [...]
+| schema                   | config  | no       | -                    | Use projection to select data columns and columns order. [...]
+| case_sensitive           | boolean | no       | false                | If data columns were selected via schema [config], controls whether the match to the schema will be done with case sensitivity. [...]
+| start_snapshot_timestamp | long    | no       | -                    | Instructs this scan to look for changes starting from the most recent snapshot for the table as of the timestamp. <br/>timestamp – the timestamp in millis since the Unix epoch [...]
+| start_snapshot_id        | long    | no       | -                    | Instructs this scan to look for changes starting from a particular snapshot (exclusive). [...]
+| end_snapshot_id          | long    | no       | -                    | Instructs this scan to look for changes up to a particular snapshot (inclusive). [...]
+| use_snapshot_id          | long    | no       | -                    | Instructs this scan to use the given snapshot ID. [...]
+| use_snapshot_timestamp   | long    | no       | -                    | Instructs this scan to use the most recent snapshot as of the given time in milliseconds. timestamp – the timestamp in millis since the Unix epoch [...]
+| stream_scan_strategy     | enum    | no       | FROM_LATEST_SNAPSHOT | Starting strategy for stream mode execution. Defaults to `FROM_LATEST_SNAPSHOT` if no value is specified. The optional values are:<br/>TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.<br/>FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.<br/>FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.<br/>FROM_SNAPSHO [...]
+| common-options           |         | no       | -                    | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. [...]
+
+## Task Example
+
+### Simple:
```hocon
+env {
+ execution.parallelism = 2
+ job.mode = "BATCH"
+}
+
source {
Iceberg {
+ schema {
+ fields {
+ f2 = "boolean"
+ f1 = "bigint"
+ f3 = "int"
+ f4 = "bigint"
+ f5 = "float"
+ f6 = "double"
+ f7 = "date"
+ f9 = "timestamp"
+ f10 = "timestamp"
+ f11 = "string"
+ f12 = "bytes"
+ f13 = "bytes"
+ f14 = "decimal(19,9)"
+ f15 = "array<int>"
+ f16 = "map<string, int>"
+ }
+ }
catalog_name = "seatunnel"
catalog_type = "hadoop"
- warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
- namespace = "your_iceberg_database"
- table = "your_iceberg_table"
+ warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
+ namespace = "database1"
+ table = "source"
+ result_table_name = "iceberg"
+ }
+}
+
+transform {
+}
+
+sink {
+ Console {
+ source_table_name = "iceberg"
}
}
```
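The options table above also covers stream mode; a minimal streaming-read sketch, assuming the documented `stream_scan_strategy` option and the same hypothetical hadoop-catalog paths used in the batch example:

```hocon
env {
  execution.parallelism = 2
  # Run the job in stream mode instead of BATCH
  job.mode = "STREAMING"
}

source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
    namespace = "database1"
    table = "source"
    # One of the documented enum values: start incremental reading
    # from the earliest snapshot (inclusive)
    stream_scan_strategy = "FROM_EARLIEST_SNAPSHOT"
    result_table_name = "iceberg"
  }
}

sink {
  Console {
    source_table_name = "iceberg"
  }
}
```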
-Or
+### Hive Catalog:
```hocon
source {
@@ -156,7 +156,7 @@ source {
}
```
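The hunk above elides most of the hive-catalog example; a sketch of what such a source block typically contains, built only from the documented `catalog_type`, `uri`, and `warehouse` options (the thrift host/port and paths are hypothetical):

```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    # Use the hive metastore catalog instead of the hadoop catalog
    catalog_type = "hive"
    # The Hive metastore's thrift URI (hypothetical host/port)
    uri = "thrift://localhost:9083"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"
  }
}
```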
-column projection
+### Column Projection:
```hocon
source {
@@ -179,20 +179,6 @@ source {
}
```
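For snapshot-pinned (time-travel) reads, the options table documents `use_snapshot_id` and `use_snapshot_timestamp`; a hedged sketch, with a hypothetical snapshot id:

```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
    namespace = "database1"
    table = "source"
    # Pin the scan to one snapshot (hypothetical id). Alternatively,
    # use_snapshot_timestamp selects the most recent snapshot as of the
    # given epoch-millis timestamp.
    use_snapshot_id = 6805470669493823297
  }
}
```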
-:::tip
-
-In order to be compatible with different versions of Hadoop and Hive, the scope of hive-exec and flink-shaded-hadoop-2 in the project pom file are provided, so if you use the Flink engine, first you may need to add the following Jar packages to <FLINK_HOME>/lib directory, if you are using the Spark engine and integrated with Hadoop, then you do not need to add the following Jar packages.
-
-:::
-
-```
-flink-shaded-hadoop-x-xxx.jar
-hive-exec-xxx.jar
-libfb303-xxx.jar
-```
-
-Some versions of the hive-exec package do not have libfb303-xxx.jar, so you also need to manually import the Jar package.
-
## Changelog
### 2.2.0-beta 2022-09-26
diff --git a/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java b/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
index ac9f8c12bb..2f893da092 100644
--- a/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
+++ b/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
@@ -26,7 +26,6 @@ import lombok.Getter;
import lombok.ToString;
import java.io.Serializable;
-import java.util.List;
import static org.apache.seatunnel.connectors.seatunnel.iceberg.config.IcebergCatalogType.HADOOP;
import static org.apache.seatunnel.connectors.seatunnel.iceberg.config.IcebergCatalogType.HIVE;
@@ -80,12 +79,6 @@ public class CommonConfig implements Serializable {
.defaultValue(false)
.withDescription(" the iceberg case_sensitive");
- public static final Option<List<String>> KEY_FIELDS =
- Options.key("fields")
- .listType()
- .noDefaultValue()
- .withDescription(" the iceberg table fields");
-
private String catalogName;
private IcebergCatalogType catalogType;
private String uri;