This is an automated email from the ASF dual-hosted git repository.
liugddx pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 659a68a0be [Doc][Iceberg] Improved iceberg documentation (#5335)
659a68a0be is described below
commit 659a68a0be295350891a844d8d903bdbbc3c5b4f
Author: Carl-Zhou-CN <[email protected]>
AuthorDate: Wed Aug 23 22:55:08 2023 +0800
[Doc][Iceberg] Improved iceberg documentation (#5335)
* [Doc][Iceberg] Improved iceberg documentation
* [Doc][Iceberg] Improved iceberg documentation
---------
Co-authored-by: zhouyao <[email protected]>
---
docs/en/connector-v2/source/Iceberg.md | 222 ++++++++++-----------
.../seatunnel/iceberg/config/CommonConfig.java | 7 -
2 files changed, 104 insertions(+), 125 deletions(-)
diff --git a/docs/en/connector-v2/source/Iceberg.md b/docs/en/connector-v2/source/Iceberg.md
index 6a42ee0ddd..b6d3924b95 100644
--- a/docs/en/connector-v2/source/Iceberg.md
+++ b/docs/en/connector-v2/source/Iceberg.md
@@ -2,9 +2,15 @@
> Apache Iceberg source connector
-## Description
+## Support Iceberg Version
-Source connector for Apache Iceberg. It can support batch and stream mode.
+- 0.14.0
+
+## Support Those Engines
+
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
## Key features
@@ -22,126 +28,120 @@ Source connector for Apache Iceberg. It can support batch and stream mode.
- [x] hadoop(2.7.1 , 2.7.5 , 3.1.3)
- [x] hive(2.3.9 , 3.1.2)
-## Options
-
-| name | type | required | default value |
-|--------------------------|---------|----------|----------------------|
-| catalog_name | string | yes | - |
-| catalog_type | string | yes | - |
-| uri | string | no | - |
-| warehouse | string | yes | - |
-| namespace | string | yes | - |
-| table | string | yes | - |
-| schema | config | no | - |
-| case_sensitive | boolean | no | false |
-| start_snapshot_timestamp | long | no | - |
-| start_snapshot_id | long | no | - |
-| end_snapshot_id | long | no | - |
-| use_snapshot_id | long | no | - |
-| use_snapshot_timestamp | long | no | - |
-| stream_scan_strategy | enum | no | FROM_LATEST_SNAPSHOT |
-| common-options | | no | - |
-
-### catalog_name [string]
-
-User-specified catalog name.
-
-### catalog_type [string]
-
-The optional values are:
-- hive: The hive metastore catalog.
-- hadoop: The hadoop catalog.
-
-### uri [string]
-
-The Hive metastore’s thrift URI.
-
-### warehouse [string]
-
-The location to store metadata files and data files.
-
-### namespace [string]
-
-The iceberg database name in the backend catalog.
-
-### table [string]
-
-The iceberg table name in the backend catalog.
-
-### case_sensitive [boolean]
+## Description
-If data columns where selected via schema [config], controls whether the match to the schema will be done with case sensitivity.
+Source connector for Apache Iceberg. It can support batch and stream mode.
-### schema [config]
+## Supported DataSource Info
-#### fields [Config]
+| Datasource | Dependent           | Maven                                                                     |
+|------------|---------------------|---------------------------------------------------------------------------|
+| Iceberg    | flink-shaded-hadoop | [Download](https://mvnrepository.com/search?q=flink-shaded-hadoop-)       |
+| Iceberg    | hive-exec           | [Download](https://mvnrepository.com/artifact/org.apache.hive/hive-exec)  |
+| Iceberg    | libfb303            | [Download](https://mvnrepository.com/artifact/org.apache.thrift/libfb303) |
-Use projection to select data columns and columns order.
+## Database Dependency
-e.g.
+> In order to be compatible with different versions of Hadoop and Hive, the scopes of hive-exec and flink-shaded-hadoop-2 in the project pom file are set to provided. So if you use the Flink engine, you may first need to add the following Jar packages to the <FLINK_HOME>/lib directory; if you are using the Spark engine integrated with Hadoop, you do not need to add them.
```
-schema {
- fields {
- f2 = "boolean"
- f1 = "bigint"
- f3 = "int"
- f4 = "bigint"
- }
-}
+flink-shaded-hadoop-x-xxx.jar
+hive-exec-xxx.jar
+libfb303-xxx.jar
```
-### start_snapshot_id [long]
-
-Instructs this scan to look for changes starting from a particular snapshot (exclusive).
-
-### start_snapshot_timestamp [long]
-
-Instructs this scan to look for changes starting from the most recent snapshot for the table as of the timestamp. timestamp – the timestamp in millis since the Unix epoch
-
-### end_snapshot_id [long]
-
-Instructs this scan to look for changes up to a particular snapshot (inclusive).
-
-### use_snapshot_id [long]
-
-Instructs this scan to look for use the given snapshot ID.
-
-### use_snapshot_timestamp [long]
-
-Instructs this scan to look for use the most recent snapshot as of the given time in milliseconds. timestamp – the timestamp in millis since the Unix epoch
-
-### stream_scan_strategy [enum]
-
-Starting strategy for stream mode execution, Default to use `FROM_LATEST_SNAPSHOT` if don’t specify any value.
-The optional values are:
-- TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.
-- FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.
-- FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.
-- FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive.
-- FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive.
-
-### common options
-
-Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
-
-## Example
-
-simple
+> Some versions of the hive-exec package do not have libfb303-xxx.jar, so you also need to manually import the Jar package.
+
+## Data Type Mapping
+
+| Iceberg Data type | SeaTunnel Data type |
+|-------------------|---------------------|
+| BOOLEAN | BOOLEAN |
+| INTEGER | INT |
+| LONG | BIGINT |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DATE | DATE |
+| TIME | TIME |
+| TIMESTAMP | TIMESTAMP |
+| STRING | STRING |
+| FIXED<br/>BINARY | BYTES |
+| DECIMAL | DECIMAL |
+| STRUCT | ROW |
+| LIST | ARRAY |
+| MAP | MAP |
+
+## Source Options
+
+| Name                     | Type    | Required | Default              | Description [...]
+|--------------------------|---------|----------|----------------------|--- [...]
+| catalog_name             | string  | yes      | -                    | User-specified catalog name. [...]
+| catalog_type             | string  | yes      | -                    | The optional values are: hive(The hive metastore catalog), hadoop(The hadoop catalog) [...]
+| uri                      | string  | no       | -                    | The Hive metastore’s thrift URI. [...]
+| warehouse                | string  | yes      | -                    | The location to store metadata files and data files. [...]
+| namespace                | string  | yes      | -                    | The iceberg database name in the backend catalog. [...]
+| table                    | string  | yes      | -                    | The iceberg table name in the backend catalog. [...]
+| schema                   | config  | no       | -                    | Use projection to select data columns and columns order. [...]
+| case_sensitive           | boolean | no       | false                | If data columns were selected via schema [config], controls whether the match to the schema will be done with case sensitivity. [...]
+| start_snapshot_timestamp | long    | no       | -                    | Instructs this scan to look for changes starting from the most recent snapshot for the table as of the timestamp. <br/>timestamp – the timestamp in millis since the Unix epoch [...]
+| start_snapshot_id        | long    | no       | -                    | Instructs this scan to look for changes starting from a particular snapshot (exclusive). [...]
+| end_snapshot_id          | long    | no       | -                    | Instructs this scan to look for changes up to a particular snapshot (inclusive). [...]
+| use_snapshot_id          | long    | no       | -                    | Instructs this scan to use the given snapshot ID. [...]
+| use_snapshot_timestamp   | long    | no       | -                    | Instructs this scan to use the most recent snapshot as of the given time in milliseconds. timestamp – the timestamp in millis since the Unix epoch [...]
+| stream_scan_strategy     | enum    | no       | FROM_LATEST_SNAPSHOT | Starting strategy for stream mode execution. Defaults to `FROM_LATEST_SNAPSHOT` if no value is specified. The optional values are:<br/>TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode.<br/>FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive.<br/>FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive.<br/>FROM_SNAPSHO [...]
+| common-options           |         | no       | -                    | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. [...]
+
+## Task Example
+
+### Simple:
```hocon
+env {
+ execution.parallelism = 2
+ job.mode = "BATCH"
+}
+
source {
Iceberg {
+ schema {
+ fields {
+ f2 = "boolean"
+ f1 = "bigint"
+ f3 = "int"
+ f4 = "bigint"
+ f5 = "float"
+ f6 = "double"
+ f7 = "date"
+ f9 = "timestamp"
+ f10 = "timestamp"
+ f11 = "string"
+ f12 = "bytes"
+ f13 = "bytes"
+ f14 = "decimal(19,9)"
+ f15 = "array<int>"
+ f16 = "map<string, int>"
+ }
+ }
catalog_name = "seatunnel"
catalog_type = "hadoop"
- warehouse = "hdfs://your_cluster//tmp/seatunnel/iceberg/"
- namespace = "your_iceberg_database"
- table = "your_iceberg_table"
+ warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
+ namespace = "database1"
+ table = "source"
+ result_table_name = "iceberg"
+ }
+}
+
+transform {
+}
+
+sink {
+ Console {
+ source_table_name = "iceberg"
}
}
```
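The options table above also covers stream mode; a minimal streaming-read sketch, assuming the documented `stream_scan_strategy` option and the same hypothetical hadoop-catalog paths used in the batch example:

```hocon
env {
  execution.parallelism = 2
  # Run the job in stream mode instead of BATCH
  job.mode = "STREAMING"
}

source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
    namespace = "database1"
    table = "source"
    # One of the documented enum values: start incremental reading
    # from the earliest snapshot (inclusive)
    stream_scan_strategy = "FROM_EARLIEST_SNAPSHOT"
    result_table_name = "iceberg"
  }
}

sink {
  Console {
    source_table_name = "iceberg"
  }
}
```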
-Or
+### Hive Catalog:
```hocon
source {
@@ -156,7 +156,7 @@ source {
}
```
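The hunk above elides most of the hive-catalog example; a sketch of what such a source block typically contains, built only from the documented `catalog_type`, `uri`, and `warehouse` options (the thrift host/port and paths are hypothetical):

```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    # Use the hive metastore catalog instead of the hadoop catalog
    catalog_type = "hive"
    # The Hive metastore's thrift URI (hypothetical host/port)
    uri = "thrift://localhost:9083"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"
  }
}
```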
-column projection
+### Column Projection:
```hocon
source {
@@ -179,20 +179,6 @@ source {
}
```
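For snapshot-pinned (time-travel) reads, the options table documents `use_snapshot_id` and `use_snapshot_timestamp`; a hedged sketch, with a hypothetical snapshot id:

```hocon
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "file:///tmp/seatunnel/iceberg/hadoop/"
    namespace = "database1"
    table = "source"
    # Pin the scan to one snapshot (hypothetical id). Alternatively,
    # use_snapshot_timestamp selects the most recent snapshot as of the
    # given epoch-millis timestamp.
    use_snapshot_id = 6805470669493823297
  }
}
```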
-:::tip
-
-In order to be compatible with different versions of Hadoop and Hive, the scope of hive-exec and flink-shaded-hadoop-2 in the project pom file are provided, so if you use the Flink engine, first you may need to add the following Jar packages to <FLINK_HOME>/lib directory, if you are using the Spark engine and integrated with Hadoop, then you do not need to add the following Jar packages.
-
-:::
-
-```
-flink-shaded-hadoop-x-xxx.jar
-hive-exec-xxx.jar
-libfb303-xxx.jar
-```
-
-Some versions of the hive-exec package do not have libfb303-xxx.jar, so you also need to manually import the Jar package.
-
## Changelog
### 2.2.0-beta 2022-09-26
diff --git a/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java b/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
index ac9f8c12bb..2f893da092 100644
--- a/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
+++ b/seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/config/CommonConfig.java
@@ -26,7 +26,6 @@ import lombok.Getter;
import lombok.ToString;
import java.io.Serializable;
-import java.util.List;
import static org.apache.seatunnel.connectors.seatunnel.iceberg.config.IcebergCatalogType.HADOOP;
import static org.apache.seatunnel.connectors.seatunnel.iceberg.config.IcebergCatalogType.HIVE;
@@ -80,12 +79,6 @@ public class CommonConfig implements Serializable {
.defaultValue(false)
.withDescription(" the iceberg case_sensitive");
- public static final Option<List<String>> KEY_FIELDS =
- Options.key("fields")
- .listType()
- .noDefaultValue()
- .withDescription(" the iceberg table fields");
-
private String catalogName;
private IcebergCatalogType catalogType;
private String uri;