This is an automated email from the ASF dual-hosted git repository.
liugddx pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new de9a3243c8 [Docs][Connector-V2][HDFS]Refactor connector-v2 docs using unified format HDFS. (#4871)
de9a3243c8 is described below
commit de9a3243c89dbd1cc1d3c477b25745a8dbc3c2c7
Author: lightzhao <[email protected]>
AuthorDate: Mon Aug 14 17:08:00 2023 +0800
[Docs][Connector-V2][HDFS]Refactor connector-v2 docs using unified format HDFS. (#4871)
* Refactor connector-v2 docs using unified format HDFS.
* add data type.
* update.
* add key feature.
* add hdfs_site_path
* 1.add data type.
2.add hdfs_site_path conf.
* add data type.
* add hdfs site conf.
---------
Co-authored-by: lightzhao <[email protected]>
Co-authored-by: liuli <[email protected]>
---
docs/en/connector-v2/sink/HdfsFile.md | 326 ++++++++++++--------------------
docs/en/connector-v2/source/HdfsFile.md | 307 ++++++------------------------
2 files changed, 185 insertions(+), 448 deletions(-)
diff --git a/docs/en/connector-v2/sink/HdfsFile.md b/docs/en/connector-v2/sink/HdfsFile.md
index 34ce19714b..135c5115c2 100644
--- a/docs/en/connector-v2/sink/HdfsFile.md
+++ b/docs/en/connector-v2/sink/HdfsFile.md
@@ -1,20 +1,14 @@
# HdfsFile
-> HDFS file sink connector
+> HDFS File Sink Connector
-## Description
-
-Output data to hdfs file
-
-:::tip
-
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
+## Support Those Engines
-:::
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-## Key features
+## Key Features
- [x] [exactly-once](../../concept/connector-v2-features.md)
@@ -30,183 +24,120 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] compress codec
- [x] lzo
-## Options
-
-| name                             | type    | required | default value                              | remarks                                                   |
-|----------------------------------|---------|----------|--------------------------------------------|-----------------------------------------------------------|
-| fs.defaultFS                     | string  | yes      | -                                          |                                                           |
-| path                             | string  | yes      | -                                          |                                                           |
-| hdfs_site_path                   | string  | no       | -                                          |                                                           |
-| custom_filename                  | boolean | no       | false                                      | Whether you need custom the filename                      |
-| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when custom_filename is true                    |
-| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when custom_filename is true                    |
-| file_format_type                 | string  | no       | "csv"                                      |                                                           |
-| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text                   |
-| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text                   |
-| have_partition                   | boolean | no       | false                                      | Whether you need processing partitions.                   |
-| partition_by                     | array   | no       | -                                          | Only used then have_partition is true                     |
-| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is true                     |
-| is_partition_field_write_in_file | boolean | no       | false                                      | Only used then have_partition is true                     |
-| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns |
-| is_enable_transaction            | boolean | no       | true                                       |                                                           |
-| batch_size                       | int     | no       | 1000000                                    |                                                           |
-| compress_codec                   | string  | no       | none                                       |                                                           |
-| kerberos_principal               | string  | no       | -                                          |                                                           |
-| kerberos_keytab_path             | string  | no       | -                                          |                                                           |
-| compress_codec                   | string  | no       | none                                       |                                                           |
-| common-options                   | object  | no       | -                                          |                                                           |
-| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel.                 |
-| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel.                 |
-
-### fs.defaultFS [string]
-
-The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster`
-
-### path [string]
-
-The target dir path is required.
-
-### hdfs_site_path [string]
-
-The path of `hdfs-site.xml`, used to load ha configuration of namenodes
-
-### custom_filename [boolean]
-
-Whether custom the filename
-
-### file_name_expression [string]
-
-Only used when `custom_filename` is `true`
-
-`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`,
-`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.
-
-Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.
-
-### filename_time_format [string]
-
-Only used when `custom_filename` is `true`
-
-When the format in the `file_name_expression` parameter is `xxxx-${now}` , `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd` . The commonly used time formats are listed as follows:
-
-| Symbol | Description |
-|--------|--------------------|
-| y | Year |
-| M | Month |
-| d | Day of month |
-| H | Hour in day (0-23) |
-| m | Minute in hour |
-| s | Second in minute |
-
-### file_format_type [string]
-
-We supported as the following file types:
-
-`text` `json` `csv` `orc` `parquet` `excel`
-
-Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
-
-### field_delimiter [string]
-
-The separator between columns in a row of data. Only needed by `text` file format.
-
-### row_delimiter [string]
-
-The separator between rows in a file. Only needed by `text` file format.
-
-### have_partition [boolean]
-
-Whether you need processing partitions.
-
-### partition_by [array]
-
-Only used when `have_partition` is `true`.
-
-Partition data based on selected fields.
-
-### partition_dir_expression [string]
-
-Only used when `have_partition` is `true`.
-
-If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
-
-Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
-
-### is_partition_field_write_in_file [boolean]
-
-Only used when `have_partition` is `true`.
-
-If `is_partition_field_write_in_file` is `true`, the partition field and the value of it will be write into data file.
-
-For example, if you want to write a Hive Data File, Its value should be `false`.
-
-### sink_columns [array]
-
-Which columns need be write to file, default value is all of the columns get from `Transform` or `Source`.
-The order of the fields determines the order in which the file is actually written.
-
-### is_enable_transaction [boolean]
-
-If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory.
-
-Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.
-
-Only support `true` now.
-
-### batch_size [int]
-
-The maximum number of rows in a file. For SeaTunnel Engine, the number of lines in the file is determined by `batch_size` and `checkpoint.interval` jointly decide. If the value of `checkpoint.interval` is large enough, sink writer will write rows in a file until the rows in the file larger than `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file when a new checkpoint trigger.
-
-### compress_codec [string]
-
-The compress codec of files and the details that supported as the following shown:
-
-- txt: `lzo` `none`
-- json: `lzo` `none`
-- csv: `lzo` `none`
-- orc: `lzo` `snappy` `lz4` `zlib` `none`
-- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
-
-Tips: excel type does not support any compression format
-
-### kerberos_principal [string]
-
-The principal of kerberos
-
-### kerberos_keytab_path [string]
-
-The keytab path of kerberos
-
-### common options
-
-Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details
+## Description
-### max_rows_in_memory [int]
+Output data to hdfs file
-When File Format is Excel,The maximum number of data items that can be cached in the memory.
+## Supported DataSource Info
+
+| Datasource | Supported Versions |
+|------------|--------------------|
+| HdfsFile | hadoop 2.x and 3.x |
+
+## Sink Options
+
+| Name                             | Type    | Required | Default                                    | Description [...]
+|----------------------------------|---------|----------|--------------------------------------------|------------------------------------------------------------------------------------------------ [...]
+| fs.defaultFS                     | string  | yes      | -                                          | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` [...]
+| path                             | string  | yes      | -                                          | The target dir path is required. [...]
+| hdfs_site_path                   | string  | no       | -                                          | The path of `hdfs-site.xml`, used to load ha configuration of namenodes [...]
+| custom_filename                  | boolean | no       | false                                      | Whether you need custom the filename [...]
+| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when `custom_filename` is `true`. `file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`. `${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`. Please note that, If `is_enable_tr [...]
+| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when `custom_filename` is `true`. When the format in the `file_name_expression` parameter is `xxxx-${now}`, `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd`. The commonly used time formats are listed as follows: [y:Year, M:Month, d:Day of month, H:Hour in day (0-23), m:Minute in hour, s:Second in minute] [...]
+| file_format_type                 | string  | no       | "csv"                                      | We support the following file types: `text` `json` `csv` `orc` `parquet` `excel`. Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`. [...]
+| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text. The separator between columns in a row of data. [...]
+| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text. The separator between rows in a file. [...]
+| have_partition                   | boolean | no       | false                                      | Whether you need processing partitions. [...]
+| partition_by                     | array   | no       | -                                          | Only used when have_partition is true. Partition data based on selected fields. [...]
+| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true. If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory. Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition f [...]
+| is_partition_field_write_in_file | boolean | no       | false                                      | Only used when `have_partition` is `true`. If `is_partition_field_write_in_file` is `true`, the partition field and the value of it will be written into the data file. For example, if you want to write a Hive Data File, its value should be `false`. [...]
+| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns. Which columns need to be written to the file; the default value is all of the columns got from `Transform` or `Source`. The order of the fields determines the order in which the file is actually written. [...]
+| is_enable_transaction            | boolean | no       | true                                       | If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory. Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file. Only support `true` now. [...]
+| batch_size                       | int     | no       | 1000000                                    | The maximum number of rows in a file. For SeaTunnel Engine, the number of lines in the file is determined by `batch_size` and `checkpoint.interval` jointly. If the value of `checkpoint.interval` is large enough, the sink writer will write rows into a file until the rows in the file are larger than `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file when [...]
+| compress_codec                   | string  | no       | none                                       | The compress codec of files, supported as the following shown: [txt: `lzo` `none`, json: `lzo` `none`, csv: `lzo` `none`, orc: `lzo` `snappy` `lz4` `zlib` `none`, parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`]. Tips: the excel type does not support any compression format. [...]
+| kerberos_principal               | string  | no       | -                                          | The principal of kerberos [...]
+| kerberos_keytab_path             | string  | no       | -                                          | The keytab path of kerberos [...]
+| compress_codec                   | string  | no       | none                                       | compress codec [...]
+| common-options                   | object  | no       | -                                          | Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details [...]
+| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel. The maximum number of data items that can be cached in the memory. [...]
+| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel. Write the sheet of the workbook. [...]
+
+### Tips
+
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster is already integrated with Hadoop. The tested Hadoop version is 2.x. If you use SeaTunnel Engine, it automatically integrates the Hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
+
+## Task Example
+
+### Simple:
+
+> This example defines a SeaTunnel synchronization task that automatically generates data through FakeSource and sends it to Hdfs.
-### sheet_name [string]
+```
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
+}
-Writer the sheet of the workbook
+source {
+  # This is an example source plugin **only for test and demonstrate the feature source plugin**
+ FakeSource {
+ parallelism = 1
+ result_table_name = "fake"
+ row.num = 16
+ schema = {
+ fields {
+ c_map = "map<string, smallint>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
+  # please go to https://seatunnel.apache.org/docs/category/source-v2
+}
-## Example
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+  # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
-For orc file format simple config
+sink {
+ HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+    file_format_type = "orc"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
+  # please go to https://seatunnel.apache.org/docs/category/sink-v2
+}
+```
-```bash
+### For orc file format simple config
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
   file_format_type = "orc"
}
-
```
-For text file format with `have_partition` and `custom_filename` and `sink_columns`
-
-```bash
+### For text file format with `have_partition` and `custom_filename` and `sink_columns`
+```
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
@@ -223,13 +154,11 @@ HdfsFile {
sink_columns = ["name","age"]
is_enable_transaction = true
}
-
```
-For parquet file format with `have_partition` and `custom_filename` and `sink_columns`
-
-```bash
+### For parquet file format with `have_partition` and `custom_filename` and `sink_columns`
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
@@ -244,32 +173,27 @@ HdfsFile {
sink_columns = ["name","age"]
is_enable_transaction = true
}
-
```
-## Changelog
+### For kerberos simple config
-### 2.2.0-beta 2022-09-26
-
-- Add HDFS File Sink Connector
-
-### 2.3.0-beta 2022-10-20
-
-- [BugFix] Fix the bug of incorrect path in windows environment ([2980](https://github.com/apache/seatunnel/pull/2980))
-- [BugFix] Fix filesystem get error ([3117](https://github.com/apache/seatunnel/pull/3117))
-- [BugFix] Solved the bug of can not parse '\t' as delimiter from config file ([3083](https://github.com/apache/seatunnel/pull/3083))
-
-### 2.3.0 2022-12-30
-
-- [BugFix] Fixed the following bugs that failed to write data to files ([3258](https://github.com/apache/seatunnel/pull/3258))
- - When field from upstream is null it will throw NullPointerException
- - Sink columns mapping failed
- - When restore writer from states getting transaction directly failed
+```
+HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+ hdfs_site_path = "/path/to/your/hdfs_site_path"
+ kerberos_principal = "[email protected]"
+ kerberos_keytab_path = "/path/to/your/keytab/file.keytab"
+}
+```
-### Next version
+### For compress simple config
-- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/seatunnel/pull/3625))
-- [Improve] Support lzo compression for text in file format ([3782](https://github.com/apache/seatunnel/pull/3782))
-- [Improve] Support kerberos authentication ([3840](https://github.com/apache/seatunnel/pull/3840))
-- [Improve] Support file compress ([3899](https://github.com/apache/seatunnel/pull/3899))
+```
+HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+ compress_codec = "lzo"
+}
+```
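[Editor's note on the sink doc above] The new options table explains `custom_filename`, `file_name_expression`, and `filename_time_format`, but the commit's truncated examples do not show them together. A minimal sketch of how they compose, with illustrative path and expression values that are not taken from the commit:

```
HdfsFile {
  fs.defaultFS = "hdfs://hadoopcluster"
  path = "/tmp/hive/warehouse/test2"
  file_format_type = "text"
  custom_filename = true
  file_name_expression = "test_${uuid}_${now}"
  filename_time_format = "yyyy.MM.dd"
  # ${now} is rendered using filename_time_format; with is_enable_transaction = true
  # (the default), "${transactionId}_" is prefixed to the file name automatically.
}
```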
diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
index f479e40a2b..88c1e35f87 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -1,20 +1,14 @@
# HdfsFile
-> Hdfs file source connector
+> Hdfs File Source Connector
-## Description
-
-Read data from hdfs file system.
-
-:::tip
+## Support Those Engines
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-
-:::
-
-## Key features
+## Key Features
- [x] [batch](../../concept/connector-v2-features.md)
- [ ] [stream](../../concept/connector-v2-features.md)
@@ -33,238 +27,57 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
-## Options
-
-| name | type | required | default value |
-|---------------------------|---------|----------|---------------------|
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| fs.defaultFS | string | yes | - |
-| read_columns | list | yes | - |
-| hdfs_site_path | string | no | - |
-| delimiter | string | no | \001 |
-| parse_partition_from_path | boolean | no | true |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| kerberos_principal | string | no | - |
-| kerberos_keytab_path | string | no | - |
-| skip_header_row_number | long | no | 0 |
-| schema | config | no | - |
-| common-options | | no | - |
-| sheet_name | string | no | - |
-| file_filter_pattern | string | no | - |
-
-### path [string]
-
-The source file path.
-
-### delimiter [string]
-
-Field delimiter, used to tell connector how to slice and dice fields when reading text files
-
-default `\001`, the same as hive's default delimiter
-
-### parse_partition_from_path [boolean]
-
-Control whether parse the partition keys and values from file path
-
-For example if you read a file from path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`
-
-Every record data from file will be added these two fields:
-
-| name | age |
-|---------------|-----|
-| tyrantlucifer | 26 |
-
-Tips: **Do not define partition fields in schema option**
-
-### date_format [string]
-
-Date type format, used to tell connector how to convert string to date, supported as the following formats:
-
-`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
-
-default `yyyy-MM-dd`
-
-### datetime_format [string]
-
-Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats:
-
-`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`
-
-default `yyyy-MM-dd HH:mm:ss`
-
-### time_format [string]
-
-Time type format, used to tell connector how to convert string to time, supported as the following formats:
-
-`HH:mm:ss` `HH:mm:ss.SSS`
-
-default `HH:mm:ss`
-
-### skip_header_row_number [long]
-
-Skip the first few lines, but only for the txt and csv.
-
-For example, set like following:
-
-`skip_header_row_number = 2`
-
-then SeaTunnel will skip the first 2 lines from source files
-
-### file_format_type [string]
-
-File type, supported as the following file types:
-
-`text` `csv` `parquet` `orc` `json` `excel`
-
-If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
-
-For example:
-
-upstream data is the following:
-
-```json
-
-{"code": 200, "data": "get success", "success": true}
-
-```
-
-You can also save multiple pieces of data in one file and split them by newline:
-
-```json lines
-
-{"code": 200, "data": "get success", "success": true}
-{"code": 300, "data": "get failed", "success": false}
-
-```
-
-you should assign schema as the following:
-
-```hocon
-
-schema {
- fields {
- code = int
- data = string
- success = boolean
- }
-}
-
-```
-
-connector will generate data as the following:
-
-| code | data | success |
-|------|-------------|---------|
-| 200 | get success | true |
-
-If you assign file type to `parquet` `orc`, schema option not required, connector can find the schema of upstream data automatically.
-
-If you assign file type to `text` `csv`, you can choose to specify the schema information or not.
+## Description
-For example, upstream data is the following:
+Read data from hdfs file system.
-```text
+## Supported DataSource Info
-tyrantlucifer#26#male
+| Datasource | Supported Versions |
+|------------|--------------------|
+| HdfsFile | hadoop 2.x and 3.x |
-```
+## Source Options
-If you do not assign data schema connector will treat the upstream data as the following:
+| Name                      | Type    | Required | Default             | Description |
+|---------------------------|---------|----------|---------------------|-------------|
+| path                      | string  | yes      | -                   | The source file path. |
+| file_format_type          | string  | yes      | -                   | We support the following file types: `text` `json` `csv` `orc` `parquet` `excel`. Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`. |
+| fs.defaultFS              | string  | yes      | -                   | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` |
+| read_columns              | list    | yes      | -                   | The read column list of the data source; the user can use it to implement field projection. The file types that support column projection are: [text, json, csv, orc, parquet, excel]. Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured. |
+| hdfs_site_path            | string  | no       | -                   | The path of `hdfs-site.xml`, used to load ha configuration of namenodes |
+| delimiter                 | string  | no       | \001                | Field delimiter, used to tell connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. |
+| parse_partition_from_path | boolean | no       | true                | Control whether to parse the partition keys and values from the file path. For example, if you read a file from path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record from the file will be added these two fields: [name:tyrantlucifer, age:26]. Tips: Do not define partition fields in the schema option. |
+| date_format               | string  | no       | yyyy-MM-dd          | Date type format, used to tell connector how to convert string to date, supported as the following formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd`. |
+| datetime_format           | string  | no       | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`. Default `yyyy-MM-dd HH:mm:ss`. |
+| time_format               | string  | no       | HH:mm:ss            | Time type format, used to tell connector how to convert string to time, supported as the following formats: `HH:mm:ss` `HH:mm:ss.SSS`. Default `HH:mm:ss`. |
+| kerberos_principal        | string  | no       | -                   | The principal of kerberos |
+| kerberos_keytab_path      | string  | no       | -                   | The keytab path of kerberos |
+| skip_header_row_number    | long    | no       | 0                   | Skip the first few lines, but only for the txt and csv. For example, set `skip_header_row_number = 2`; then SeaTunnel will skip the first 2 lines from source files. |
+| schema                    | config  | no       | -                   | The schema fields of upstream data |
+| common-options            |         | no       | -                   | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |
+| sheet_name                | string  | no       | -                   | Read the sheet of the workbook. Only used when file_format_type is excel. |
-| content |
-|-----------------------|
-| tyrantlucifer#26#male |
+### Tips
-If you assign data schema, you should also assign the option `delimiter` too except CSV file type
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster is already integrated with Hadoop. The tested Hadoop version is 2.x. If you use SeaTunnel Engine, it automatically integrates the Hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-you should assign schema and delimiter as the following:
+## Task Example
-```hocon
+### Simple:
-delimiter = "#"
-schema {
- fields {
- name = string
- age = int
- gender = string
- }
-}
+> This example defines a SeaTunnel synchronization task that reads data from Hdfs and sends it to Hdfs.
```
-
-connector will generate data as the following:
-
-| name | age | gender |
-|---------------|-----|--------|
-| tyrantlucifer | 26 | male |
-
-### fs.defaultFS [string]
-
-Hdfs cluster address.
-
-### hdfs_site_path [string]
-
-The path of `hdfs-site.xml`, used to load ha configuration of namenodes
-
-### kerberos_principal [string]
-
-The principal of kerberos
-
-### kerberos_keytab_path [string]
-
-The keytab path of kerberos
-
-### schema [Config]
-
-#### fields [Config]
-
-the schema fields of upstream data
-
-### read_columns [list]
-
-The read column list of the data source, user can use it to implement field projection.
-
-The file type supported column projection as the following shown:
-
-- text
-- json
-- csv
-- orc
-- parquet
-- excel
-
-**Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured**
-
-### common options
-
-Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
-
-### sheet_name [string]
-
-Reader the sheet of the workbook,Only used when file_format_type is excel.
-
-### file_filter_pattern [string]
-
-Filter pattern, which used for filtering files.
-
-## Example
-
-```hocon
-
-HdfsFile {
- path = "/apps/hive/demo/student"
- file_format_type = "parquet"
- fs.defaultFS = "hdfs://namenode001"
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
}
-```
-
-```hocon
-
-HdfsFile {
+source {
+ HdfsFile {
schema {
fields {
name = string
@@ -274,24 +87,24 @@ HdfsFile {
path = "/apps/hive/demo/student"
type = "json"
fs.defaultFS = "hdfs://namenode001"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
+  # please go to https://seatunnel.apache.org/docs/category/source-v2
}
-```
-
-## Changelog
-
-### 2.2.0-beta 2022-09-26
-
-- Add HDFS File Source Connector
-
-### 2.3.0-beta 2022-10-20
-
-- [BugFix] Fix the bug of incorrect path in windows environment ([2980](https://github.com/apache/seatunnel/pull/2980))
-- [Improve] Support extract partition from SeaTunnelRow fields ([3085](https://github.com/apache/seatunnel/pull/3085))
-- [Improve] Support parse field from file path ([2985](https://github.com/apache/seatunnel/pull/2985))
-
-### next version
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+  # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
-- [Improve] Support skip header for csv and txt files ([3900](https://github.com/apache/seatunnel/pull/3840))
-- [Improve] Support kerberos authentication ([3840](https://github.com/apache/seatunnel/pull/3840))
+sink {
+ HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+    file_format_type = "orc"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
+  # please go to https://seatunnel.apache.org/docs/category/sink-v2
+}
+```
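[Editor's note on the source doc above] The new `read_columns` description says the schema option must be configured when projecting columns from `text` files, but the commit's example only covers `json`. A hedged sketch of a text-file source, reusing the `delimiter = "#"` and `name`/`age`/`gender` schema from the removed prose example; the path is illustrative:

```
source {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode001"
    path = "/apps/hive/demo/student"
    file_format_type = "text"
    delimiter = "#"
    # Partition keys such as name=.../age=... in the path are parsed into fields
    # when parse_partition_from_path = true (the default); do not also declare
    # them in the schema.
    schema {
      fields {
        name = string
        age = int
        gender = string
      }
    }
    read_columns = ["name", "age"]
  }
}
```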