This is an automated email from the ASF dual-hosted git repository.
wanghailin pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 755bc2a730 [Hotfix][Oss File Connector] fix oss connector can not run
bug (#6010)
755bc2a730 is described below
commit 755bc2a7307c5ff590c5a584e528405e88ef8aff
Author: Eric <[email protected]>
AuthorDate: Tue Dec 19 21:07:50 2023 +0800
[Hotfix][Oss File Connector] fix oss connector can not run bug (#6010)
---
docs/en/connector-v2/sink/OssFile.md | 254 +++++++++++++++++----
docs/en/connector-v2/source/OssFile.md | 193 ++++++++++------
.../connector-file/connector-file-oss/pom.xml | 23 +-
seatunnel-dist/pom.xml | 30 ++-
4 files changed, 380 insertions(+), 120 deletions(-)
diff --git a/docs/en/connector-v2/sink/OssFile.md
b/docs/en/connector-v2/sink/OssFile.md
index 3604748477..4bfe6cc9bc 100644
--- a/docs/en/connector-v2/sink/OssFile.md
+++ b/docs/en/connector-v2/sink/OssFile.md
@@ -8,6 +8,17 @@
> Flink<br/>
> SeaTunnel Zeta<br/>
+## Usage Dependency
+
+### For Spark/Flink Engine
+
+1. You must ensure your Spark/Flink cluster already has Hadoop integrated. The tested Hadoop version is 2.x.
+2. You must put `hadoop-aliyun-xx.jar`, `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` into the `${SEATUNNEL_HOME}/plugins/` dir. The version of the `hadoop-aliyun` jar must match the Hadoop version used by Spark/Flink, and the versions of `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` must correspond to that `hadoop-aliyun` version. E.g. `hadoop-aliyun-3.1.4.jar` depends on `aliyun-sdk-oss-3.4.1.jar` and `jdom-1.1.jar`.
+
+### For SeaTunnel Zeta Engine
+
+1. You must put `seatunnel-hadoop3-3.1.4-uber.jar`, `aliyun-sdk-oss-3.4.1.jar`, `hadoop-aliyun-3.1.4.jar` and `jdom-1.1.jar` into the `${SEATUNNEL_HOME}/lib/` dir.
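As a hedged sketch (the default `SEATUNNEL_HOME` path is a placeholder, not from this commit), the jar layout above can be checked with a short shell loop:

```shell
# Check that the jars listed above are present in ${SEATUNNEL_HOME}/lib/.
# /opt/seatunnel is only an illustrative default.
SEATUNNEL_HOME="${SEATUNNEL_HOME:-/opt/seatunnel}"
for jar in seatunnel-hadoop3-3.1.4-uber.jar aliyun-sdk-oss-3.4.1.jar \
           hadoop-aliyun-3.1.4.jar jdom-1.1.jar; do
  if [ -f "$SEATUNNEL_HOME/lib/$jar" ]; then
    echo "found:   $jar"
  else
    echo "missing: $jar"
  fi
done
```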
+
## Key features
- [x] [exactly-once](../../concept/connector-v2-features.md)
@@ -22,61 +33,114 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
-## Description
+## Data Type Mapping
-Output data to oss file system.
+If you write to the `csv` or `text` file type, all columns will be written as strings.
+
+### Orc File Type
+
+| SeaTunnel Data type | Orc Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | BYTE |
+| SMALLINT | SHORT |
+| INT | INT |
+| BIGINT | LONG |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP |
+| ROW | STRUCT |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| Map | Map |
+
+### Parquet File Type
+
+| SeaTunnel Data type | Parquet Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | INT_8 |
+| SMALLINT | INT_16 |
+| INT | INT32 |
+| BIGINT | INT64 |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP_MILLIS |
+| ROW | GroupType |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| Map | Map |
-## Supported DataSource Info
+## Options
-In order to use the OssFile connector, the following dependencies are required.
-They can be downloaded via install-plugin.sh or from the Maven central
repository.
+| name                             | type    | required | default value                              | remarks                                                   |
+|----------------------------------|---------|----------|--------------------------------------------|-----------------------------------------------------------|
+| path                             | string  | yes      | -                                          | The oss path to write files to.                           |
+| tmp_path                         | string  | no       | /tmp/seatunnel                             | The result file is written to a tmp path first and then moved (`mv`) to the target dir. Must be an OSS dir. |
+| bucket                           | string  | yes      | -                                          |                                                           |
+| access_key                       | string  | yes      | -                                          |                                                           |
+| access_secret                    | string  | yes      | -                                          |                                                           |
+| endpoint                         | string  | yes      | -                                          |                                                           |
+| custom_filename                  | boolean | no       | false                                      | Whether to customize the filename                         |
+| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when custom_filename is true                    |
+| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when custom_filename is true                    |
+| file_format_type                 | string  | no       | "csv"                                      |                                                           |
+| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text                   |
+| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text                   |
+| have_partition                   | boolean | no       | false                                      | Whether to process partitions.                            |
+| partition_by                     | array   | no       | -                                          | Only used when have_partition is true                     |
+| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true                     |
+| is_partition_field_write_in_file | boolean | no       | false                                      | Only used when have_partition is true                     |
+| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns |
+| is_enable_transaction            | boolean | no       | true                                       |                                                           |
+| batch_size                       | int     | no       | 1000000                                    |                                                           |
+| compress_codec                   | string  | no       | none                                       |                                                           |
+| common-options                   | object  | no       | -                                          |                                                           |
+| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel.                 |
+| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel.                 |
-| Datasource | Supported Versions |
Dependency |
-|------------|--------------------|----------------------------------------------------------------------------------------|
-| OssFile | universal |
[Download](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-file-oss)
|
+### path [string]
-:::tip
+The target dir path is required.
-If you use spark/flink, In order to use this connector, You must ensure your
spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
+### bucket [string]
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when
you download and install SeaTunnel Engine. You can check the jar package under
${SEATUNNEL_HOME}/lib to confirm this.
+The bucket address of oss file system, for example:
`oss://tyrantlucifer-image-bed`
-We made some trade-offs in order to support more file types, so we used the
HDFS protocol for internal access to OSS and this connector need some hadoop
dependencies.
-It only supports hadoop version **2.9.X+**.
+### access_key [string]
-:::
+The access key of oss file system.
-## Data Type Mapping
+### access_secret [string]
-SeaTunnel will write the data into the file in String format according to the
SeaTunnel data type and file_format_type.
+The access secret of oss file system.
-## Options
+### endpoint [string]
+
+The endpoint of oss file system.
+
+### custom_filename [boolean]
+
+Whether to customize the filename
+
+### file_name_expression [string]
+
+Only used when `custom_filename` is `true`
+
+`file_name_expression` describes the expression for the file name that will be created under `path`. We can add the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
+`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.
-| Name | Type | Required |
Default value |
Description
[...]
-|----------------------------------|---------|----------|--------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| path | String | Yes | -
| The target dir path is required.
[...]
-| tmp_path | string | no | /tmp/seatunnel
| The result file will write to a tmp path first and then
use `mv` to submit tmp dir to target dir. Need a OSS dir.
[...]
-| bucket | String | Yes | -
| The bucket address of oss file system, for example:
`oss://tyrantlucifer-image-bed`
[...]
-| access_key | String | No | -
| The access key of oss file system.
[...]
-| access_secret | String | No | -
| The access secret of oss file system.
[...]
-| endpoint | String | Yes | -
| The endpoint of oss file system.
[...]
-| custom_filename | Boolean | No | false
| Whether you need custom the filename
[...]
-| file_name_expression | String | No | "${transactionId}"
| Only used when `custom_filename` is `true`. <br/>
`file_name_expression` describes the file expression which will be created into
the `path`. We can add the variable `${Now}` or `${uuid}` in the
`file_name_expression`, like `test_${uuid}_${Now}`, `${Now}` represents the
current time, and its format can be defined by specifying the option
`filename_time_format`. <br/>Please Note that, If [...]
-| filename_time_format | String | No | "yyyy.MM.dd"
| Please check #filename_time_format below
[...]
-| file_format_type | String | No | "csv"
| We supported as the following file types: <br/> `text`
`json` `csv` `orc` `parquet` `excel` <br/> Please Note that, The final file
name will end with the file_format's suffix, the suffix of the text file is
`txt`.
[...]
-| field_delimiter | String | No | '\001'
| The separator between columns in a row of data. Only
needed by `text` file format.
[...]
-| row_delimiter | String | No | "\n"
| The separator between rows in a file. Only needed by
`text` file format.
[...]
-| have_partition | Boolean | No | false
| Whether you need processing partitions.
[...]
-| partition_by | Array | No | -
| Only used when `have_partition` is `true`. <br/>
Partition data based on selected fields.
[...]
-| partition_dir_expression | String | No |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when `have_partition` is
`true`. <br/> If the `partition_by` is specified, we will generate the
corresponding partition directory based on the partition information, and the
final file will be placed in the partition directory. <br/> Default
`partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0`
is the first partition field and `v0` is the value of the [...]
-| is_partition_field_write_in_file | Boolean | No | false
| Only used when `have_partition` is `true`. <br/> If
`is_partition_field_write_in_file` is `true`, the partition field and the value
of it will be write into data file. <br/> For example, if you want to write a
Hive Data File, Its value should be `false`.
[...]
-| sink_columns | Array | No |
| Which columns need be written to file, default value is
all the columns get from `Transform` or `Source`. <br/> The order of the fields
determines the order in which the file is actually written.
[...]
-| is_enable_transaction | Boolean | No | true
| If `is_enable_transaction` is true, we will ensure that
data will Not be lost or duplicated when it is written to the target directory.
<br/> Please Note that, If `is_enable_transaction` is `true`, we will auto add
`${transactionId}_` in the head of the file. <br/> Only support `true` Now.
[...]
-| batch_size | Int | No | 1000000
| The maximum number of rows in a file. For SeaTunnel
Engine, the number of lines in the file is determined by `batch_size` and
`checkpoint.interval` jointly decide. If the value of `checkpoint.interval` is
large eNough, sink writer will write rows in a file until the rows in the file
larger than `batch_size`. If `checkpoint.interval` is small, the sink writer
will create a new file when [...]
-| compress_codec | String | No | None
| The compress codec of files and the details that
supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo`
`None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib`
`None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None`
<br/> Tips: excel type does Not support any compression format
[...]
-| max_rows_in_memory | Int | No | -
| When File Format is Excel,The maximum number of data
items that can be cached in the memory.
[...]
-| sheet_name | String | No | Sheet${Random
number} | Writer the sheet of the workbook
[...]
-| common-options | Config | No | -
| Sink plugin common parameters, please refer to [Sink
Common Options](common-options.md) for details.
[...]
+Please note that, if `is_enable_transaction` is `true`, we will automatically add `${transactionId}_` at the head of the file name.
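As a hedged sketch of how the filename options above combine (bucket, credentials and endpoint below are placeholders, not values from this commit):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"              # placeholder
    access_key = "xxxxxxxxxxxxxxxxx"             # placeholder
    access_secret = "xxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com"     # placeholder
    file_format_type = "text"
    custom_filename = true
    # yields names like test_<uuid>_2023.12.19.txt
    file_name_expression = "test_${uuid}_${now}"
    filename_time_format = "yyyy.MM.dd"
  }
}
```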
### filename_time_format [String]
@@ -93,7 +157,90 @@ When the format in the `file_name_expression` parameter is
`xxxx-${Now}` , `file
| m | Minute in hour |
| s | Second in minute |
-## How to Create a Oss Data Synchronization Jobs
+### file_format_type [string]
+
+We support the following file types:
+
+`text` `json` `csv` `orc` `parquet` `excel`
+
+Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`.
+
+### field_delimiter [string]
+
+The separator between columns in a row of data. Only needed by `text` file
format.
+
+### row_delimiter [string]
+
+The separator between rows in a file. Only needed by `text` file format.
+
+### have_partition [boolean]
+
+Whether to process partitions.
+
+### partition_by [array]
+
+Only used when `have_partition` is `true`.
+
+Partition data based on selected fields.
+
+### partition_dir_expression [string]
+
+Only used when `have_partition` is `true`.
+
+If the `partition_by` is specified, we will generate the corresponding
partition directory based on the partition information, and the final file will
be placed in the partition directory.
+
+Default `partition_dir_expression` is
`${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field
and `v0` is the value of the first partition field.
+
+### is_partition_field_write_in_file [boolean]
+
+Only used when `have_partition` is `true`.
+
+If `is_partition_field_write_in_file` is `true`, the partition field and its value will be written into the data file.
+
+For example, if you want to write a Hive data file, its value should be `false`.
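A hedged sketch of the partition options above (the field name `age` and all connection values are illustrative, not from this commit):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"           # placeholder
    access_key = "xxxxxxxxxxxxxxxxx"          # placeholder
    access_secret = "xxxxxxxxxxxxxxxxx"       # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com"  # placeholder
    file_format_type = "orc"
    have_partition = true
    partition_by = ["age"]
    partition_dir_expression = "${k0}=${v0}"
    # false keeps the partition column out of the data file, e.g. for Hive data files
    is_partition_field_write_in_file = false
  }
}
```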
+
+### sink_columns [array]
+
+Which columns need to be written to the file; the default is all the columns obtained from `Transform` or `Source`.
+The order of the fields determines the order in which the file is actually written.
+
+### is_enable_transaction [boolean]
+
+If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory.
+
+Please note that, if `is_enable_transaction` is `true`, we will automatically add `${transactionId}_` at the head of the file name.
+
+Only `true` is supported for now.
+
+### batch_size [int]
+
+The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is jointly decided by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer will keep writing rows into one file until the number of rows exceeds `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file whenever a new checkpoint triggers.
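For instance (a sketch with illustrative values, not from this commit), letting `batch_size` drive file rolling requires a sufficiently large `checkpoint.interval`:

```hocon
env {
  # large enough that batch_size, not the checkpoint, usually closes the file
  checkpoint.interval = 60000
}
sink {
  OssFile {
    # ...required options (path, bucket, access_key, access_secret, endpoint)...
    batch_size = 1000000
  }
}
```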
+
+### compress_codec [string]
+
+The compress codec of files; the supported codecs are shown below:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc: `lzo` `snappy` `lz4` `zlib` `none`
+- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
+
+Tips: excel type does not support any compression format
+
+### common options
+
+Sink plugin common parameters, please refer to [Sink Common
Options](common-options.md) for details.
+
+### max_rows_in_memory [int]
+
+When the file format is Excel, the maximum number of data rows that can be cached in memory.
+
+### sheet_name [string]
+
+Write the sheet of the workbook.
+
+## How to Create an Oss Data Synchronization Job
The following example demonstrates how to create a data synchronization job
that reads data from Fake Source and writes it to the Oss:
@@ -215,6 +362,27 @@ sink {
}
```
+## Changelog
+
+### 2.2.0-beta 2022-09-26
+
+- Add OSS Sink Connector
+
+### 2.3.0-beta 2022-10-20
+
+- [BugFix] Fix the bug of incorrect path in windows environment
([2980](https://github.com/apache/seatunnel/pull/2980))
+- [BugFix] Fix filesystem get error
([3117](https://github.com/apache/seatunnel/pull/3117))
+- [BugFix] Solved the bug of can not parse '\t' as delimiter from config file
([3083](https://github.com/apache/seatunnel/pull/3083))
+
+### Next version
+
+- [BugFix] Fixed the following bugs that failed to write data to files
([3258](https://github.com/apache/seatunnel/pull/3258))
+ - When field from upstream is null it will throw NullPointerException
+ - Sink columns mapping failed
+ - When restore writer from states getting transaction directly failed
+- [Improve] Support setting batch size for every file
([3625](https://github.com/apache/seatunnel/pull/3625))
+- [Improve] Support file compress
([3899](https://github.com/apache/seatunnel/pull/3899))
+
### Tips
> 1.[SeaTunnel Deployment Document](../../start-v2/locally/deployment.md).
diff --git a/docs/en/connector-v2/source/OssFile.md
b/docs/en/connector-v2/source/OssFile.md
index 87e7e0180f..b448005b4d 100644
--- a/docs/en/connector-v2/source/OssFile.md
+++ b/docs/en/connector-v2/source/OssFile.md
@@ -8,6 +8,17 @@
> Flink<br/>
> SeaTunnel Zeta<br/>
+## Usage Dependency
+
+### For Spark/Flink Engine
+
+1. You must ensure your Spark/Flink cluster already has Hadoop integrated. The tested Hadoop version is 2.x.
+2. You must put `hadoop-aliyun-xx.jar`, `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` into the `${SEATUNNEL_HOME}/plugins/` dir. The version of the `hadoop-aliyun` jar must match the Hadoop version used by Spark/Flink, and the versions of `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` must correspond to that `hadoop-aliyun` version. E.g. `hadoop-aliyun-3.1.4.jar` depends on `aliyun-sdk-oss-3.4.1.jar` and `jdom-1.1.jar`.
+
+### For SeaTunnel Zeta Engine
+
+1. You must put `seatunnel-hadoop3-3.1.4-uber.jar`, `aliyun-sdk-oss-3.4.1.jar`, `hadoop-aliyun-3.1.4.jar` and `jdom-1.1.jar` into the `${SEATUNNEL_HOME}/lib/` dir.
+
## Key features
- [x] [batch](../../concept/connector-v2-features.md)
@@ -27,81 +38,14 @@ Read all the data in a split in a pollNext call. What
splits are read will be sa
- [x] json
- [x] excel
-## Description
-
-Read data from aliyun oss file system.
-
-## Supported DataSource Info
-
-In order to use the OssFile connector, the following dependencies are required.
-They can be downloaded via install-plugin.sh or from the Maven central
repository.
-
-| Datasource | Supported Versions |
Dependency |
-|------------|--------------------|----------------------------------------------------------------------------------------|
-| OssFile | universal |
[Download](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-file-oss)
|
-
-:::tip
-
-If you use spark/flink, In order to use this connector, You must ensure your
spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when
you download and install SeaTunnel Engine. You can check the jar package under
${SEATUNNEL_HOME}/lib to confirm this.
-
-We made some trade-offs in order to support more file types, so we used the
HDFS protocol for internal access to OSS and this connector need some hadoop
dependencies.
-It only supports hadoop version **2.9.X+**.
-
-:::
-
## Data Type Mapping
-The File does not have a specific type list, and we can indicate which
SeaTunenl data type the corresponding data needs to be converted to by
specifying the Schema in the config.
-
-| SeaTunnel Data type |
-|---------------------|
-| STRING |
-| SHORT |
-| INT |
-| BIGINT |
-| BOOLEAN |
-| DOUBLE |
-| DECIMAL |
-| FLOAT |
-| DATE |
-| TIME |
-| TIMESTAMP |
-| BYTES |
-| ARRAY |
-| MAP |
-
-## Source Options
-
-| Name | Type | Required | default value |
Description
|
-|---------------------------|---------|----------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| path | String | Yes | - | The
source file path.
|
-| file_format_type | String | Yes | - |
Please check #file_format_type below
|
-| bucket | String | Yes | - | The
bucket address of oss file system, for example: `oss://tyrantlucifer-image-bed`
|
-| endpoint | String | Yes | - | The
endpoint of oss file system.
|
-| read_columns | List | No | - | The
read column list of the data source, user can use it to implement field
projection. <br/> The file type supported column projection as the following
shown: <br/> - text <br/> - json <br/> - csv <br/> - orc <br/> - parquet <br/>
- excel <br/> **Tips: If the user wants to use this feature when reading `text`
`json` `csv` files, the schema option must be configured** |
-| access_key | String | No | - | The
access key of oss file system.
|
-| access_secret | String | No | - | The
access secret of oss file system.
|
-| file_filter_pattern | String | No | - |
Filter pattern, which used for filtering files.
|
-| delimiter/field_delimiter | String | No | \001 |
**delimiter** parameter will deprecate after version 2.3.5, please use
**field_delimiter** instead. <br/> Field delimiter, used to tell connector how
to slice and dice fields when reading text files. <br/> Default `\001`, the
same as hive's default delimiter
|
-| parse_partition_from_path | Boolean | No | true |
Control whether parse the partition keys and values from file path <br/> For
example if you read a file from path
`oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26` <br/>
Every record data from file will be added these two fields: <br/> name
age <br/> tyrantlucifer 26 <br/> Tips: **Do not define partition fields
in schema option** |
-| date_format | String | No | yyyy-MM-dd | Date
type format, used to tell connector how to convert string to date, supported as
the following formats: <br/> `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd` <br/>
default `yyyy-MM-dd`
|
-| datetime_format | String | No | yyyy-MM-dd HH:mm:ss |
Datetime type format, used to tell connector how to convert string to datetime,
supported as the following formats: <br/> `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd
HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` <br/> default `yyyy-MM-dd
HH:mm:ss`
|
-| time_format | String | No | HH:mm:ss | Time
type format, used to tell connector how to convert string to time, supported as
the following formats: <br/> `HH:mm:ss` `HH:mm:ss.SSS` <br/> default `HH:mm:ss`
|
-| skip_header_row_number | Long | No | 0 | Skip
the first few lines, but only for the txt and csv. <br/> For example, set like
following: <br/> `skip_header_row_number = 2` <br/> then SeaTunnel will skip
the first 2 lines from source files
|
-| sheet_name | String | No | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
-| schema | Config | No | - |
Please check #schema below
|
-| file_filter_pattern | string | no | - |
Filter pattern, which used for filtering files.
|
-| compress_codec | string | no | none | The
compress codec of files and the details that supported as the following shown:
<br/> - txt: `lzo` `none` <br/> - json: `lzo` `none` <br/> - csv: `lzo` `none`
<br/> - orc/parquet: automatically recognizes the compression type, no
additional settings required.
|
-| common-options | | No | - |
Source plugin common parameters, please refer to [Source Common
Options](common-options.md) for details.
|
-
-### file_format_type [string]
-
-File type, supported as the following file types:
+Data type mapping depends on the type of file being read. We support the following file types:
`text` `csv` `parquet` `orc` `json` `excel`
+### JSON File Type
+
If you assign file type to `json`, you should also assign schema option to
tell connector how to parse data to the row you want.
For example:
@@ -143,7 +87,7 @@ connector will generate data as the following:
|------|-------------|---------|
| 200 | get success | true |
-If you assign file type to `parquet` `orc`, schema option not required,
connector can find the schema of upstream data automatically.
+### Text Or CSV File Type
If you assign file type to `text` `csv`, you can choose to specify the schema
information or not.
@@ -184,6 +128,101 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |
+### Orc File Type
+
+If you assign the file type to `parquet` or `orc`, the schema option is not required; the connector can find the schema of the upstream data automatically.
+
+| Orc Data type                    | SeaTunnel Data type                                            |
+|----------------------------------|----------------------------------------------------------------|
+| BOOLEAN                          | BOOLEAN                                                        |
+| INT                              | INT                                                            |
+| BYTE                             | BYTE                                                           |
+| SHORT                            | SHORT                                                          |
+| LONG                             | LONG                                                           |
+| FLOAT                            | FLOAT                                                          |
+| DOUBLE                           | DOUBLE                                                         |
+| BINARY                           | BINARY                                                         |
+| STRING<br/>VARCHAR<br/>CHAR<br/> | STRING                                                         |
+| DATE                             | LOCAL_DATE_TYPE                                                |
+| TIMESTAMP                        | LOCAL_DATE_TIME_TYPE                                           |
+| DECIMAL                          | DECIMAL                                                        |
+| LIST(STRING)                     | STRING_ARRAY_TYPE                                              |
+| LIST(BOOLEAN)                    | BOOLEAN_ARRAY_TYPE                                             |
+| LIST(TINYINT)                    | BYTE_ARRAY_TYPE                                                |
+| LIST(SMALLINT)                   | SHORT_ARRAY_TYPE                                               |
+| LIST(INT)                        | INT_ARRAY_TYPE                                                 |
+| LIST(BIGINT)                     | LONG_ARRAY_TYPE                                                |
+| LIST(FLOAT)                      | FLOAT_ARRAY_TYPE                                               |
+| LIST(DOUBLE)                     | DOUBLE_ARRAY_TYPE                                              |
+| Map<K,V>                         | MapType, This type of K and V will transform to SeaTunnel type |
+| STRUCT                           | SeaTunnelRowType                                               |
+
+### Parquet File Type
+
+If you assign file type to `parquet` `orc`, schema option not required,
connector can find the schema of upstream data automatically.
+
+| Parquet Data type    | SeaTunnel Data type                                            |
+|----------------------|----------------------------------------------------------------|
+| INT_8                | BYTE                                                           |
+| INT_16               | SHORT                                                          |
+| DATE                 | DATE                                                           |
+| TIMESTAMP_MILLIS     | TIMESTAMP                                                      |
+| INT64                | LONG                                                           |
+| INT96                | TIMESTAMP                                                      |
+| BINARY               | BYTES                                                          |
+| FLOAT                | FLOAT                                                          |
+| DOUBLE               | DOUBLE                                                         |
+| BOOLEAN              | BOOLEAN                                                        |
+| FIXED_LEN_BYTE_ARRAY | TIMESTAMP<br/> DECIMAL                                         |
+| DECIMAL              | DECIMAL                                                        |
+| LIST(STRING)         | STRING_ARRAY_TYPE                                              |
+| LIST(BOOLEAN)        | BOOLEAN_ARRAY_TYPE                                             |
+| LIST(TINYINT)        | BYTE_ARRAY_TYPE                                                |
+| LIST(SMALLINT)       | SHORT_ARRAY_TYPE                                               |
+| LIST(INT)            | INT_ARRAY_TYPE                                                 |
+| LIST(BIGINT)         | LONG_ARRAY_TYPE                                                |
+| LIST(FLOAT)          | FLOAT_ARRAY_TYPE                                               |
+| LIST(DOUBLE)         | DOUBLE_ARRAY_TYPE                                              |
+| Map<K,V>             | MapType, This type of K and V will transform to SeaTunnel type |
+| STRUCT               | SeaTunnelRowType                                               |
+
+## Options
+
+| name                      | type    | required | default value       | Description                                               |
+|---------------------------|---------|----------|---------------------|-----------------------------------------------------------|
+| path                      | string  | yes      | -                   | The Oss path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be found under the "parse_partition_from_path" option. |
+| file_format_type          | string  | yes      | -                   | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` |
+| bucket                    | string  | yes      | -                   | The bucket address of oss file system, for example: `oss://seatunnel-test`. |
+| endpoint                  | string  | yes      | -                   | fs oss endpoint                                           |
+| read_columns              | list    | no       | -                   | The read column list of the data source; the user can use it to implement field projection. The file types supporting column projection are: `text` `csv` `parquet` `orc` `json` `excel`. If the user wants to use this feature when reading `text` `json` `csv` files, the "schema" option must be configured. |
+| access_key                | string  | no       | -                   |                                                           |
+| access_secret             | string  | no       | -                   |                                                           |
+| delimiter                 | string  | no       | \001                | Field delimiter, used to tell the connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. |
+| parse_partition_from_path | boolean | no       | true                | Control whether to parse the partition keys and values from the file path. For example, if you read a file from path `oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added: name="tyrantlucifer", age=26 |
+| date_format               | string  | no       | yyyy-MM-dd          | Date type format, used to tell the connector how to convert a string to a date; supported formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd` |
+| datetime_format           | string  | no       | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell the connector how to convert a string to a datetime; supported formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` |
+| time_format               | string  | no       | HH:mm:ss            | Time type format, used to tell the connector how to convert a string to a time; supported formats: `HH:mm:ss` `HH:mm:ss.SSS` |
+| skip_header_row_number    | long    | no       | 0                   | Skip the first few lines, but only for txt and csv. For example, set `skip_header_row_number = 2`; then SeaTunnel will skip the first 2 lines of the source files. |
+| schema                    | config  | no       | -                   | The schema of the upstream data.                          |
+| sheet_name                | string  | no       | -                   | Read the sheet of the workbook. Only used when file_format is excel. |
+| compress_codec            | string  | no       | none                | Which compress codec the files use.                       |
+| file_filter_pattern       | string  | no       |                     | `*.txt` means you only read files ending with `.txt`      |
+| common-options            | config  | no       | -                   | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |
+
+### compress_codec [string]
+
+The compress codec of files; the supported codecs are shown below:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:
+ automatically recognizes the compression type, no additional settings
required.
+
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
### schema [config]
Only needs to be configured when file_format_type is text, json, excel or csv (or another format where the schema cannot be read from metadata).
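A minimal sketch of the `schema` option for such files (the field names below are illustrative, not from this commit):

```hocon
schema {
  fields {
    name = string
    age = int
    gender = string
  }
}
```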
@@ -254,6 +293,18 @@ sink {
}
```
+## Changelog
+
+### 2.2.0-beta 2022-09-26
+
+- Add OSS File Source Connector
+
+### 2.3.0-beta 2022-10-20
+
+- [BugFix] Fix the bug of incorrect path in windows environment
([2980](https://github.com/apache/seatunnel/pull/2980))
+- [Improve] Support extract partition from SeaTunnelRow fields
([3085](https://github.com/apache/seatunnel/pull/3085))
+- [Improve] Support parse field from file path
([2985](https://github.com/apache/seatunnel/pull/2985))
+
### Tips
> 1.[SeaTunnel Deployment Document](../../start-v2/locally/deployment.md).
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
b/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
index b25ce135f2..c904257b90 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
+++ b/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
@@ -30,7 +30,9 @@
<name>SeaTunnel : Connectors V2 : File : Oss</name>
<properties>
- <hadoop-aliyun.version>2.9.2</hadoop-aliyun.version>
+ <aliyun.sdk.oss.version>3.4.1</aliyun.sdk.oss.version>
+ <hadoop-aliyun.version>3.1.4</hadoop-aliyun.version>
+ <jdom.version>1.1</jdom.version>
</properties>
<dependencies>
@@ -41,8 +43,10 @@
<version>${project.version}</version>
</dependency>
<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop-2</artifactId>
+ <groupId>org.apache.seatunnel</groupId>
+ <artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>
+ <version>${project.version}</version>
+ <classifier>optional</classifier>
<scope>provided</scope>
<exclusions>
<exclusion>
@@ -51,10 +55,23 @@
</exclusion>
</exclusions>
</dependency>
+ <dependency>
+ <groupId>com.aliyun.oss</groupId>
+ <artifactId>aliyun-sdk-oss</artifactId>
+ <version>${aliyun.sdk.oss.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.jdom</groupId>
+ <artifactId>jdom</artifactId>
+ <version>${jdom.version}</version>
+ <scope>provided</scope>
+ </dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aliyun</artifactId>
<version>${hadoop-aliyun.version}</version>
+ <scope>provided</scope>
</dependency>
</dependencies>
diff --git a/seatunnel-dist/pom.xml b/seatunnel-dist/pom.xml
index 2dcde33436..dac8d6978b 100644
--- a/seatunnel-dist/pom.xml
+++ b/seatunnel-dist/pom.xml
@@ -105,6 +105,10 @@
<hadoop-aws.version>3.1.4</hadoop-aws.version>
<aws-java-sdk.version>1.11.271</aws-java-sdk.version>
<netty-buffer.version>4.1.89.Final</netty-buffer.version>
+ <hive.exec.version>2.3.9</hive.exec.version>
+ <hive.jdbc.version>3.1.3</hive.jdbc.version>
+ <aliyun.sdk.oss.version>3.4.1</aliyun.sdk.oss.version>
+ <jdom.version>1.1</jdom.version>
</properties>
<dependencies>
<!-- starters -->
@@ -638,14 +642,12 @@
<version>${hadoop-aliyun.version}</version>
<scope>provided</scope>
</dependency>
-
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop-aws.version}</version>
<scope>provided</scope>
</dependency>
-
<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>
@@ -653,6 +655,24 @@
<classifier>optional</classifier>
<scope>provided</scope>
</dependency>
+ <dependency>
+ <groupId>com.aliyun.oss</groupId>
+ <artifactId>aliyun-sdk-oss</artifactId>
+ <version>${aliyun.sdk.oss.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.jdom</groupId>
+ <artifactId>jdom</artifactId>
+ <version>${jdom.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-aliyun</artifactId>
+ <version>${hadoop-aliyun.version}</version>
+ <scope>provided</scope>
+ </dependency>
<!-- hadoop jar end -->
</dependencies>
<repositories>
@@ -700,6 +720,11 @@
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-spark-3-starter</artifactId>
<version>${project.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>com.amazonaws</groupId>
+ <artifactId>aws-java-sdk-bundle</artifactId>
+ <version>1.11.271</version>
<scope>provided</scope>
</dependency>
<!-- seatunnel connectors for demo -->
@@ -724,7 +749,6 @@
<scope>provided</scope>
</dependency>
- <!-- hadoop jar -->
<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>