This is an automated email from the ASF dual-hosted git repository.
wanghailin pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 755bc2a730 [Hotfix][Oss File Connector] fix oss connector can not run
bug (#6010)
755bc2a730 is described below
commit 755bc2a7307c5ff590c5a584e528405e88ef8aff
Author: Eric <[email protected]>
AuthorDate: Tue Dec 19 21:07:50 2023 +0800
[Hotfix][Oss File Connector] fix oss connector can not run bug (#6010)
---
docs/en/connector-v2/sink/OssFile.md | 254 +++++++++++++++++----
docs/en/connector-v2/source/OssFile.md | 193 ++++++++++------
.../connector-file/connector-file-oss/pom.xml | 23 +-
seatunnel-dist/pom.xml | 30 ++-
4 files changed, 380 insertions(+), 120 deletions(-)
diff --git a/docs/en/connector-v2/sink/OssFile.md
b/docs/en/connector-v2/sink/OssFile.md
index 3604748477..4bfe6cc9bc 100644
--- a/docs/en/connector-v2/sink/OssFile.md
+++ b/docs/en/connector-v2/sink/OssFile.md
@@ -8,6 +8,17 @@
> Flink<br/>
> SeaTunnel Zeta<br/>
+## Usage Dependency
+
+### For Spark/Flink Engine
+
+1. You must ensure your Spark/Flink cluster already has Hadoop integrated. The tested Hadoop version is 2.x.
+2. You must put `hadoop-aliyun-xx.jar`, `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` into the `${SEATUNNEL_HOME}/plugins/` dir. The version of the `hadoop-aliyun` jar must match the Hadoop version used by Spark/Flink, and the versions of `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` must correspond to that `hadoop-aliyun` version. E.g. `hadoop-aliyun-3.1.4.jar` depends on `aliyun-sdk-oss-3.4.1.jar` and `jdom-1.1.jar`.
+
+### For SeaTunnel Zeta Engine
+
+1. You must put `seatunnel-hadoop3-3.1.4-uber.jar`, `aliyun-sdk-oss-3.4.1.jar`, `hadoop-aliyun-3.1.4.jar` and `jdom-1.1.jar` into the `${SEATUNNEL_HOME}/lib/` dir.
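As a hedged sketch (the default `SEATUNNEL_HOME` path is a placeholder, not from this commit), the jar layout above can be checked with a short shell loop:

```shell
# Check that the jars listed above are present in ${SEATUNNEL_HOME}/lib/.
# /opt/seatunnel is only an illustrative default.
SEATUNNEL_HOME="${SEATUNNEL_HOME:-/opt/seatunnel}"
for jar in seatunnel-hadoop3-3.1.4-uber.jar aliyun-sdk-oss-3.4.1.jar \
           hadoop-aliyun-3.1.4.jar jdom-1.1.jar; do
  if [ -f "$SEATUNNEL_HOME/lib/$jar" ]; then
    echo "found:   $jar"
  else
    echo "missing: $jar"
  fi
done
```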
+
## Key features
- [x] [exactly-once](../../concept/connector-v2-features.md)
@@ -22,61 +33,114 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
-## Description
+## Data Type Mapping
-Output data to oss file system.
+If you write to the `csv` or `text` file type, all columns will be written as strings.
+
+### Orc File Type
+
+| SeaTunnel Data type | Orc Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | BYTE |
+| SMALLINT | SHORT |
+| INT | INT |
+| BIGINT | LONG |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP |
+| ROW | STRUCT |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| Map | Map |
+
+### Parquet File Type
+
+| SeaTunnel Data type | Parquet Data type |
+|----------------------|-----------------------|
+| STRING | STRING |
+| BOOLEAN | BOOLEAN |
+| TINYINT | INT_8 |
+| SMALLINT | INT_16 |
+| INT | INT32 |
+| BIGINT | INT64 |
+| FLOAT | FLOAT |
+| DOUBLE | DOUBLE |
+| DECIMAL | DECIMAL |
+| BYTES | BINARY |
+| DATE | DATE |
+| TIME <br/> TIMESTAMP | TIMESTAMP_MILLIS |
+| ROW | GroupType |
+| NULL | UNSUPPORTED DATA TYPE |
+| ARRAY | LIST |
+| Map | Map |
-## Supported DataSource Info
+## Options
-In order to use the OssFile connector, the following dependencies are required.
-They can be downloaded via install-plugin.sh or from the Maven central
repository.
+| name                             | type    | required | default value                              | remarks                                                   |
+|----------------------------------|---------|----------|--------------------------------------------|-----------------------------------------------------------|
+| path                             | string  | yes      | -                                          | The oss path to write files to.                           |
+| tmp_path                         | string  | no       | /tmp/seatunnel                             | The result file is written to a tmp path first and then moved (`mv`) to the target dir. Must be an OSS dir. |
+| bucket                           | string  | yes      | -                                          |                                                           |
+| access_key                       | string  | yes      | -                                          |                                                           |
+| access_secret                    | string  | yes      | -                                          |                                                           |
+| endpoint                         | string  | yes      | -                                          |                                                           |
+| custom_filename                  | boolean | no       | false                                      | Whether to customize the filename                         |
+| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when custom_filename is true                    |
+| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when custom_filename is true                    |
+| file_format_type                 | string  | no       | "csv"                                      |                                                           |
+| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text                   |
+| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text                   |
+| have_partition                   | boolean | no       | false                                      | Whether to process partitions.                            |
+| partition_by                     | array   | no       | -                                          | Only used when have_partition is true                     |
+| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true                     |
+| is_partition_field_write_in_file | boolean | no       | false                                      | Only used when have_partition is true                     |
+| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns |
+| is_enable_transaction            | boolean | no       | true                                       |                                                           |
+| batch_size                       | int     | no       | 1000000                                    |                                                           |
+| compress_codec                   | string  | no       | none                                       |                                                           |
+| common-options                   | object  | no       | -                                          |                                                           |
+| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel.                 |
+| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel.                 |
-| Datasource | Supported Versions |
Dependency |
-|------------|--------------------|----------------------------------------------------------------------------------------|
-| OssFile | universal |
[Download](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-file-oss)
|
+### path [string]
-:::tip
+The target dir path is required.
-If you use spark/flink, In order to use this connector, You must ensure your
spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
+### bucket [string]
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when
you download and install SeaTunnel Engine. You can check the jar package under
${SEATUNNEL_HOME}/lib to confirm this.
+The bucket address of oss file system, for example:
`oss://tyrantlucifer-image-bed`
-We made some trade-offs in order to support more file types, so we used the
HDFS protocol for internal access to OSS and this connector need some hadoop
dependencies.
-It only supports hadoop version **2.9.X+**.
+### access_key [string]
-:::
+The access key of oss file system.
-## Data Type Mapping
+### access_secret [string]
-SeaTunnel will write the data into the file in String format according to the
SeaTunnel data type and file_format_type.
+The access secret of oss file system.
-## Options
+### endpoint [string]
+
+The endpoint of oss file system.
+
+### custom_filename [boolean]
+
+Whether to customize the filename
+
+### file_name_expression [string]
+
+Only used when `custom_filename` is `true`
+
+`file_name_expression` describes the expression for the file name that will be created under `path`. We can add the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
+`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.
-| Name | Type | Required |
Default value |
Description
[...]
-|----------------------------------|---------|----------|--------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| path | String | Yes | -
| The target dir path is required.
[...]
-| tmp_path | string | no | /tmp/seatunnel
| The result file will write to a tmp path first and then
use `mv` to submit tmp dir to target dir. Need a OSS dir.
[...]
-| bucket | String | Yes | -
| The bucket address of oss file system, for example:
`oss://tyrantlucifer-image-bed`
[...]
-| access_key | String | No | -
| The access key of oss file system.
[...]
-| access_secret | String | No | -
| The access secret of oss file system.
[...]
-| endpoint | String | Yes | -
| The endpoint of oss file system.
[...]
-| custom_filename | Boolean | No | false
| Whether you need custom the filename
[...]
-| file_name_expression | String | No | "${transactionId}"
| Only used when `custom_filename` is `true`. <br/>
`file_name_expression` describes the file expression which will be created into
the `path`. We can add the variable `${Now}` or `${uuid}` in the
`file_name_expression`, like `test_${uuid}_${Now}`, `${Now}` represents the
current time, and its format can be defined by specifying the option
`filename_time_format`. <br/>Please Note that, If [...]
-| filename_time_format | String | No | "yyyy.MM.dd"
| Please check #filename_time_format below
[...]
-| file_format_type | String | No | "csv"
| We supported as the following file types: <br/> `text`
`json` `csv` `orc` `parquet` `excel` <br/> Please Note that, The final file
name will end with the file_format's suffix, the suffix of the text file is
`txt`.
[...]
-| field_delimiter | String | No | '\001'
| The separator between columns in a row of data. Only
needed by `text` file format.
[...]
-| row_delimiter | String | No | "\n"
| The separator between rows in a file. Only needed by
`text` file format.
[...]
-| have_partition | Boolean | No | false
| Whether you need processing partitions.
[...]
-| partition_by | Array | No | -
| Only used when `have_partition` is `true`. <br/>
Partition data based on selected fields.
[...]
-| partition_dir_expression | String | No |
"${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when `have_partition` is
`true`. <br/> If the `partition_by` is specified, we will generate the
corresponding partition directory based on the partition information, and the
final file will be placed in the partition directory. <br/> Default
`partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0`
is the first partition field and `v0` is the value of the [...]
-| is_partition_field_write_in_file | Boolean | No | false
| Only used when `have_partition` is `true`. <br/> If
`is_partition_field_write_in_file` is `true`, the partition field and the value
of it will be write into data file. <br/> For example, if you want to write a
Hive Data File, Its value should be `false`.
[...]
-| sink_columns | Array | No |
| Which columns need be written to file, default value is
all the columns get from `Transform` or `Source`. <br/> The order of the fields
determines the order in which the file is actually written.
[...]
-| is_enable_transaction | Boolean | No | true
| If `is_enable_transaction` is true, we will ensure that
data will Not be lost or duplicated when it is written to the target directory.
<br/> Please Note that, If `is_enable_transaction` is `true`, we will auto add
`${transactionId}_` in the head of the file. <br/> Only support `true` Now.
[...]
-| batch_size | Int | No | 1000000
| The maximum number of rows in a file. For SeaTunnel
Engine, the number of lines in the file is determined by `batch_size` and
`checkpoint.interval` jointly decide. If the value of `checkpoint.interval` is
large eNough, sink writer will write rows in a file until the rows in the file
larger than `batch_size`. If `checkpoint.interval` is small, the sink writer
will create a new file when [...]
-| compress_codec | String | No | None
| The compress codec of files and the details that
supported as the following shown: <br/> - txt: `lzo` `None` <br/> - json: `lzo`
`None` <br/> - csv: `lzo` `None` <br/> - orc: `lzo` `snappy` `lz4` `zlib`
`None` <br/> - parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `None`
<br/> Tips: excel type does Not support any compression format
[...]
-| max_rows_in_memory | Int | No | -
| When File Format is Excel,The maximum number of data
items that can be cached in the memory.
[...]
-| sheet_name | String | No | Sheet${Random
number} | Writer the sheet of the workbook
[...]
-| common-options | Config | No | -
| Sink plugin common parameters, please refer to [Sink
Common Options](common-options.md) for details.
[...]
+Please note that, if `is_enable_transaction` is `true`, we will automatically add `${transactionId}_` at the head of the file name.
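As a hedged sketch of how the filename options above combine (bucket, credentials and endpoint below are placeholders, not values from this commit):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"              # placeholder
    access_key = "xxxxxxxxxxxxxxxxx"             # placeholder
    access_secret = "xxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com"     # placeholder
    file_format_type = "text"
    custom_filename = true
    # yields names like test_<uuid>_2023.12.19.txt
    file_name_expression = "test_${uuid}_${now}"
    filename_time_format = "yyyy.MM.dd"
  }
}
```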
### filename_time_format [String]
@@ -93,7 +157,90 @@ When the format in the `file_name_expression` parameter is
`xxxx-${Now}` , `file
| m | Minute in hour |
| s | Second in minute |
-## How to Create a Oss Data Synchronization Jobs
+### file_format_type [string]
+
+We support the following file types:
+
+`text` `json` `csv` `orc` `parquet` `excel`
+
+Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`.
+
+### field_delimiter [string]
+
+The separator between columns in a row of data. Only needed by `text` file
format.
+
+### row_delimiter [string]
+
+The separator between rows in a file. Only needed by `text` file format.
+
+### have_partition [boolean]
+
+Whether to process partitions.
+
+### partition_by [array]
+
+Only used when `have_partition` is `true`.
+
+Partition data based on selected fields.
+
+### partition_dir_expression [string]
+
+Only used when `have_partition` is `true`.
+
+If the `partition_by` is specified, we will generate the corresponding
partition directory based on the partition information, and the final file will
be placed in the partition directory.
+
+Default `partition_dir_expression` is
`${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field
and `v0` is the value of the first partition field.
+
+### is_partition_field_write_in_file [boolean]
+
+Only used when `have_partition` is `true`.
+
+If `is_partition_field_write_in_file` is `true`, the partition field and its value will be written into the data file.
+
+For example, if you want to write a Hive data file, its value should be `false`.
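A hedged sketch of the partition options above (the field name `age` and all connection values are illustrative, not from this commit):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"           # placeholder
    access_key = "xxxxxxxxxxxxxxxxx"          # placeholder
    access_secret = "xxxxxxxxxxxxxxxxx"       # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com"  # placeholder
    file_format_type = "orc"
    have_partition = true
    partition_by = ["age"]
    partition_dir_expression = "${k0}=${v0}"
    # false keeps the partition column out of the data file, e.g. for Hive data files
    is_partition_field_write_in_file = false
  }
}
```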
+
+### sink_columns [array]
+
+Which columns need to be written to the file; the default is all the columns obtained from `Transform` or `Source`.
+The order of the fields determines the order in which the file is actually written.
+
+### is_enable_transaction [boolean]
+
+If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory.
+
+Please note that, if `is_enable_transaction` is `true`, we will automatically add `${transactionId}_` at the head of the file name.
+
+Only `true` is supported for now.
+
+### batch_size [int]
+
+The maximum number of rows in a file. For SeaTunnel Engine, the number of rows in a file is jointly decided by `batch_size` and `checkpoint.interval`. If `checkpoint.interval` is large enough, the sink writer will keep writing rows into one file until the number of rows exceeds `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file whenever a new checkpoint triggers.
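For instance (a sketch with illustrative values, not from this commit), letting `batch_size` drive file rolling requires a sufficiently large `checkpoint.interval`:

```hocon
env {
  # large enough that batch_size, not the checkpoint, usually closes the file
  checkpoint.interval = 60000
}
sink {
  OssFile {
    # ...required options (path, bucket, access_key, access_secret, endpoint)...
    batch_size = 1000000
  }
}
```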
+
+### compress_codec [string]
+
+The compress codec of files; the supported codecs are shown below:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc: `lzo` `snappy` `lz4` `zlib` `none`
+- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
+
+Tips: excel type does not support any compression format
+
+### common options
+
+Sink plugin common parameters, please refer to [Sink Common
Options](common-options.md) for details.
+
+### max_rows_in_memory [int]
+
+When the file format is Excel, the maximum number of data rows that can be cached in memory.
+
+### sheet_name [string]
+
+Write the sheet of the workbook.
+
+## How to Create an Oss Data Synchronization Job
The following example demonstrates how to create a data synchronization job
that reads data from Fake Source and writes it to the Oss:
@@ -215,6 +362,27 @@ sink {
}
```
+## Changelog
+
+### 2.2.0-beta 2022-09-26
+
+- Add OSS Sink Connector
+
+### 2.3.0-beta 2022-10-20
+
+- [BugFix] Fix the bug of incorrect path in windows environment
([2980](https://github.com/apache/seatunnel/pull/2980))
+- [BugFix] Fix filesystem get error
([3117](https://github.com/apache/seatunnel/pull/3117))
+- [BugFix] Solved the bug of can not parse '\t' as delimiter from config file
([3083](https://github.com/apache/seatunnel/pull/3083))
+
+### Next version
+
+- [BugFix] Fixed the following bugs that failed to write data to files
([3258](https://github.com/apache/seatunnel/pull/3258))
+ - When field from upstream is null it will throw NullPointerException
+ - Sink columns mapping failed
+ - When restore writer from states getting transaction directly failed
+- [Improve] Support setting batch size for every file
([3625](https://github.com/apache/seatunnel/pull/3625))
+- [Improve] Support file compress
([3899](https://github.com/apache/seatunnel/pull/3899))
+
### Tips
> 1.[SeaTunnel Deployment Document](../../start-v2/locally/deployment.md).
diff --git a/docs/en/connector-v2/source/OssFile.md
b/docs/en/connector-v2/source/OssFile.md
index 87e7e0180f..b448005b4d 100644
--- a/docs/en/connector-v2/source/OssFile.md
+++ b/docs/en/connector-v2/source/OssFile.md
@@ -8,6 +8,17 @@
> Flink<br/>
> SeaTunnel Zeta<br/>
+## Usage Dependency
+
+### For Spark/Flink Engine
+
+1. You must ensure your Spark/Flink cluster already has Hadoop integrated. The tested Hadoop version is 2.x.
+2. You must put `hadoop-aliyun-xx.jar`, `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` into the `${SEATUNNEL_HOME}/plugins/` dir. The version of the `hadoop-aliyun` jar must match the Hadoop version used by Spark/Flink, and the versions of `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` must correspond to that `hadoop-aliyun` version. E.g. `hadoop-aliyun-3.1.4.jar` depends on `aliyun-sdk-oss-3.4.1.jar` and `jdom-1.1.jar`.
+
+### For SeaTunnel Zeta Engine
+
+1. You must put `seatunnel-hadoop3-3.1.4-uber.jar`, `aliyun-sdk-oss-3.4.1.jar`, `hadoop-aliyun-3.1.4.jar` and `jdom-1.1.jar` into the `${SEATUNNEL_HOME}/lib/` dir.
+
## Key features
- [x] [batch](../../concept/connector-v2-features.md)
@@ -27,81 +38,14 @@ Read all the data in a split in a pollNext call. What
splits are read will be sa
- [x] json
- [x] excel
-## Description
-
-Read data from aliyun oss file system.
-
-## Supported DataSource Info
-
-In order to use the OssFile connector, the following dependencies are required.
-They can be downloaded via install-plugin.sh or from the Maven central
repository.
-
-| Datasource | Supported Versions |
Dependency |
-|------------|--------------------|----------------------------------------------------------------------------------------|
-| OssFile | universal |
[Download](https://mvnrepository.com/artifact/org.apache.seatunnel/connector-file-oss)
|
-
-:::tip
-
-If you use spark/flink, In order to use this connector, You must ensure your
spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when
you download and install SeaTunnel Engine. You can check the jar package under
${SEATUNNEL_HOME}/lib to confirm this.
-
-We made some trade-offs in order to support more file types, so we used the
HDFS protocol for internal access to OSS and this connector need some hadoop
dependencies.
-It only supports hadoop version **2.9.X+**.
-
-:::
-
## Data Type Mapping
-The File does not have a specific type list, and we can indicate which
SeaTunenl data type the corresponding data needs to be converted to by
specifying the Schema in the config.
-
-| SeaTunnel Data type |
-|---------------------|
-| STRING |
-| SHORT |
-| INT |
-| BIGINT |
-| BOOLEAN |
-| DOUBLE |
-| DECIMAL |
-| FLOAT |
-| DATE |
-| TIME |
-| TIMESTAMP |
-| BYTES |
-| ARRAY |
-| MAP |
-
-## Source Options
-
-| Name | Type | Required | default value |
Description
|
-|---------------------------|---------|----------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| path | String | Yes | - | The
source file path.
|
-| file_format_type | String | Yes | - |
Please check #file_format_type below
|
-| bucket | String | Yes | - | The
bucket address of oss file system, for example: `oss://tyrantlucifer-image-bed`
|
-| endpoint | String | Yes | - | The
endpoint of oss file system.
|
-| read_columns | List | No | - | The
read column list of the data source, user can use it to implement field
projection. <br/> The file type supported column projection as the following
shown: <br/> - text <br/> - json <br/> - csv <br/> - orc <br/> - parquet <br/>
- excel <br/> **Tips: If the user wants to use this feature when reading `text`
`json` `csv` files, the schema option must be configured** |
-| access_key | String | No | - | The
access key of oss file system.
|
-| access_secret | String | No | - | The
access secret of oss file system.
|
-| file_filter_pattern | String | No | - |
Filter pattern, which used for filtering files.
|
-| delimiter/field_delimiter | String | No | \001 |
**delimiter** parameter will deprecate after version 2.3.5, please use
**field_delimiter** instead. <br/> Field delimiter, used to tell connector how
to slice and dice fields when reading text files. <br/> Default `\001`, the
same as hive's default delimiter
|
-| parse_partition_from_path | Boolean | No | true |
Control whether parse the partition keys and values from file path <br/> For
example if you read a file from path
`oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26` <br/>
Every record data from file will be added these two fields: <br/> name
age <br/> tyrantlucifer 26 <br/> Tips: **Do not define partition fields
in schema option** |
-| date_format | String | No | yyyy-MM-dd | Date
type format, used to tell connector how to convert string to date, supported as
the following formats: <br/> `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd` <br/>
default `yyyy-MM-dd`
|
-| datetime_format | String | No | yyyy-MM-dd HH:mm:ss |
Datetime type format, used to tell connector how to convert string to datetime,
supported as the following formats: <br/> `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd
HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` <br/> default `yyyy-MM-dd
HH:mm:ss`
|
-| time_format | String | No | HH:mm:ss | Time
type format, used to tell connector how to convert string to time, supported as
the following formats: <br/> `HH:mm:ss` `HH:mm:ss.SSS` <br/> default `HH:mm:ss`
|
-| skip_header_row_number | Long | No | 0 | Skip
the first few lines, but only for the txt and csv. <br/> For example, set like
following: <br/> `skip_header_row_number = 2` <br/> then SeaTunnel will skip
the first 2 lines from source files
|
-| sheet_name | String | No | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
-| schema | Config | No | - |
Please check #schema below
|
-| file_filter_pattern | string | no | - |
Filter pattern, which used for filtering files.
|
-| compress_codec | string | no | none | The
compress codec of files and the details that supported as the following shown:
<br/> - txt: `lzo` `none` <br/> - json: `lzo` `none` <br/> - csv: `lzo` `none`
<br/> - orc/parquet: automatically recognizes the compression type, no
additional settings required.
|
-| common-options | | No | - |
Source plugin common parameters, please refer to [Source Common
Options](common-options.md) for details.
|
-
-### file_format_type [string]
-
-File type, supported as the following file types:
+Data type mapping depends on the type of file being read. We support the following file types:
`text` `csv` `parquet` `orc` `json` `excel`
+### JSON File Type
+
If you assign file type to `json`, you should also assign schema option to
tell connector how to parse data to the row you want.
For example:
@@ -143,7 +87,7 @@ connector will generate data as the following:
|------|-------------|---------|
| 200 | get success | true |
-If you assign file type to `parquet` `orc`, schema option not required,
connector can find the schema of upstream data automatically.
+### Text Or CSV File Type
If you assign file type to `text` `csv`, you can choose to specify the schema
information or not.
@@ -184,6 +128,101 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |
+### Orc File Type
+
+If you assign the file type to `parquet` or `orc`, the schema option is not required; the connector can find the schema of the upstream data automatically.
+
+| Orc Data type                    | SeaTunnel Data type                                            |
+|----------------------------------|----------------------------------------------------------------|
+| BOOLEAN                          | BOOLEAN                                                        |
+| INT                              | INT                                                            |
+| BYTE                             | BYTE                                                           |
+| SHORT                            | SHORT                                                          |
+| LONG                             | LONG                                                           |
+| FLOAT                            | FLOAT                                                          |
+| DOUBLE                           | DOUBLE                                                         |
+| BINARY                           | BINARY                                                         |
+| STRING<br/>VARCHAR<br/>CHAR<br/> | STRING                                                         |
+| DATE                             | LOCAL_DATE_TYPE                                                |
+| TIMESTAMP                        | LOCAL_DATE_TIME_TYPE                                           |
+| DECIMAL                          | DECIMAL                                                        |
+| LIST(STRING)                     | STRING_ARRAY_TYPE                                              |
+| LIST(BOOLEAN)                    | BOOLEAN_ARRAY_TYPE                                             |
+| LIST(TINYINT)                    | BYTE_ARRAY_TYPE                                                |
+| LIST(SMALLINT)                   | SHORT_ARRAY_TYPE                                               |
+| LIST(INT)                        | INT_ARRAY_TYPE                                                 |
+| LIST(BIGINT)                     | LONG_ARRAY_TYPE                                                |
+| LIST(FLOAT)                      | FLOAT_ARRAY_TYPE                                               |
+| LIST(DOUBLE)                     | DOUBLE_ARRAY_TYPE                                              |
+| Map<K,V>                         | MapType, This type of K and V will transform to SeaTunnel type |
+| STRUCT                           | SeaTunnelRowType                                               |
+
+### Parquet File Type
+
+If you assign file type to `parquet` `orc`, schema option not required,
connector can find the schema of upstream data automatically.
+
+| Parquet Data type    | SeaTunnel Data type                                            |
+|----------------------|----------------------------------------------------------------|
+| INT_8                | BYTE                                                           |
+| INT_16               | SHORT                                                          |
+| DATE                 | DATE                                                           |
+| TIMESTAMP_MILLIS     | TIMESTAMP                                                      |
+| INT64                | LONG                                                           |
+| INT96                | TIMESTAMP                                                      |
+| BINARY               | BYTES                                                          |
+| FLOAT                | FLOAT                                                          |
+| DOUBLE               | DOUBLE                                                         |
+| BOOLEAN              | BOOLEAN                                                        |
+| FIXED_LEN_BYTE_ARRAY | TIMESTAMP<br/> DECIMAL                                         |
+| DECIMAL              | DECIMAL                                                        |
+| LIST(STRING)         | STRING_ARRAY_TYPE                                              |
+| LIST(BOOLEAN)        | BOOLEAN_ARRAY_TYPE                                             |
+| LIST(TINYINT)        | BYTE_ARRAY_TYPE                                                |
+| LIST(SMALLINT)       | SHORT_ARRAY_TYPE                                               |
+| LIST(INT)            | INT_ARRAY_TYPE                                                 |
+| LIST(BIGINT)         | LONG_ARRAY_TYPE                                                |
+| LIST(FLOAT)          | FLOAT_ARRAY_TYPE                                               |
+| LIST(DOUBLE)         | DOUBLE_ARRAY_TYPE                                              |
+| Map<K,V>             | MapType, This type of K and V will transform to SeaTunnel type |
+| STRUCT               | SeaTunnelRowType                                               |
+
+## Options
+
+| name                      | type    | required | default value       | Description                                               |
+|---------------------------|---------|----------|---------------------|-----------------------------------------------------------|
+| path                      | string  | yes      | -                   | The Oss path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be found under the "parse_partition_from_path" option. |
+| file_format_type          | string  | yes      | -                   | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` |
+| bucket                    | string  | yes      | -                   | The bucket address of oss file system, for example: `oss://seatunnel-test`. |
+| endpoint                  | string  | yes      | -                   | fs oss endpoint                                           |
+| read_columns              | list    | no       | -                   | The read column list of the data source; the user can use it to implement field projection. The file types supporting column projection are: `text` `csv` `parquet` `orc` `json` `excel`. If the user wants to use this feature when reading `text` `json` `csv` files, the "schema" option must be configured. |
+| access_key                | string  | no       | -                   |                                                           |
+| access_secret             | string  | no       | -                   |                                                           |
+| delimiter                 | string  | no       | \001                | Field delimiter, used to tell the connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. |
+| parse_partition_from_path | boolean | no       | true                | Control whether to parse the partition keys and values from the file path. For example, if you read a file from path `oss://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added: name="tyrantlucifer", age=26 |
+| date_format               | string  | no       | yyyy-MM-dd          | Date type format, used to tell the connector how to convert a string to a date; supported formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd` |
+| datetime_format           | string  | no       | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell the connector how to convert a string to a datetime; supported formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss` |
+| time_format               | string  | no       | HH:mm:ss            | Time type format, used to tell the connector how to convert a string to a time; supported formats: `HH:mm:ss` `HH:mm:ss.SSS` |
+| skip_header_row_number    | long    | no       | 0                   | Skip the first few lines, but only for txt and csv. For example, set `skip_header_row_number = 2`; then SeaTunnel will skip the first 2 lines of the source files. |
+| schema                    | config  | no       | -                   | The schema of the upstream data.                          |
+| sheet_name                | string  | no       | -                   | Read the sheet of the workbook. Only used when file_format is excel. |
+| compress_codec            | string  | no       | none                | Which compress codec the files use.                       |
+| file_filter_pattern       | string  | no       |                     | `*.txt` means you only read files ending with `.txt`      |
+| common-options            | config  | no       | -                   | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |
+
+### compress_codec [string]
+
+The compress codec of files; the supported codecs are shown below:
+
+- txt: `lzo` `none`
+- json: `lzo` `none`
+- csv: `lzo` `none`
+- orc/parquet:
+ automatically recognizes the compression type, no additional settings
required.
+
+### file_filter_pattern [string]
+
+Filter pattern, which is used for filtering files.
+
### schema [config]
Only needs to be configured when file_format_type is text, json, excel or csv (or another format where the schema cannot be read from metadata).
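A minimal sketch of the `schema` option for such files (the field names below are illustrative, not from this commit):

```hocon
schema {
  fields {
    name = string
    age = int
    gender = string
  }
}
```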
@@ -254,6 +293,18 @@ sink {
}
```
+## Changelog
+
+### 2.2.0-beta 2022-09-26
+
+- Add OSS File Source Connector
+
+### 2.3.0-beta 2022-10-20
+
+- [BugFix] Fix the bug of incorrect path in windows environment
([2980](https://github.com/apache/seatunnel/pull/2980))
+- [Improve] Support extract partition from SeaTunnelRow fields
([3085](https://github.com/apache/seatunnel/pull/3085))
+- [Improve] Support parse field from file path
([2985](https://github.com/apache/seatunnel/pull/2985))
+
### Tips
> 1.[SeaTunnel Deployment Document](../../start-v2/locally/deployment.md).
diff --git a/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
b/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
index b25ce135f2..c904257b90 100644
--- a/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
+++ b/seatunnel-connectors-v2/connector-file/connector-file-oss/pom.xml
@@ -30,7 +30,9 @@
<name>SeaTunnel : Connectors V2 : File : Oss</name>
<properties>
- <hadoop-aliyun.version>2.9.2</hadoop-aliyun.version>
+ <aliyun.sdk.oss.version>3.4.1</aliyun.sdk.oss.version>
+ <hadoop-aliyun.version>3.1.4</hadoop-aliyun.version>
+ <jdom.version>1.1</jdom.version>
</properties>
<dependencies>
@@ -41,8 +43,10 @@
<version>${project.version}</version>
</dependency>
<dependency>
- <groupId>org.apache.flink</groupId>
- <artifactId>flink-shaded-hadoop-2</artifactId>
+ <groupId>org.apache.seatunnel</groupId>
+ <artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>
+ <version>${project.version}</version>
+ <classifier>optional</classifier>
<scope>provided</scope>
<exclusions>
<exclusion>
@@ -51,10 +55,23 @@
</exclusion>
</exclusions>
</dependency>
+ <dependency>
+ <groupId>com.aliyun.oss</groupId>
+ <artifactId>aliyun-sdk-oss</artifactId>
+ <version>${aliyun.sdk.oss.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.jdom</groupId>
+ <artifactId>jdom</artifactId>
+ <version>${jdom.version}</version>
+ <scope>provided</scope>
+ </dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aliyun</artifactId>
<version>${hadoop-aliyun.version}</version>
+ <scope>provided</scope>
</dependency>
</dependencies>
diff --git a/seatunnel-dist/pom.xml b/seatunnel-dist/pom.xml
index 2dcde33436..dac8d6978b 100644
--- a/seatunnel-dist/pom.xml
+++ b/seatunnel-dist/pom.xml
@@ -105,6 +105,10 @@
<hadoop-aws.version>3.1.4</hadoop-aws.version>
<aws-java-sdk.version>1.11.271</aws-java-sdk.version>
<netty-buffer.version>4.1.89.Final</netty-buffer.version>
+ <hive.exec.version>2.3.9</hive.exec.version>
+ <hive.jdbc.version>3.1.3</hive.jdbc.version>
+ <aliyun.sdk.oss.version>3.4.1</aliyun.sdk.oss.version>
+ <jdom.version>1.1</jdom.version>
</properties>
<dependencies>
<!-- starters -->
@@ -638,14 +642,12 @@
<version>${hadoop-aliyun.version}</version>
<scope>provided</scope>
</dependency>
-
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop-aws.version}</version>
<scope>provided</scope>
</dependency>
-
<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>
@@ -653,6 +655,24 @@
<classifier>optional</classifier>
<scope>provided</scope>
</dependency>
+ <dependency>
+ <groupId>com.aliyun.oss</groupId>
+ <artifactId>aliyun-sdk-oss</artifactId>
+ <version>${aliyun.sdk.oss.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.jdom</groupId>
+ <artifactId>jdom</artifactId>
+ <version>${jdom.version}</version>
+ <scope>provided</scope>
+ </dependency>
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-aliyun</artifactId>
+ <version>${hadoop-aliyun.version}</version>
+ <scope>provided</scope>
+ </dependency>
<!-- hadoop jar end -->
</dependencies>
<repositories>
@@ -700,6 +720,11 @@
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-spark-3-starter</artifactId>
<version>${project.version}</version>
+ </dependency>
+ <dependency>
+ <groupId>com.amazonaws</groupId>
+ <artifactId>aws-java-sdk-bundle</artifactId>
+ <version>1.11.271</version>
<scope>provided</scope>
</dependency>
<!-- seatunnel connectors for demo -->
@@ -724,7 +749,6 @@
<scope>provided</scope>
</dependency>
- <!-- hadoop jar -->
<dependency>
<groupId>org.apache.seatunnel</groupId>
<artifactId>seatunnel-hadoop3-3.1.4-uber</artifactId>