This is an automated email from the ASF dual-hosted git repository.
liugddx pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new de9a3243c8 [Docs][Connector-V2][HDFS]Refactor connector-v2 docs using unified format HDFS. (#4871)
de9a3243c8 is described below
commit de9a3243c89dbd1cc1d3c477b25745a8dbc3c2c7
Author: lightzhao <[email protected]>
AuthorDate: Mon Aug 14 17:08:00 2023 +0800
[Docs][Connector-V2][HDFS]Refactor connector-v2 docs using unified format HDFS. (#4871)
* Refactor connector-v2 docs using unified format HDFS.
* add data type.
* update.
* add key feature.
* add hdfs_site_path
* 1.add data type.
2.add hdfs_site_path conf.
* add data type.
* add hdfs site conf.
---------
Co-authored-by: lightzhao <[email protected]>
Co-authored-by: liuli <[email protected]>
---
docs/en/connector-v2/sink/HdfsFile.md | 326 ++++++++++++--------------------
docs/en/connector-v2/source/HdfsFile.md | 307 ++++++------------------------
2 files changed, 185 insertions(+), 448 deletions(-)
diff --git a/docs/en/connector-v2/sink/HdfsFile.md b/docs/en/connector-v2/sink/HdfsFile.md
index 34ce19714b..135c5115c2 100644
--- a/docs/en/connector-v2/sink/HdfsFile.md
+++ b/docs/en/connector-v2/sink/HdfsFile.md
@@ -1,20 +1,14 @@
# HdfsFile
-> HDFS file sink connector
+> HDFS File Sink Connector
-## Description
-
-Output data to hdfs file
-
-:::tip
-
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
-
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
+## Support Those Engines
-:::
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-## Key features
+## Key Features
- [x] [exactly-once](../../concept/connector-v2-features.md)
@@ -30,183 +24,120 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] compress codec
- [x] lzo
-## Options
-
-| name                             | type    | required | default value                              | remarks                                                   |
-|----------------------------------|---------|----------|--------------------------------------------|-----------------------------------------------------------|
-| fs.defaultFS                     | string  | yes      | -                                          |                                                           |
-| path                             | string  | yes      | -                                          |                                                           |
-| hdfs_site_path                   | string  | no       | -                                          |                                                           |
-| custom_filename                  | boolean | no       | false                                      | Whether you need custom the filename                      |
-| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when custom_filename is true                    |
-| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when custom_filename is true                    |
-| file_format_type                 | string  | no       | "csv"                                      |                                                           |
-| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text                   |
-| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text                   |
-| have_partition                   | boolean | no       | false                                      | Whether you need processing partitions.                   |
-| partition_by                     | array   | no       | -                                          | Only used then have_partition is true                     |
-| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used then have_partition is true                     |
-| is_partition_field_write_in_file | boolean | no       | false                                      | Only used then have_partition is true                     |
-| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns |
-| is_enable_transaction            | boolean | no       | true                                       |                                                           |
-| batch_size                       | int     | no       | 1000000                                    |                                                           |
-| compress_codec                   | string  | no       | none                                       |                                                           |
-| kerberos_principal               | string  | no       | -                                          |                                                           |
-| kerberos_keytab_path             | string  | no       | -                                          |                                                           |
-| compress_codec                   | string  | no       | none                                       |                                                           |
-| common-options                   | object  | no       | -                                          |                                                           |
-| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel.                 |
-| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel.                 |
-
-### fs.defaultFS [string]
-
-The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster`
-
-### path [string]
-
-The target dir path is required.
-
-### hdfs_site_path [string]
-
-The path of `hdfs-site.xml`, used to load ha configuration of namenodes
-
-### custom_filename [boolean]
-
-Whether custom the filename
-
-### file_name_expression [string]
-
-Only used when `custom_filename` is `true`
-
-`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`,
-`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.
-
-Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.
-
-### filename_time_format [string]
-
-Only used when `custom_filename` is `true`
-
-When the format in the `file_name_expression` parameter is `xxxx-${now}` , `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd` . The commonly used time formats are listed as follows:
-
-| Symbol | Description |
-|--------|--------------------|
-| y | Year |
-| M | Month |
-| d | Day of month |
-| H | Hour in day (0-23) |
-| m | Minute in hour |
-| s | Second in minute |
-
-### file_format_type [string]
-
-We supported as the following file types:
-
-`text` `json` `csv` `orc` `parquet` `excel`
-
-Please note that, The final file name will end with the file_format_type's suffix, the suffix of the text file is `txt`.
-
-### field_delimiter [string]
-
-The separator between columns in a row of data. Only needed by `text` file format.
-
-### row_delimiter [string]
-
-The separator between rows in a file. Only needed by `text` file format.
-
-### have_partition [boolean]
-
-Whether you need processing partitions.
-
-### partition_by [array]
-
-Only used when `have_partition` is `true`.
-
-Partition data based on selected fields.
-
-### partition_dir_expression [string]
-
-Only used when `have_partition` is `true`.
-
-If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
-
-Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
-
-### is_partition_field_write_in_file [boolean]
-
-Only used when `have_partition` is `true`.
-
-If `is_partition_field_write_in_file` is `true`, the partition field and the value of it will be write into data file.
-
-For example, if you want to write a Hive Data File, Its value should be `false`.
-
-### sink_columns [array]
-
-Which columns need be write to file, default value is all of the columns get from `Transform` or `Source`.
-The order of the fields determines the order in which the file is actually written.
-
-### is_enable_transaction [boolean]
-
-If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory.
-
-Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.
-
-Only support `true` now.
-
-### batch_size [int]
-
-The maximum number of rows in a file. For SeaTunnel Engine, the number of lines in the file is determined by `batch_size` and `checkpoint.interval` jointly decide. If the value of `checkpoint.interval` is large enough, sink writer will write rows in a file until the rows in the file larger than `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file when a new checkpoint trigger.
-
-### compress_codec [string]
-
-The compress codec of files and the details that supported as the following shown:
-
-- txt: `lzo` `none`
-- json: `lzo` `none`
-- csv: `lzo` `none`
-- orc: `lzo` `snappy` `lz4` `zlib` `none`
-- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`
-
-Tips: excel type does not support any compression format
-
-### kerberos_principal [string]
-
-The principal of kerberos
-
-### kerberos_keytab_path [string]
-
-The keytab path of kerberos
-
-### common options
-
-Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details
+## Description
-### max_rows_in_memory [int]
+Output data to hdfs file
-When File Format is Excel,The maximum number of data items that can be cached in the memory.
+## Supported DataSource Info
+
+| Datasource | Supported Versions |
+|------------|--------------------|
+| HdfsFile | hadoop 2.x and 3.x |
+
+## Sink Options
+
+| Name                             | Type    | Required | Default                                    | Description [...]
+|----------------------------------|---------|----------|--------------------------------------------|------------------------------------------------------------------------------------------------ [...]
+| fs.defaultFS                     | string  | yes      | -                                          | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` [...]
+| path                             | string  | yes      | -                                          | The target dir path is required. [...]
+| hdfs_site_path                   | string  | no       | -                                          | The path of `hdfs-site.xml`, used to load ha configuration of namenodes [...]
+| custom_filename                  | boolean | no       | false                                      | Whether you need custom the filename [...]
+| file_name_expression             | string  | no       | "${transactionId}"                         | Only used when `custom_filename` is `true`. `file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`. `${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`. Please note that, If `is_enable_tr [...]
+| filename_time_format             | string  | no       | "yyyy.MM.dd"                               | Only used when `custom_filename` is `true`. When the format in the `file_name_expression` parameter is `xxxx-${now}`, `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd`. The commonly used time formats are listed as follows: [y:Year, M:Month, d:Day of month, H:Hour in day (0-23), m:Minute in hour, s:Second in minute] [...]
+| file_format_type                 | string  | no       | "csv"                                      | We support the following file types: `text` `json` `csv` `orc` `parquet` `excel`. Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`. [...]
+| field_delimiter                  | string  | no       | '\001'                                     | Only used when file_format_type is text. The separator between columns in a row of data. [...]
+| row_delimiter                    | string  | no       | "\n"                                       | Only used when file_format_type is text. The separator between rows in a file. [...]
+| have_partition                   | boolean | no       | false                                      | Whether you need processing partitions. [...]
+| partition_by                     | array   | no       | -                                          | Only used when have_partition is true. Partition data based on selected fields. [...]
+| partition_dir_expression         | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true. If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory. Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition f [...]
+| is_partition_field_write_in_file | boolean | no       | false                                      | Only used when `have_partition` is `true`. If `is_partition_field_write_in_file` is `true`, the partition field and the value of it will be written into the data file. For example, if you want to write a Hive Data File, its value should be `false`. [...]
+| sink_columns                     | array   | no       |                                            | When this parameter is empty, all fields are sink columns. Which columns need to be written to the file; the default value is all of the columns got from `Transform` or `Source`. The order of the fields determines the order in which the file is actually written. [...]
+| is_enable_transaction            | boolean | no       | true                                       | If `is_enable_transaction` is true, we will ensure that data will not be lost or duplicated when it is written to the target directory. Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file. Only support `true` now. [...]
+| batch_size                       | int     | no       | 1000000                                    | The maximum number of rows in a file. For SeaTunnel Engine, the number of lines in the file is determined by `batch_size` and `checkpoint.interval` jointly. If the value of `checkpoint.interval` is large enough, the sink writer will write rows into a file until the rows in the file are larger than `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file when [...]
+| compress_codec                   | string  | no       | none                                       | The compress codec of files, supported as the following shown: [txt: `lzo` `none`, json: `lzo` `none`, csv: `lzo` `none`, orc: `lzo` `snappy` `lz4` `zlib` `none`, parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`]. Tips: the excel type does not support any compression format. [...]
+| kerberos_principal               | string  | no       | -                                          | The principal of kerberos [...]
+| kerberos_keytab_path             | string  | no       | -                                          | The keytab path of kerberos [...]
+| compress_codec                   | string  | no       | none                                       | compress codec [...]
+| common-options                   | object  | no       | -                                          | Sink plugin common parameters, please refer to [Sink Common Options](common-options.md) for details [...]
+| max_rows_in_memory               | int     | no       | -                                          | Only used when file_format_type is excel. The maximum number of data items that can be cached in the memory. [...]
+| sheet_name                       | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel. Write the sheet of the workbook. [...]
+
+### Tips
+
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster is already integrated with Hadoop. The tested Hadoop version is 2.x. If you use SeaTunnel Engine, it automatically integrates the Hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
+
+## Task Example
+
+### Simple:
+
+> This example defines a SeaTunnel synchronization task that automatically generates data through FakeSource and sends it to Hdfs.
-### sheet_name [string]
+```
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
+}
-Writer the sheet of the workbook
+source {
+  # This is an example source plugin **only for test and demonstrate the feature source plugin**
+ FakeSource {
+ parallelism = 1
+ result_table_name = "fake"
+ row.num = 16
+ schema = {
+ fields {
+ c_map = "map<string, smallint>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
+  # please go to https://seatunnel.apache.org/docs/category/source-v2
+}
-## Example
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+  # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
-For orc file format simple config
+sink {
+ HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+    file_format_type = "orc"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
+  # please go to https://seatunnel.apache.org/docs/category/sink-v2
+}
+```
-```bash
+### For orc file format simple config
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
   file_format_type = "orc"
}
-
```
-For text file format with `have_partition` and `custom_filename` and `sink_columns`
-
-```bash
+### For text file format with `have_partition` and `custom_filename` and `sink_columns`
+```
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
@@ -223,13 +154,11 @@ HdfsFile {
sink_columns = ["name","age"]
is_enable_transaction = true
}
-
```
-For parquet file format with `have_partition` and `custom_filename` and `sink_columns`
-
-```bash
+### For parquet file format with `have_partition` and `custom_filename` and `sink_columns`
+```
HdfsFile {
fs.defaultFS = "hdfs://hadoopcluster"
path = "/tmp/hive/warehouse/test2"
@@ -244,32 +173,27 @@ HdfsFile {
sink_columns = ["name","age"]
is_enable_transaction = true
}
-
```
-## Changelog
+### For kerberos simple config
-### 2.2.0-beta 2022-09-26
-
-- Add HDFS File Sink Connector
-
-### 2.3.0-beta 2022-10-20
-
-- [BugFix] Fix the bug of incorrect path in windows environment ([2980](https://github.com/apache/seatunnel/pull/2980))
-- [BugFix] Fix filesystem get error ([3117](https://github.com/apache/seatunnel/pull/3117))
-- [BugFix] Solved the bug of can not parse '\t' as delimiter from config file ([3083](https://github.com/apache/seatunnel/pull/3083))
-
-### 2.3.0 2022-12-30
-
-- [BugFix] Fixed the following bugs that failed to write data to files ([3258](https://github.com/apache/seatunnel/pull/3258))
- - When field from upstream is null it will throw NullPointerException
- - Sink columns mapping failed
- - When restore writer from states getting transaction directly failed
+```
+HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+ hdfs_site_path = "/path/to/your/hdfs_site_path"
+ kerberos_principal = "[email protected]"
+ kerberos_keytab_path = "/path/to/your/keytab/file.keytab"
+}
+```
-### Next version
+### For compress simple config
-- [Improve] Support setting batch size for every file ([3625](https://github.com/apache/seatunnel/pull/3625))
-- [Improve] Support lzo compression for text in file format ([3782](https://github.com/apache/seatunnel/pull/3782))
-- [Improve] Support kerberos authentication ([3840](https://github.com/apache/seatunnel/pull/3840))
-- [Improve] Support file compress ([3899](https://github.com/apache/seatunnel/pull/3899))
+```
+HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+ compress_codec = "lzo"
+}
+```
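[Editor's note on the sink doc above] The new options table explains `custom_filename`, `file_name_expression`, and `filename_time_format`, but the commit's truncated examples do not show them together. A minimal sketch of how they compose, with illustrative path and expression values that are not taken from the commit:

```
HdfsFile {
  fs.defaultFS = "hdfs://hadoopcluster"
  path = "/tmp/hive/warehouse/test2"
  file_format_type = "text"
  custom_filename = true
  file_name_expression = "test_${uuid}_${now}"
  filename_time_format = "yyyy.MM.dd"
  # ${now} is rendered using filename_time_format; with is_enable_transaction = true
  # (the default), "${transactionId}_" is prefixed to the file name automatically.
}
```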
diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
index f479e40a2b..88c1e35f87 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -1,20 +1,14 @@
# HdfsFile
-> Hdfs file source connector
+> Hdfs File Source Connector
-## Description
-
-Read data from hdfs file system.
-
-:::tip
+## Support Those Engines
-If you use spark/flink, In order to use this connector, You must ensure your spark/flink cluster already integrated hadoop. The tested hadoop version is 2.x.
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
-If you use SeaTunnel Engine, It automatically integrated the hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-
-:::
-
-## Key features
+## Key Features
- [x] [batch](../../concept/connector-v2-features.md)
- [ ] [stream](../../concept/connector-v2-features.md)
@@ -33,238 +27,57 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
-## Options
-
-| name | type | required | default value |
-|---------------------------|---------|----------|---------------------|
-| path | string | yes | - |
-| file_format_type | string | yes | - |
-| fs.defaultFS | string | yes | - |
-| read_columns | list | yes | - |
-| hdfs_site_path | string | no | - |
-| delimiter | string | no | \001 |
-| parse_partition_from_path | boolean | no | true |
-| date_format | string | no | yyyy-MM-dd |
-| datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
-| time_format | string | no | HH:mm:ss |
-| kerberos_principal | string | no | - |
-| kerberos_keytab_path | string | no | - |
-| skip_header_row_number | long | no | 0 |
-| schema | config | no | - |
-| common-options | | no | - |
-| sheet_name | string | no | - |
-| file_filter_pattern | string | no | - |
-
-### path [string]
-
-The source file path.
-
-### delimiter [string]
-
-Field delimiter, used to tell connector how to slice and dice fields when reading text files
-
-default `\001`, the same as hive's default delimiter
-
-### parse_partition_from_path [boolean]
-
-Control whether parse the partition keys and values from file path
-
-For example if you read a file from path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`
-
-Every record data from file will be added these two fields:
-
-| name | age |
-|---------------|-----|
-| tyrantlucifer | 26 |
-
-Tips: **Do not define partition fields in schema option**
-
-### date_format [string]
-
-Date type format, used to tell connector how to convert string to date, supported as the following formats:
-
-`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`
-
-default `yyyy-MM-dd`
-
-### datetime_format [string]
-
-Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats:
-
-`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`
-
-default `yyyy-MM-dd HH:mm:ss`
-
-### time_format [string]
-
-Time type format, used to tell connector how to convert string to time, supported as the following formats:
-
-`HH:mm:ss` `HH:mm:ss.SSS`
-
-default `HH:mm:ss`
-
-### skip_header_row_number [long]
-
-Skip the first few lines, but only for the txt and csv.
-
-For example, set like following:
-
-`skip_header_row_number = 2`
-
-then SeaTunnel will skip the first 2 lines from source files
-
-### file_format_type [string]
-
-File type, supported as the following file types:
-
-`text` `csv` `parquet` `orc` `json` `excel`
-
-If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
-
-For example:
-
-upstream data is the following:
-
-```json
-
-{"code": 200, "data": "get success", "success": true}
-
-```
-
-You can also save multiple pieces of data in one file and split them by newline:
-
-```json lines
-
-{"code": 200, "data": "get success", "success": true}
-{"code": 300, "data": "get failed", "success": false}
-
-```
-
-you should assign schema as the following:
-
-```hocon
-
-schema {
- fields {
- code = int
- data = string
- success = boolean
- }
-}
-
-```
-
-connector will generate data as the following:
-
-| code | data | success |
-|------|-------------|---------|
-| 200 | get success | true |
-
-If you assign file type to `parquet` `orc`, schema option not required, connector can find the schema of upstream data automatically.
-
-If you assign file type to `text` `csv`, you can choose to specify the schema information or not.
+## Description
-For example, upstream data is the following:
+Read data from hdfs file system.
-```text
+## Supported DataSource Info
-tyrantlucifer#26#male
+| Datasource | Supported Versions |
+|------------|--------------------|
+| HdfsFile | hadoop 2.x and 3.x |
-```
+## Source Options
-If you do not assign data schema connector will treat the upstream data as the following:
+| Name                      | Type    | Required | Default             | Description |
+|---------------------------|---------|----------|---------------------|-------------|
+| path                      | string  | yes      | -                   | The source file path. |
+| file_format_type          | string  | yes      | -                   | We support the following file types: `text` `json` `csv` `orc` `parquet` `excel`. Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`. |
+| fs.defaultFS              | string  | yes      | -                   | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` |
+| read_columns              | list    | yes      | -                   | The read column list of the data source; the user can use it to implement field projection. The file types that support column projection are: [text, json, csv, orc, parquet, excel]. Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured. |
+| hdfs_site_path            | string  | no       | -                   | The path of `hdfs-site.xml`, used to load ha configuration of namenodes |
+| delimiter                 | string  | no       | \001                | Field delimiter, used to tell connector how to slice and dice fields when reading text files. Default `\001`, the same as hive's default delimiter. |
+| parse_partition_from_path | boolean | no       | true                | Control whether to parse the partition keys and values from the file path. For example, if you read a file from path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record from the file will be added these two fields: [name:tyrantlucifer, age:26]. Tips: Do not define partition fields in the schema option. |
+| date_format               | string  | no       | yyyy-MM-dd          | Date type format, used to tell connector how to convert string to date, supported as the following formats: `yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`. Default `yyyy-MM-dd`. |
+| datetime_format           | string  | no       | yyyy-MM-dd HH:mm:ss | Datetime type format, used to tell connector how to convert string to datetime, supported as the following formats: `yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`. Default `yyyy-MM-dd HH:mm:ss`. |
+| time_format               | string  | no       | HH:mm:ss            | Time type format, used to tell connector how to convert string to time, supported as the following formats: `HH:mm:ss` `HH:mm:ss.SSS`. Default `HH:mm:ss`. |
+| kerberos_principal        | string  | no       | -                   | The principal of kerberos |
+| kerberos_keytab_path      | string  | no       | -                   | The keytab path of kerberos |
+| skip_header_row_number    | long    | no       | 0                   | Skip the first few lines, but only for the txt and csv. For example, set `skip_header_row_number = 2`; then SeaTunnel will skip the first 2 lines from source files. |
+| schema                    | config  | no       | -                   | The schema fields of upstream data |
+| common-options            |         | no       | -                   | Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details. |
+| sheet_name                | string  | no       | -                   | Read the sheet of the workbook. Only used when file_format_type is excel. |
-| content |
-|-----------------------|
-| tyrantlucifer#26#male |
+### Tips
-If you assign data schema, you should also assign the option `delimiter` too except CSV file type
+> If you use Spark/Flink, in order to use this connector, you must ensure your Spark/Flink cluster is already integrated with Hadoop. The tested Hadoop version is 2.x. If you use SeaTunnel Engine, it automatically integrates the Hadoop jar when you download and install SeaTunnel Engine. You can check the jar package under ${SEATUNNEL_HOME}/lib to confirm this.
-you should assign schema and delimiter as the following:
+## Task Example
-```hocon
+### Simple:
-delimiter = "#"
-schema {
- fields {
- name = string
- age = int
- gender = string
- }
-}
+> This example defines a SeaTunnel synchronization task that reads data from Hdfs and sends it to Hdfs.
```
-
-connector will generate data as the following:
-
-| name | age | gender |
-|---------------|-----|--------|
-| tyrantlucifer | 26 | male |
-
-### fs.defaultFS [string]
-
-Hdfs cluster address.
-
-### hdfs_site_path [string]
-
-The path of `hdfs-site.xml`, used to load ha configuration of namenodes
-
-### kerberos_principal [string]
-
-The principal of kerberos
-
-### kerberos_keytab_path [string]
-
-The keytab path of kerberos
-
-### schema [Config]
-
-#### fields [Config]
-
-the schema fields of upstream data
-
-### read_columns [list]
-
-The read column list of the data source, user can use it to implement field projection.
-
-The file type supported column projection as the following shown:
-
-- text
-- json
-- csv
-- orc
-- parquet
-- excel
-
-**Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured**
-
-### common options
-
-Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
-
-### sheet_name [string]
-
-Reader the sheet of the workbook,Only used when file_format_type is excel.
-
-### file_filter_pattern [string]
-
-Filter pattern, which used for filtering files.
-
-## Example
-
-```hocon
-
-HdfsFile {
- path = "/apps/hive/demo/student"
- file_format_type = "parquet"
- fs.defaultFS = "hdfs://namenode001"
+# Defining the runtime environment
+env {
+ # You can set flink configuration here
+ execution.parallelism = 1
+ job.mode = "BATCH"
}
-```
-
-```hocon
-
-HdfsFile {
+source {
+ HdfsFile {
schema {
fields {
name = string
@@ -274,24 +87,24 @@ HdfsFile {
path = "/apps/hive/demo/student"
type = "json"
fs.defaultFS = "hdfs://namenode001"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of source plugins,
+  # please go to https://seatunnel.apache.org/docs/category/source-v2
}
-```
-
-## Changelog
-
-### 2.2.0-beta 2022-09-26
-
-- Add HDFS File Source Connector
-
-### 2.3.0-beta 2022-10-20
-
-- [BugFix] Fix the bug of incorrect path in windows environment ([2980](https://github.com/apache/seatunnel/pull/2980))
-- [Improve] Support extract partition from SeaTunnelRow fields ([3085](https://github.com/apache/seatunnel/pull/3085))
-- [Improve] Support parse field from file path ([2985](https://github.com/apache/seatunnel/pull/2985))
-
-### next version
+transform {
+  # If you would like to get more information about how to configure seatunnel and see full list of transform plugins,
+  # please go to https://seatunnel.apache.org/docs/category/transform-v2
+}
-- [Improve] Support skip header for csv and txt files ([3900](https://github.com/apache/seatunnel/pull/3840))
-- [Improve] Support kerberos authentication ([3840](https://github.com/apache/seatunnel/pull/3840))
+sink {
+ HdfsFile {
+ fs.defaultFS = "hdfs://hadoopcluster"
+ path = "/tmp/hive/warehouse/test2"
+    file_format_type = "orc"
+ }
+  # If you would like to get more information about how to configure seatunnel and see full list of sink plugins,
+  # please go to https://seatunnel.apache.org/docs/category/sink-v2
+}
+```
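[Editor's note on the source doc above] The new `read_columns` description says the schema option must be configured when projecting columns from `text` files, but the commit's example only covers `json`. A hedged sketch of a text-file source, reusing the `delimiter = "#"` and `name`/`age`/`gender` schema from the removed prose example; the path is illustrative:

```
source {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode001"
    path = "/apps/hive/demo/student"
    file_format_type = "text"
    delimiter = "#"
    # Partition keys such as name=.../age=... in the path are parsed into fields
    # when parse_partition_from_path = true (the default); do not also declare
    # them in the schema.
    schema {
      fields {
        name = string
        age = int
        gender = string
      }
    }
    read_columns = ["name", "age"]
  }
}
```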