This is an automated email from the ASF dual-hosted git repository.
wanghailin pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new a2590e8ee4 [Improve][Doc] Add `file_filter_pattern` example to doc
(#7922)
a2590e8ee4 is described below
commit a2590e8ee4855cda351d06d2d69bd534d565a57f
Author: YOMO LEE <[email protected]>
AuthorDate: Tue Oct 29 20:24:04 2024 +0800
[Improve][Doc] Add `file_filter_pattern` example to doc (#7922)
---
docs/en/connector-v2/source/CosFile.md | 80 ++++++++++++++++++++++++++-
docs/en/connector-v2/source/FtpFile.md | 80 +++++++++++++++++++++++++++
docs/en/connector-v2/source/HdfsFile.md | 79 ++++++++++++++++++++++++++-
docs/en/connector-v2/source/LocalFile.md | 77 +++++++++++++++++++++++++-
docs/en/connector-v2/source/OssFile.md | 80 ++++++++++++++++++++++++++-
docs/en/connector-v2/source/OssJindoFile.md | 80 ++++++++++++++++++++++++++-
docs/en/connector-v2/source/S3File.md | 83 ++++++++++++++++++++++++++++-
docs/en/connector-v2/source/SftpFile.md | 82 +++++++++++++++++++++++++++-
docs/zh/connector-v2/source/HdfsFile.md | 79 ++++++++++++++++++++++++++-
9 files changed, 708 insertions(+), 12 deletions(-)
diff --git a/docs/en/connector-v2/source/CosFile.md
b/docs/en/connector-v2/source/CosFile.md
index 702439c306..15b6de0c6f 100644
--- a/docs/en/connector-v2/source/CosFile.md
+++ b/docs/en/connector-v2/source/CosFile.md
@@ -45,7 +45,7 @@ To use this connector you need put
hadoop-cos-{hadoop.version}-{version}.jar and
## Options
-| name | type | required | default value |
+| name | type | required | default value |
|---------------------------|---------|----------|---------------------|
| path | string | yes | - |
| file_format_type | string | yes | - |
@@ -64,7 +64,7 @@ To use this connector you need put
hadoop-cos-{hadoop.version}-{version}.jar and
| sheet_name | string | no | - |
| xml_row_tag | string | no | - |
| xml_use_attr_format | boolean | no | - |
-| file_filter_pattern | string | no | - |
+| file_filter_pattern | string | no | |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
@@ -275,6 +275,55 @@ Specifies Whether to process data using the tag attribute
format.
Filter pattern, which is used to filter files.
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
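The rules above can be spot-checked outside SeaTunnel. Below is a minimal Python sketch (illustrative only; SeaTunnel itself evaluates the pattern in Java on the server side) applying the Example 4 expression to the sample file list:

```python
import re

# Sample file list from the "File Structure Example" above.
files = [
    "/data/seatunnel/20241001/report.txt",
    "/data/seatunnel/20241007/abch202410.csv",
    "/data/seatunnel/20241002/abcg202410.csv",
    "/data/seatunnel/20241005/old_data.csv",
    "/data/seatunnel/20241012/logo.png",
]

# Example 4: third-level folder starts with 202410, file ends with .csv.
pattern = re.compile(r"/data/seatunnel/202410\d*/.*\.csv")

# Keep only the paths that the pattern matches in full.
matched = [f for f in files if pattern.fullmatch(f)]
print(matched)
```

This prints the three `.csv` paths listed as the result of Example 4; `report.txt` and `logo.png` are filtered out.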
### compress_codec [string]
The compress codec of files and the details that supported as the following
shown:
@@ -372,6 +421,33 @@ sink {
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ CosFile {
+ bucket = "cosn://seatunnel-test-1259587829"
+ secret_id = "xxxxxxxxxxxxxxxxxxx"
+ secret_key = "xxxxxxxxxxxxxxxxxxx"
+ region = "ap-chengdu"
+ path = "/seatunnel/read/binary/"
+ file_format_type = "binary"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
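The config comment above cites the sample file name `abcD2024.csv`. A quick Python sketch (illustrative only; SeaTunnel applies the pattern in Java) shows what `abc[DX]*.*` actually accepts: because `*` permits zero occurrences of `[DX]`, the pattern behaves like `abc.*`; a pattern such as `abc[DX].*` would be needed to require the fourth character to be D or X.

```python
import re

# Pattern from the Filter File example above.
pattern = re.compile(r"abc[DX]*.*")

# The sample file from the config comment matches.
assert pattern.fullmatch("abcD2024.csv")

# [DX]* also matches zero characters, so names without D/X after "abc" match too.
assert pattern.fullmatch("abc2024.csv")
```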
## Changelog
### next version
diff --git a/docs/en/connector-v2/source/FtpFile.md
b/docs/en/connector-v2/source/FtpFile.md
index ec02f77f9f..6d11481376 100644
--- a/docs/en/connector-v2/source/FtpFile.md
+++ b/docs/en/connector-v2/source/FtpFile.md
@@ -84,6 +84,59 @@ The target ftp password is required
The source file path.
+### file_filter_pattern [string]
+
+Filter pattern, which is used to filter files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### file_format_type [string]
File type, supported as the following file types:
@@ -400,6 +453,33 @@ sink {
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ FtpFile {
+ host = "192.168.31.48"
+ port = 21
+ user = tyrantlucifer
+ password = tianchao
+ path = "/seatunnel/read/binary/"
+ file_format_type = "binary"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
## Changelog
### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/HdfsFile.md
b/docs/en/connector-v2/source/HdfsFile.md
index 7413c0428b..405dfff820 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -41,7 +41,7 @@ Read data from hdfs file system.
## Source Options
-| Name | Type | Required | Default |
Description
|
+| Name | Type | Required | Default |
Description
|
|---------------------------|---------|----------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path | string | yes | - | The
source file path.
|
| file_format_type | string | yes | - | We
supported as the following file types:`text` `csv` `parquet` `orc` `json`
`excel` `xml` `binary`.Please note that, The final file name will end with the
file_format's suffix, the suffix of the text file is `txt`.
|
@@ -62,6 +62,7 @@ Read data from hdfs file system.
| sheet_name | string | no | - |
Reader the sheet of the workbook,Only used when file_format is excel.
|
| xml_row_tag | string | no | - |
Specifies the tag name of the data rows within the XML file, only used when
file_format is xml.
|
| xml_use_attr_format | boolean | no | - |
Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
+| file_filter_pattern | string | no | |
Filter pattern, which is used to filter files.
|
| compress_codec | string | no | none | The
compress codec of files
|
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
|
@@ -71,6 +72,59 @@ Read data from hdfs file system.
**delimiter** parameter will deprecate after version 2.3.5, please use
**field_delimiter** instead.
+### file_filter_pattern [string]
+
+Filter pattern, which is used to filter files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### compress_codec [string]
The compress codec of files and the details that supported as the following
shown:
@@ -146,3 +200,26 @@ sink {
}
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ HdfsFile {
+ path = "/apps/hive/demo/student"
+ file_format_type = "json"
+ fs.defaultFS = "hdfs://namenode001"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
diff --git a/docs/en/connector-v2/source/LocalFile.md
b/docs/en/connector-v2/source/LocalFile.md
index 6d11b992e3..65f287f057 100644
--- a/docs/en/connector-v2/source/LocalFile.md
+++ b/docs/en/connector-v2/source/LocalFile.md
@@ -43,7 +43,7 @@ If you use SeaTunnel Engine, It automatically integrated the
hadoop jar when you
## Options
-| name | type | required | default value
|
+| name | type | required | default value
|
|---------------------------|---------|----------|--------------------------------------|
| path | string | yes | -
|
| file_format_type | string | yes | -
|
@@ -58,7 +58,7 @@ If you use SeaTunnel Engine, It automatically integrated the
hadoop jar when you
| sheet_name | string | no | -
|
| xml_row_tag | string | no | -
|
| xml_use_attr_format | boolean | no | -
|
-| file_filter_pattern | string | no | -
|
+| file_filter_pattern | string | no |
|
| compress_codec | string | no | none
|
| archive_compress_codec | string | no | none
|
| encoding | string | no | UTF-8
|
@@ -254,6 +254,55 @@ Specifies Whether to process data using the tag attribute
format.
Filter pattern, which is used to filter files.
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### compress_codec [string]
The compress codec of files and the details that supported as the following
shown:
@@ -406,6 +455,30 @@ sink {
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ LocalFile {
+ path = "/data/seatunnel/"
+ file_format_type = "csv"
+ skip_header_row_number = 1
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
## Changelog
### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/OssFile.md
b/docs/en/connector-v2/source/OssFile.md
index d5326cb86a..36d998f054 100644
--- a/docs/en/connector-v2/source/OssFile.md
+++ b/docs/en/connector-v2/source/OssFile.md
@@ -190,7 +190,7 @@ If you assign file type to `parquet` `orc`, schema option
not required, connecto
## Options
-| name | type | required | default value |
Description
|
+| name | type | required | default value |
Description
|
|---------------------------|---------|----------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path | string | yes | - | The
Oss path that needs to be read can have sub paths, but the sub paths need to
meet certain format requirements. Specific requirements can be referred to
"parse_partition_from_path" option
|
| file_format_type | string | yes | - | File
type, supported as the following file types: `text` `csv` `parquet` `orc`
`json` `excel` `xml` `binary`
|
@@ -211,7 +211,7 @@ If you assign file type to `parquet` `orc`, schema option
not required, connecto
| xml_use_attr_format | boolean | no | - |
Specifies whether to process data using the tag attribute format, only used
when file_format is xml.
|
| compress_codec | string | no | none | Which
compress codec the files used.
|
| encoding | string | no | UTF-8 |
-| file_filter_pattern | string | no | |
`*.txt` means you only need read the files end with `.txt`
|
+| file_filter_pattern | string | no | |
Filter pattern, which is used to filter files.
|
| common-options | config | no | - |
Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
### compress_codec [string]
@@ -233,6 +233,55 @@ The encoding of the file to read. This param will be
parsed by `Charset.forName(
Filter pattern, which is used to filter files.
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### schema [config]
Only need to be configured when the file_format_type are text, json, excel,
xml or csv ( Or other format we can't read the schema from metadata).
@@ -474,6 +523,33 @@ sink {
}
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ OssFile {
+ path = "/seatunnel/orc"
+ bucket = "oss://tyrantlucifer-image-bed"
+ access_key = "xxxxxxxxxxxxxxxxx"
+ access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
+ endpoint = "oss-cn-beijing.aliyuncs.com"
+ file_format_type = "orc"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
## Changelog
### 2.2.0-beta 2022-09-26
diff --git a/docs/en/connector-v2/source/OssJindoFile.md
b/docs/en/connector-v2/source/OssJindoFile.md
index d5bd6d14fa..933439edc9 100644
--- a/docs/en/connector-v2/source/OssJindoFile.md
+++ b/docs/en/connector-v2/source/OssJindoFile.md
@@ -49,7 +49,7 @@ It only supports hadoop version **2.9.X+**.
## Options
-| name | type | required | default value |
+| name | type | required | default value |
|---------------------------|---------|----------|---------------------|
| path | string | yes | - |
| file_format_type | string | yes | - |
@@ -68,7 +68,7 @@ It only supports hadoop version **2.9.X+**.
| sheet_name | string | no | - |
| xml_row_tag | string | no | - |
| xml_use_attr_format | boolean | no | - |
-| file_filter_pattern | string | no | - |
+| file_filter_pattern | string | no | |
| compress_codec | string | no | none |
| archive_compress_codec | string | no | none |
| encoding | string | no | UTF-8 |
@@ -267,6 +267,55 @@ Reader the sheet of the workbook.
Filter pattern, which is used to filter files.
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### compress_codec [string]
The compress codec of files and the details that supported as the following
shown:
@@ -364,6 +413,33 @@ sink {
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ OssJindoFile {
+ bucket = "oss://tyrantlucifer-image-bed"
+ access_key = "xxxxxxxxxxxxxxxxx"
+ access_secret = "xxxxxxxxxxxxxxxxxxxxxx"
+ endpoint = "oss-cn-beijing.aliyuncs.com"
+ path = "/seatunnel/read/binary/"
+ file_format_type = "binary"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
## Changelog
### next version
diff --git a/docs/en/connector-v2/source/S3File.md
b/docs/en/connector-v2/source/S3File.md
index d280d6dc7f..4834b025bc 100644
--- a/docs/en/connector-v2/source/S3File.md
+++ b/docs/en/connector-v2/source/S3File.md
@@ -196,7 +196,7 @@ If you assign file type to `parquet` `orc`, schema option
not required, connecto
## Options
-| name | type | required |
default value | Description
[...]
+| name | type | required | default value
| Description
[...]
|---------------------------------|---------|----------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
| path | string | yes | -
| The s3 path that needs to be read can have
sub paths, but the sub paths need to meet certain format requirements. Specific
requirements can be referred to "parse_partition_from_path" option
[...]
| file_format_type | string | yes | -
| File type, supported as the following file
types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
[...]
@@ -220,12 +220,66 @@ If you assign file type to `parquet` `orc`, schema option
not required, connecto
| compress_codec | string | no | none
|
[...]
| archive_compress_codec | string | no | none
|
[...]
| encoding | string | no | UTF-8
|
[...]
+| file_filter_pattern | string | no |
| Filter pattern, which is used to filter files.
[...]
| common-options | | no | -
| Source plugin common parameters, please refer
to [Source Common Options](../source-common-options.md) for details.
[...]
### delimiter/field_delimiter [string]
**delimiter** parameter will deprecate after version 2.3.5, please use
**field_delimiter** instead.
+### file_filter_pattern [string]
+
+Filter pattern, which is used to filter files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### compress_codec [string]
The compress codec of files and the details that supported as the following
shown:
@@ -349,6 +403,33 @@ sink {
}
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ S3File {
+ path = "/seatunnel/json"
+ bucket = "s3a://seatunnel-test"
+ fs.s3a.endpoint="s3.cn-north-1.amazonaws.com.cn"
+    fs.s3a.aws.credentials.provider = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
+ file_format_type = "json"
+ read_columns = ["id", "name"]
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
+
## Changelog
### 2.3.0-beta 2022-10-20
diff --git a/docs/en/connector-v2/source/SftpFile.md
b/docs/en/connector-v2/source/SftpFile.md
index 3eadcd3a69..95c710110a 100644
--- a/docs/en/connector-v2/source/SftpFile.md
+++ b/docs/en/connector-v2/source/SftpFile.md
@@ -71,7 +71,7 @@ The File does not have a specific type list, and we can
indicate which SeaTunnel
## Source Options
-| Name | Type | Required | default value |
Description
|
+| Name | Type | Required | default value |
Description
|
|---------------------------|---------|----------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| host | String | Yes | - | The
target sftp host is required
|
| port | Int | Yes | - | The
target sftp port is required
|
@@ -96,6 +96,59 @@ The File does not have a specific type list, and we can
indicate which SeaTunnel
| encoding | string | no | UTF-8 |
| common-options | | No | - |
Source plugin common parameters, please refer to [Source Common
Options](../source-common-options.md) for details.
|
+### file_filter_pattern [string]
+
+Filter pattern, which is used to filter files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+Here are some examples.
+
+File Structure Example:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching Rules Examples:
+
+**Example 1**: *Match all .txt files*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The result of this example is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files starting with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files starting with abc whose fourth character is either h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match third-level folders starting with 202410 and files ending with .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The result of this example is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### file_format_type [string]
File type, supported as the following file types:
@@ -305,3 +358,30 @@ SftpFile {
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ SftpFile {
+ host = "sftp"
+ port = 22
+ user = seatunnel
+ password = pass
+ path = "tmp/seatunnel/read/json"
+ file_format_type = "json"
+ result_table_name = "sftp"
+ // file example abcD2024.csv
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
\ No newline at end of file
diff --git a/docs/zh/connector-v2/source/HdfsFile.md
b/docs/zh/connector-v2/source/HdfsFile.md
index 0f983a80bc..9cd254ef80 100644
--- a/docs/zh/connector-v2/source/HdfsFile.md
+++ b/docs/zh/connector-v2/source/HdfsFile.md
@@ -39,7 +39,7 @@
## 源选项
-| 名称 | 类型 | 是否必须 | 默认值 |
描述
|
+| 名称 | 类型 | 是否必须 | 默认值 | 描述
|
|---------------------------|---------|------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path | string | 是 | - | 源文件路径。
|
| file_format_type | string | 是 | - |
我们支持以下文件类型:`text` `json` `csv` `orc` `parquet`
`excel`。请注意,最终文件名将以文件格式的后缀结束,文本文件的后缀是 `txt`。
|
@@ -55,6 +55,7 @@
| kerberos_principal | string | 否 | - | kerberos 的
principal。
|
| kerberos_keytab_path | string | 否 | - | kerberos 的
keytab 路径。
|
| skip_header_row_number | long | 否 | 0 | 跳过前几行,但仅适用于
txt 和 csv。例如,设置如下:`skip_header_row_number = 2`。然后 Seatunnel 将跳过源文件中的前两行。
|
+| file_filter_pattern       | string  | 否    | -              | Filter pattern, which is used to filter files.                                                                                                                                                                                              |
| schema | config | 否 | - | 上游数据的模式字段。
|
| sheet_name | string | 否 | - |
读取工作簿的表格,仅在文件格式为 excel 时使用。
|
| compress_codec | string | 否 | none | 文件的压缩编解码器。
|
@@ -64,6 +65,60 @@
**delimiter** 参数在版本 2.3.5 后将被弃用,请改用 **field_delimiter**。
+### file_filter_pattern [string]
+
+Filter pattern, which is used to filter files.
+
+The pattern follows standard regular expressions. For details, please refer to https://en.wikipedia.org/wiki/Regular_expression.
+
+Here are some examples.
+
+File list:
+```
+/data/seatunnel/20241001/report.txt
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+/data/seatunnel/20241012/logo.png
+```
+Matching rules:
+
+**Example 1**: *Match all files with the .txt suffix*, Regular Expression:
+```
+/data/seatunnel/20241001/.*\.txt
+```
+The matching result is:
+```
+/data/seatunnel/20241001/report.txt
+```
+**Example 2**: *Match all files whose names start with abc*, Regular Expression:
+```
+/data/seatunnel/\d*/abc.*
+```
+The matching result is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+```
+**Example 3**: *Match all files whose names start with abc and whose fourth character is h or g*, Regular Expression:
+```
+/data/seatunnel/20241007/abc[hg].*
+```
+The matching result is:
+```
+/data/seatunnel/20241007/abch202410.csv
+```
+**Example 4**: *Match files whose third-level folder starts with 202410 and whose suffix is .csv*, Regular Expression:
+```
+/data/seatunnel/202410\d*/.*\.csv
+```
+The matching result is:
+```
+/data/seatunnel/20241007/abch202410.csv
+/data/seatunnel/20241002/abcg202410.csv
+/data/seatunnel/20241005/old_data.csv
+```
+
### compress_codec [string]
文件的压缩编解码器及支持的详细信息如下所示:
@@ -125,3 +180,25 @@ sink {
}
```
+### Filter File
+
+```hocon
+env {
+ parallelism = 1
+ job.mode = "BATCH"
+}
+
+source {
+ HdfsFile {
+ path = "/apps/hive/demo/student"
+ file_format_type = "json"
+ fs.defaultFS = "hdfs://namenode001"
+ file_filter_pattern = "abc[DX]*.*"
+ }
+}
+
+sink {
+ Console {
+ }
+}
+```
\ No newline at end of file