This is an automated email from the ASF dual-hosted git repository.

fanjia pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git


The following commit(s) were added to refs/heads/dev by this push:
     new 36a1166afc [Feature][File] Add markdown parser documentation (#9834)
36a1166afc is described below

commit 36a1166afc39da7cf98bee3bd293a84806ee9520
Author: Joonseo Lee <[email protected]>
AuthorDate: Tue Sep 16 12:03:12 2025 +0900

    [Feature][File] Add markdown parser documentation (#9834)
---
 .../connector-v2/changelog/connector-file-cos.md   |  1 +
 .../connector-v2/changelog/connector-file-ftp.md   |  1 +
 .../changelog/connector-file-hadoop.md             |  1 +
 .../connector-v2/changelog/connector-file-local.md |  1 +
 .../connector-v2/changelog/connector-file-obs.md   |  1 +
 .../changelog/connector-file-oss-jindo.md          |  1 +
 .../connector-v2/changelog/connector-file-oss.md   |  1 +
 .../en/connector-v2/changelog/connector-file-s3.md |  1 +
 .../connector-v2/changelog/connector-file-sftp.md  |  1 +
 docs/en/connector-v2/changelog/connector-hive.md   |  1 +
 docs/en/connector-v2/source/CosFile.md             | 17 ++++++++++++++-
 docs/en/connector-v2/source/FtpFile.md             | 15 +++++++++++++
 docs/en/connector-v2/source/HdfsFile.md            | 23 +++++++++++++++++++-
 docs/en/connector-v2/source/Hive.md                | 13 +++++++++++
 docs/en/connector-v2/source/LocalFile.md           | 17 ++++++++++++++-
 docs/en/connector-v2/source/ObsFile.md             | 17 ++++++++++++++-
 docs/en/connector-v2/source/OssFile.md             | 25 ++++++++++++++++++++--
 docs/en/connector-v2/source/OssJindoFile.md        | 17 ++++++++++++++-
 docs/en/connector-v2/source/S3File.md              | 23 +++++++++++++++++++-
 docs/en/connector-v2/source/SftpFile.md            | 17 ++++++++++++++-
 .../connector-v2/changelog/connector-file-cos.md   |  1 +
 .../connector-v2/changelog/connector-file-ftp.md   |  1 +
 .../changelog/connector-file-hadoop.md             |  1 +
 .../connector-v2/changelog/connector-file-local.md |  1 +
 .../connector-v2/changelog/connector-file-obs.md   |  1 +
 .../changelog/connector-file-oss-jindo.md          |  1 +
 .../connector-v2/changelog/connector-file-oss.md   |  1 +
 .../zh/connector-v2/changelog/connector-file-s3.md |  1 +
 .../connector-v2/changelog/connector-file-sftp.md  |  1 +
 docs/zh/connector-v2/changelog/connector-hive.md   |  1 +
 docs/zh/connector-v2/source/CosFile.md             | 17 ++++++++++++++-
 docs/zh/connector-v2/source/FtpFile.md             | 16 +++++++++++++-
 docs/zh/connector-v2/source/HdfsFile.md            | 23 +++++++++++++++++++-
 docs/zh/connector-v2/source/Hive.md                | 13 +++++++++++
 docs/zh/connector-v2/source/LocalFile.md           | 17 ++++++++++++++-
 docs/zh/connector-v2/source/OssFile.md             | 25 ++++++++++++++++++++--
 docs/zh/connector-v2/source/S3File.md              | 23 +++++++++++++++++++-
 docs/zh/connector-v2/source/SftpFile.md            | 17 ++++++++++++++-
 38 files changed, 338 insertions(+), 17 deletions(-)

diff --git a/docs/en/connector-v2/changelog/connector-file-cos.md b/docs/en/connector-v2/changelog/connector-file-cos.md
index 748e1c5e76..95d8749d90 100644
--- a/docs/en/connector-v2/changelog/connector-file-cos.md
+++ b/docs/en/connector-v2/changelog/connector-file-cos.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-file-ftp.md b/docs/en/connector-v2/changelog/connector-file-ftp.md
index edb03e33b1..bc78c789c6 100644
--- a/docs/en/connector-v2/changelog/connector-file-ftp.md
+++ b/docs/en/connector-v2/changelog/connector-file-ftp.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[Improve][Connector-V2] Add remote host verification option for FTP data channels (#9324)|https://github.com/apache/seatunnel/commit/019d69d10a|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-file-hadoop.md b/docs/en/connector-v2/changelog/connector-file-hadoop.md
index 78c05b0633..218523ea7b 100644
--- a/docs/en/connector-v2/changelog/connector-file-hadoop.md
+++ b/docs/en/connector-v2/changelog/connector-file-hadoop.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Feature][Connector-V2] Support hdfs file multi table source read (#9816)|https://github.com/apache/seatunnel/commit/672af255ef| dev |
 |[Feature][Connector-File-Hadoop]Support multi table sink feature for HdfsFile (#9651)|https://github.com/apache/seatunnel/commit/bb4f743c05|2.3.12|
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
diff --git a/docs/en/connector-v2/changelog/connector-file-local.md b/docs/en/connector-v2/changelog/connector-file-local.md
index 6c09c07c4b..9453f02b9f 100644
--- a/docs/en/connector-v2/changelog/connector-file-local.md
+++ b/docs/en/connector-v2/changelog/connector-file-local.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] File Source Support filtering files by last modified time. (#9526)|https://github.com/apache/seatunnel/commit/cde4c3d410|2.3.12|
 |[Feature][Format] Improve maxwell_json,canal_json,debezium_json format add ts_ms and table (#9701)|https://github.com/apache/seatunnel/commit/fb8444b946|2.3.12|
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
diff --git a/docs/en/connector-v2/changelog/connector-file-obs.md b/docs/en/connector-v2/changelog/connector-file-obs.md
index 64871f6711..6af012bd6e 100644
--- a/docs/en/connector-v2/changelog/connector-file-obs.md
+++ b/docs/en/connector-v2/changelog/connector-file-obs.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-file-oss-jindo.md b/docs/en/connector-v2/changelog/connector-file-oss-jindo.md
index a7d6fe69d2..6da2dabb5f 100644
--- a/docs/en/connector-v2/changelog/connector-file-oss-jindo.md
+++ b/docs/en/connector-v2/changelog/connector-file-oss-jindo.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2][OSS-Jindo] Optimize jindo oss connector (#4964)|https://github.com/apache/seatunnel/commit/5fbfd05061|2.3.3|
 |[Fix][Connector-V2] Fix file-oss config check bug and amend file-oss-jindo factoryIdentifier (#4581)|https://github.com/apache/seatunnel/commit/5c4f17df20|2.3.2|
 | [Feature][ConnectorV2]add file excel sink and source (#4164)|https://github.com/apache/seatunnel/commit/e3b97ae5d2|2.3.2|
diff --git a/docs/en/connector-v2/changelog/connector-file-oss.md b/docs/en/connector-v2/changelog/connector-file-oss.md
index 80f32c03fa..cf04ef82b9 100644
--- a/docs/en/connector-v2/changelog/connector-file-oss.md
+++ b/docs/en/connector-v2/changelog/connector-file-oss.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[Doc][Connector-V2] Update save mode config for OssFileSink (#9303)|https://github.com/apache/seatunnel/commit/40097d7f3e|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-file-s3.md b/docs/en/connector-v2/changelog/connector-file-s3.md
index 42fcc5befd..8e557f689b 100644
--- a/docs/en/connector-v2/changelog/connector-file-s3.md
+++ b/docs/en/connector-v2/changelog/connector-file-s3.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-file-sftp.md b/docs/en/connector-v2/changelog/connector-file-sftp.md
index 022a7d4934..84fa2facfa 100644
--- a/docs/en/connector-v2/changelog/connector-file-sftp.md
+++ b/docs/en/connector-v2/changelog/connector-file-sftp.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/en/connector-v2/changelog/connector-hive.md b/docs/en/connector-v2/changelog/connector-hive.md
index 433aa3a2bc..91c969fb99 100644
--- a/docs/en/connector-v2/changelog/connector-hive.md
+++ b/docs/en/connector-v2/changelog/connector-hive.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][API] Optimize the enumerator API semantics and reduce lock calls at the connector level (#9671)|https://github.com/apache/seatunnel/commit/9212a77140|2.3.12|
 |[Feature][connector-hive] hive sink connector support overwrite mode #7843 (#7891)|https://github.com/apache/seatunnel/commit/6fafe6f4d3|2.3.12|
 |[Fix][Connector-V2] Fix hive client thread unsafe (#9282)|https://github.com/apache/seatunnel/commit/5dc25897a9|2.3.11|
diff --git a/docs/en/connector-v2/source/CosFile.md b/docs/en/connector-v2/source/CosFile.md
index 2a0c5867fe..9818b641a5 100644
--- a/docs/en/connector-v2/source/CosFile.md
+++ b/docs/en/connector-v2/source/CosFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-cos.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -91,7 +92,7 @@ The source file path.
 
 File type, supported as the following file types:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
 
@@ -180,6 +181,20 @@ such as compressed packages, pictures, etc. In short, any files can be synchroni
 Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
 at the same time. You can find the specific usage in the example below.
 
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
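+For reference, here is a minimal source configuration sketch for reading markdown files from COS (the path, bucket, keys, and region below are placeholder values, not settings from a real deployment):
+
+```hocon
+source {
+  CosFile {
+    path = "/seatunnel/docs"
+    file_format_type = "markdown"
+    bucket = "cos://seatunnel-test-1234567890"
+    secret_id = "xxxxxxxxxxxxxxxxx"
+    secret_key = "xxxxxxxxxxxxxxxxx"
+    region = "ap-chengdu"
+  }
+}
+```
+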
 ### bucket [string]
 
 The bucket address of Cos file system, for example: `Cos://tyrantlucifer-image-bed`
diff --git a/docs/en/connector-v2/source/FtpFile.md b/docs/en/connector-v2/source/FtpFile.md
index 29afff71cb..9705e227a5 100644
--- a/docs/en/connector-v2/source/FtpFile.md
+++ b/docs/en/connector-v2/source/FtpFile.md
@@ -29,6 +29,7 @@ import ChangeLog from '../changelog/connector-file-ftp.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -259,6 +260,20 @@ such as compressed packages, pictures, etc. In short, any files can be synchroni
 Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
 at the same time. You can find the specific usage in the example below.
 
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
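+For reference, here is a minimal source configuration sketch for reading markdown files over FTP (the host, credentials, and path are placeholder values):
+
+```hocon
+source {
+  FtpFile {
+    host = "ftp.example.com"
+    port = 21
+    user = "username"
+    password = "password"
+    path = "/data/markdown"
+    file_format_type = "markdown"
+  }
+}
+```
+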
 ### connection_mode [string]
 
 The target ftp connection mode , default is active mode, supported as the following modes:
diff --git a/docs/en/connector-v2/source/HdfsFile.md b/docs/en/connector-v2/source/HdfsFile.md
index ca3dfb65b6..b208497df1 100644
--- a/docs/en/connector-v2/source/HdfsFile.md
+++ b/docs/en/connector-v2/source/HdfsFile.md
@@ -35,6 +35,7 @@ import ChangeLog from '../changelog/connector-file-hadoop.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -51,7 +52,7 @@ Read data from hdfs file system.
 | Name                      | Type    | Required | Default                     | Description |
 |---------------------------|---------|----------|-----------------------------|-------------|
 | path                      | string  | yes      | -                           | The source file path. |
-| file_format_type          | string  | yes      | -                           | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
+| file_format_type          | string  | yes      | -                           | We support the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`. Please note that the final file name will end with the file format's suffix; the suffix of a text file is `txt`. |
 | fs.defaultFS              | string  | yes      | -                           | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` |
 | read_columns              | list    | no       | -                           | The read column list of the data source, user can use it to implement field projection.The file type supported column projection as the following shown:[text,json,csv,orc,parquet,excel,xml].Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured. |
 | hdfs_site_path            | string  | no       | -                           | The path of `hdfs-site.xml`, used to load ha configuration of namenodes |
@@ -83,6 +84,26 @@ Read data from hdfs file system.
 | file_filter_modified_start  | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification start time (include start time). The default data format is `yyyy-MM-dd HH:mm:ss`. |
 | file_filter_modified_end    | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification end time (not include end time). The default data format is `yyyy-MM-dd HH:mm:ss`. |
 
+### file_format_type [string]
+
+File type, supported as the following file types:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
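+For reference, here is a minimal source configuration sketch for reading markdown files from HDFS (the path and cluster address are placeholder values):
+
+```hocon
+source {
+  HdfsFile {
+    path = "/apps/docs"
+    file_format_type = "markdown"
+    fs.defaultFS = "hdfs://hadoopcluster"
+  }
+}
+```
+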
 ### delimiter/field_delimiter [string]
 
 **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead.
diff --git a/docs/en/connector-v2/source/Hive.md b/docs/en/connector-v2/source/Hive.md
index 4461cb5348..7d99051e73 100644
--- a/docs/en/connector-v2/source/Hive.md
+++ b/docs/en/connector-v2/source/Hive.md
@@ -8,6 +8,18 @@ import ChangeLog from '../changelog/connector-hive.md';
 
 Read data from Hive.
 
+When using markdown format, SeaTunnel can parse markdown files stored in Hive tables and extract structured data with elements like headings, paragraphs, lists, code blocks, and tables. Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
 :::tip
 
 In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9 and 3.1.3 .
@@ -32,6 +44,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
   - [x] parquet
   - [x] orc
   - [x] json
+  - [x] markdown
 
 ## Options
 
diff --git a/docs/en/connector-v2/source/LocalFile.md b/docs/en/connector-v2/source/LocalFile.md
index b460101418..6b41e5f420 100644
--- a/docs/en/connector-v2/source/LocalFile.md
+++ b/docs/en/connector-v2/source/LocalFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-local.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -87,7 +88,7 @@ The source file path.
 
 File type, supported as the following file types:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
 
@@ -176,6 +177,20 @@ such as compressed packages, pictures, etc. In short, any files can be synchroni
 Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
 at the same time. You can find the specific usage in the example below.
 
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
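+For reference, here is a minimal source configuration sketch for reading local markdown files (the path is a placeholder value):
+
+```hocon
+source {
+  LocalFile {
+    path = "/data/docs"
+    file_format_type = "markdown"
+  }
+}
+```
+
+As an illustration, a file containing a level-1 heading followed by a single paragraph might produce rows along these lines (the element IDs here are illustrative; actual values are generated by the parser):
+
+| element_id | element_type | heading_level | text | page_number | position_index | parent_id | child_ids |
+|---|---|---|---|---|---|---|---|
+| elem_0 | Heading | 1 | Quick Start | 1 | 0 |  | elem_1 |
+| elem_1 | Paragraph |  | Install the connector first. | 1 | 1 | elem_0 |  |
+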
 ### read_columns [list]
 
 The read column list of the data source, user can use it to implement field projection.
diff --git a/docs/en/connector-v2/source/ObsFile.md b/docs/en/connector-v2/source/ObsFile.md
index c1eb57f8ea..f9004ee59e 100644
--- a/docs/en/connector-v2/source/ObsFile.md
+++ b/docs/en/connector-v2/source/ObsFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-obs.md';
   - [x] orc
   - [x] json
   - [x] excel
+  - [x] markdown
 
 ## Description
 
@@ -138,7 +139,7 @@ It only supports hadoop version **2.9.X+**.
 
 > File type, supported as the following file types:
 >
-> `text` `csv` `parquet` `orc` `json` `excel`
+> `text` `csv` `parquet` `orc` `json` `excel` `markdown`
 >
 > If you assign file type to `json`, you should also assign schema option to 
 > tell the connector how to parse data to the row you want.
 >
@@ -222,6 +223,20 @@ schema {
 |---------------|-----|--------|
 | tyrantlucifer | 26  | male   |
 
+> If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+> The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+> Each element is converted to a row with the following schema:
+> - `element_id`: Unique identifier for the element
+> - `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+> - `heading_level`: Level of heading (1-6, null for non-heading elements)
+> - `text`: Text content of the element
+> - `page_number`: Page number (default: 1)
+> - `position_index`: Position index within the document
+> - `parent_id`: ID of the parent element
+> - `child_ids`: Comma-separated list of child element IDs
+>
+> Note: Markdown format only supports reading, not writing.
+
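+> For reference, here is a minimal source configuration sketch for reading markdown files from OBS (the path, bucket, keys, and endpoint are placeholder values):
+>
+> ```hocon
+> source {
+>   ObsFile {
+>     path = "/seatunnel/docs"
+>     file_format_type = "markdown"
+>     bucket = "obs://obs-bucket-name"
+>     access_key = "xxxxxxxxxxxxxxxxx"
+>     access_secret = "xxxxxxxxxxxxxxxxx"
+>     endpoint = "obs.xxxxxx.myhuaweicloud.com"
+>   }
+> }
+> ```
+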
 #### <span id="schema"> schema  </span>
 
 ##### fields
diff --git a/docs/en/connector-v2/source/OssFile.md b/docs/en/connector-v2/source/OssFile.md
index ac76faf0ab..6f664fe6c4 100644
--- a/docs/en/connector-v2/source/OssFile.md
+++ b/docs/en/connector-v2/source/OssFile.md
@@ -45,12 +45,13 @@ import ChangeLog from '../changelog/connector-file-oss.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Data Type Mapping
 
 Data type mapping is related to the type of file being read, We supported as the following file types:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `markdown`
 
 ### JSON File Type
 
@@ -185,7 +186,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | name                      | type    | required | default value       | Description |
 |---------------------------|---------|----------|---------------------|-------------|
 | path                      | string  | yes      | -                   | The Oss path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option |
-| file_format_type          | string  | yes      | -                   | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
+| file_format_type          | string  | yes      | -                   | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown` |
 | bucket                    | string  | yes      | -                   | The bucket address of oss file system, for example: `oss://seatunnel-test`. |
 | endpoint                  | string  | yes      | -                   | fs oss endpoint |
 | read_columns              | list    | no       | -                   | The read column list of the data source, user can use it to implement field projection. The file type supported column projection as the following shown: `text` `csv` `parquet` `orc` `json` `excel` `xml` . If the user wants to use this feature when reading `text` `json` `csv` files, the "schema" option must be configured. |
@@ -215,6 +216,26 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | file_filter_modified_start  | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification start time (include start time). The default data format is `yyyy-MM-dd HH:mm:ss`. |
 | file_filter_modified_end    | string  | no       | -                   | File modification time filter. The connector will filter some files base on the last modification end time (not include end time). The default data format is `yyyy-MM-dd HH:mm:ss`. |
 
+### file_format_type [string]
+
+File type, supported as the following file types:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
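+For reference, here is a minimal source configuration sketch for reading markdown files from OSS (the path, bucket, endpoint, and keys are placeholder values):
+
+```hocon
+source {
+  OssFile {
+    path = "/seatunnel/docs"
+    file_format_type = "markdown"
+    bucket = "oss://seatunnel-test"
+    endpoint = "oss-cn-beijing.aliyuncs.com"
+    access_key = "xxxxxxxxxxxxxxxxx"
+    access_secret = "xxxxxxxxxxxxxxxxx"
+  }
+}
+```
+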
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
diff --git a/docs/en/connector-v2/source/OssJindoFile.md b/docs/en/connector-v2/source/OssJindoFile.md
index 7e6df5c7c4..59183b555e 100644
--- a/docs/en/connector-v2/source/OssJindoFile.md
+++ b/docs/en/connector-v2/source/OssJindoFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-oss-jindo.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -93,7 +94,7 @@ The source file path.
 
 File type, supported as the following file types:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
 
@@ -182,6 +183,20 @@ such as compressed packages, pictures, etc. In short, any files can be synchroni
 Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
 at the same time. You can find the specific usage in the example below.
 
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
 ### bucket [string]
 
 The bucket address of oss file system, for example: `oss://tyrantlucifer-image-bed`
diff --git a/docs/en/connector-v2/source/S3File.md b/docs/en/connector-v2/source/S3File.md
index fa9831f0b6..7f61c416a2 100644
--- a/docs/en/connector-v2/source/S3File.md
+++ b/docs/en/connector-v2/source/S3File.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-s3.md';
     - [x] excel
     - [x] xml
     - [x] binary
+    - [x] markdown
 
 ## Description
 
@@ -191,7 +192,7 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | name                            | type    | required | default value                                          | Description [...]
 |---------------------------------|---------|----------|--------------------------------------------------------|------------ [...]
 | path                            | string  | yes      | -                                                      | The s3 path that needs to be read can have sub paths, but the sub paths need to meet certain format requirements. Specific requirements can be referred to "parse_partition_from_path" option [...]
-| file_format_type                | string  | yes      | -                                                      | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` [...]
+| file_format_type                | string  | yes      | -                                                      | File type, supported as the following file types: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown` [...]
 | bucket                          | string  | yes      | -                                                      | The bucket address of s3 file system, for example: `s3n://seatunnel-test`, if you use `s3a` protocol, this parameter should be `s3a://seatunnel-test`. [...]
 | fs.s3a.endpoint                 | string  | yes      | -                                                      | fs s3a endpoint [...]
 | fs.s3a.aws.credentials.provider | string  | yes      | com.amazonaws.auth.InstanceProfileCredentialsProvider | The way to authenticate s3a. We only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider` now. More information about the credential provider you can see [Hadoop AWS Document](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Simple_name.2Fsecret_credentials_with_SimpleAWSCredenti [...]
@@ -222,6 +223,26 @@ If you assign file type to `parquet` `orc`, schema option not required, connecto
 | filename_extension              | string  | no       | -                                                      | Filter filename extension, which used for filtering files with specific extension. Example: `csv` `.txt` `json` `.xml`. [...]
 | common-options                  |         | no       | -                                                      | Source plugin common parameters, please refer to [Source Common Options](../source-common-options.md) for details. [...]
 
+### file_format_type [string]
+
+File type, supported as the following file types:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
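+For reference, here is a minimal source configuration sketch for reading markdown files from S3 (the path, bucket, endpoint, and keys are placeholder values):
+
+```hocon
+source {
+  S3File {
+    path = "/seatunnel/docs"
+    file_format_type = "markdown"
+    bucket = "s3a://seatunnel-test"
+    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"
+    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
+    access_key = "xxxxxxxxxxxxxxxxx"
+    secret_key = "xxxxxxxxxxxxxxxxx"
+  }
+}
+```
+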
 ### delimiter/field_delimiter [string]
 
 **delimiter** parameter will deprecate after version 2.3.5, please use **field_delimiter** instead.
diff --git a/docs/en/connector-v2/source/SftpFile.md b/docs/en/connector-v2/source/SftpFile.md
index 1b2ab5df96..40333c96de 100644
--- a/docs/en/connector-v2/source/SftpFile.md
+++ b/docs/en/connector-v2/source/SftpFile.md
@@ -29,6 +29,7 @@ import ChangeLog from '../changelog/connector-file-sftp.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -166,7 +167,7 @@ The result of this example matching is:
 ### file_format_type [string]
 
 File type, supported as the following file types:
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 If you assign file type to `json`, you should also assign schema option to tell connector how to parse data to the row you want.
 For example:
 upstream data is the following:
@@ -234,6 +235,20 @@ such as compressed packages, pictures, etc. In short, any files can be synchroni
 Under this requirement, you need to ensure that the source and sink use `binary` format for file synchronization
 at the same time.
 
+If you assign file type to `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts structural elements such as headings, paragraphs, lists, code blocks, and tables.
+Each element is converted to a row with the following schema:
+- `element_id`: Unique identifier for the element
+- `element_type`: Type of the element (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: Level of heading (1-6, null for non-heading elements)
+- `text`: Text content of the element
+- `page_number`: Page number (default: 1)
+- `position_index`: Position index within the document
+- `parent_id`: ID of the parent element
+- `child_ids`: Comma-separated list of child element IDs
+
+Note: Markdown format only supports reading, not writing.
+
 ### compress_codec [string]
 
 The compress codec of files and the details that supported as the following shown:
diff --git a/docs/zh/connector-v2/changelog/connector-file-cos.md b/docs/zh/connector-v2/changelog/connector-file-cos.md
index 748e1c5e76..95d8749d90 100644
--- a/docs/zh/connector-v2/changelog/connector-file-cos.md
+++ b/docs/zh/connector-v2/changelog/connector-file-cos.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-file-ftp.md b/docs/zh/connector-v2/changelog/connector-file-ftp.md
index edb03e33b1..bc78c789c6 100644
--- a/docs/zh/connector-v2/changelog/connector-file-ftp.md
+++ b/docs/zh/connector-v2/changelog/connector-file-ftp.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[Improve][Connector-V2] Add remote host verification option for FTP data channels (#9324)|https://github.com/apache/seatunnel/commit/019d69d10a|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-file-hadoop.md b/docs/zh/connector-v2/changelog/connector-file-hadoop.md
index 78c05b0633..218523ea7b 100644
--- a/docs/zh/connector-v2/changelog/connector-file-hadoop.md
+++ b/docs/zh/connector-v2/changelog/connector-file-hadoop.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Feature][Connector-V2] Support hdfs file multi table source read (#9816)|https://github.com/apache/seatunnel/commit/672af255ef| dev |
 |[Feature][Connector-File-Hadoop]Support multi table sink feature for HdfsFile (#9651)|https://github.com/apache/seatunnel/commit/bb4f743c05|2.3.12|
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
diff --git a/docs/zh/connector-v2/changelog/connector-file-local.md b/docs/zh/connector-v2/changelog/connector-file-local.md
index 6c09c07c4b..9453f02b9f 100644
--- a/docs/zh/connector-v2/changelog/connector-file-local.md
+++ b/docs/zh/connector-v2/changelog/connector-file-local.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] File Source Support filtering files by last modified time. (#9526)|https://github.com/apache/seatunnel/commit/cde4c3d410|2.3.12|
 |[Feature][Format] Improve maxwell_json,canal_json,debezium_json format add ts_ms and table (#9701)|https://github.com/apache/seatunnel/commit/fb8444b946|2.3.12|
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
diff --git a/docs/zh/connector-v2/changelog/connector-file-obs.md b/docs/zh/connector-v2/changelog/connector-file-obs.md
index 64871f6711..6af012bd6e 100644
--- a/docs/zh/connector-v2/changelog/connector-file-obs.md
+++ b/docs/zh/connector-v2/changelog/connector-file-obs.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-file-oss-jindo.md b/docs/zh/connector-v2/changelog/connector-file-oss-jindo.md
index a7d6fe69d2..6da2dabb5f 100644
--- a/docs/zh/connector-v2/changelog/connector-file-oss-jindo.md
+++ b/docs/zh/connector-v2/changelog/connector-file-oss-jindo.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2][OSS-Jindo] Optimize jindo oss connector (#4964)|https://github.com/apache/seatunnel/commit/5fbfd05061|2.3.3|
 |[Fix][Connector-V2] Fix file-oss config check bug and amend file-oss-jindo factoryIdentifier (#4581)|https://github.com/apache/seatunnel/commit/5c4f17df20|2.3.2|
 | [Feature][ConnectorV2]add file excel sink and source (#4164)|https://github.com/apache/seatunnel/commit/e3b97ae5d2|2.3.2|
diff --git a/docs/zh/connector-v2/changelog/connector-file-oss.md b/docs/zh/connector-v2/changelog/connector-file-oss.md
index 80f32c03fa..cf04ef82b9 100644
--- a/docs/zh/connector-v2/changelog/connector-file-oss.md
+++ b/docs/zh/connector-v2/changelog/connector-file-oss.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[Doc][Connector-V2] Update save mode config for OssFileSink (#9303)|https://github.com/apache/seatunnel/commit/40097d7f3e|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-file-s3.md b/docs/zh/connector-v2/changelog/connector-file-s3.md
index 42fcc5befd..8e557f689b 100644
--- a/docs/zh/connector-v2/changelog/connector-file-s3.md
+++ b/docs/zh/connector-v2/changelog/connector-file-s3.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-file-sftp.md b/docs/zh/connector-v2/changelog/connector-file-sftp.md
index 022a7d4934..84fa2facfa 100644
--- a/docs/zh/connector-v2/changelog/connector-file-sftp.md
+++ b/docs/zh/connector-v2/changelog/connector-file-sftp.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][Connector-V2] Add customizable row delimiter support for text file processing (#9608)|https://github.com/apache/seatunnel/commit/7898e62e01|2.3.12|
 |[Improve][Connector-V2] Support maxcompute sink writer with timestamp field type (#9234)|https://github.com/apache/seatunnel/commit/a513c495e3|2.3.12|
 |[improve] update file connectors config (#9034)|https://github.com/apache/seatunnel/commit/8041d59dc2|2.3.11|
diff --git a/docs/zh/connector-v2/changelog/connector-hive.md b/docs/zh/connector-v2/changelog/connector-hive.md
index 433aa3a2bc..91c969fb99 100644
--- a/docs/zh/connector-v2/changelog/connector-hive.md
+++ b/docs/zh/connector-v2/changelog/connector-hive.md
@@ -2,6 +2,7 @@
 
 | Change | Commit | Version |
 | --- | --- | --- |
+|[Feature][File] Add markdown parser #9714|https://github.com/apache/seatunnel/commit/8b3c07844| dev |
 |[Improve][API] Optimize the enumerator API semantics and reduce lock calls at the connector level (#9671)|https://github.com/apache/seatunnel/commit/9212a77140|2.3.12|
 |[Feature][connector-hive] hive sink connector support overwrite mode #7843 (#7891)|https://github.com/apache/seatunnel/commit/6fafe6f4d3|2.3.12|
 |[Fix][Connector-V2] Fix hive client thread unsafe (#9282)|https://github.com/apache/seatunnel/commit/5dc25897a9|2.3.11|
diff --git a/docs/zh/connector-v2/source/CosFile.md b/docs/zh/connector-v2/source/CosFile.md
index 33861b30d8..dfa3f30e0f 100644
--- a/docs/zh/connector-v2/source/CosFile.md
+++ b/docs/zh/connector-v2/source/CosFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-cos.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## 描述
 
@@ -90,7 +91,7 @@ import ChangeLog from '../changelog/connector-file-cos.md';
 
 文件类型,支持以下文件类型:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 如果您将文件类型设置为“json”,您还应该分配模式选项,告诉连接器如何将数据解析到所需的行。
 
@@ -176,6 +177,20 @@ schema {
 
 如果将文件类型指定为“二进制”,SeaTunnel可以同步任何格式的文件,
 例如压缩包、图片等。简而言之,任何文件都可以同步到目标位置。
 根据此要求,您需要确保源端和目标端同时使用“二进制”格式进行文件同步。您可以在下面的示例中找到具体用法。
+
+如果您将文件类型指定为 `markdown`,SeaTunnel 可以解析 markdown 文件并提取结构化数据。
+markdown 解析器提取各种元素,包括标题、段落、列表、代码块、表格等。
+每个元素都转换为具有以下架构的行:
+- `element_id`:元素的唯一标识符
+- `element_type`:元素类型(Heading、Paragraph、ListItem 等)
+- `heading_level`:标题级别(1-6,非标题元素为 null)
+- `text`:元素的文本内容
+- `page_number`:页码(默认:1)
+- `position_index`:文档中的位置索引
+- `parent_id`:父元素的 ID
+- `child_ids`:子元素 ID 的逗号分隔列表
+
+注意:Markdown 格式仅支持读取,不支持写入。
 
 ### bucket [string]
diff --git a/docs/zh/connector-v2/source/FtpFile.md b/docs/zh/connector-v2/source/FtpFile.md
index 61a349f2ee..c19e693390 100644
--- a/docs/zh/connector-v2/source/FtpFile.md
+++ b/docs/zh/connector-v2/source/FtpFile.md
@@ -159,7 +159,7 @@ import ChangeLog from '../changelog/connector-file-ftp.md';
 
 文件类型,支持以下文件类型:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 如果您将文件类型指定为 `json`,您还需要指定 schema 选项以告诉连接器如何将数据解析为您所需的行。
 
@@ -237,6 +237,20 @@ schema {
 在这种情况下,您需要确保源和接收端同时使用 `binary` 格式进行文件同步。
 您可以在下面的示例中找到具体用法。
 
+如果您将文件类型指定为 `markdown`,SeaTunnel 可以解析 markdown 文件并提取结构化数据。
+markdown 解析器提取各种元素,包括标题、段落、列表、代码块、表格等。
+每个元素都转换为具有以下架构的行:
+- `element_id`:元素的唯一标识符
+- `element_type`:元素类型(Heading、Paragraph、ListItem 等)
+- `heading_level`:标题级别(1-6,非标题元素为 null)
+- `text`:元素的文本内容
+- `page_number`:页码(默认:1)
+- `position_index`:文档中的位置索引
+- `parent_id`:父元素的 ID
+- `child_ids`:子元素 ID 的逗号分隔列表
+
+注意:Markdown 格式仅支持读取,不支持写入。
+
 ### connection_mode [string]
 
 目标 FTP 连接模式,默认为主动模式,支持以下模式:
diff --git a/docs/zh/connector-v2/source/HdfsFile.md b/docs/zh/connector-v2/source/HdfsFile.md
index 6b1ce1b73e..4dd2e457c3 100644
--- a/docs/zh/connector-v2/source/HdfsFile.md
+++ b/docs/zh/connector-v2/source/HdfsFile.md
@@ -35,6 +35,7 @@ import ChangeLog from '../changelog/connector-file-hadoop.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## 描述
 
@@ -51,7 +52,7 @@ import ChangeLog from '../changelog/connector-file-hadoop.md';
 | 名称                      | 类型    | 是否必须 | 默认值             | 描述 |
 |---------------------------|---------|----------|---------------------|------|
 | path                      | string  | 是      | -                   | 源文件路径。 |
-| file_format_type          | string  | 是      | -                   | 我们支持以下文件类型:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`。请注意,最终文件名将以文件格式的后缀结束,文本文件的后缀是 `txt`。 |
+| file_format_type          | string  | 是      | -                   | 我们支持以下文件类型:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`。请注意,最终文件名将以文件格式的后缀结束,文本文件的后缀是 `txt`。 |
 | fs.defaultFS              | string  | 是      | -                   | 以 `hdfs://` 开头的 hadoop 集群地址,例如:`hdfs://hadoopcluster` |
 | read_columns              | list    | 否       | -                   | 数据源的读取列列表,用户可以使用它来实现字段投影。支持列投影的文件类型如下所示:[text,json,csv,orc,parquet,excel,xml]。提示:如果用户想在读取 `text` `json` `csv` 文件时使用此功能,必须配置 schema 选项。 |
 | hdfs_site_path            | string  | 否       | -                   | `hdfs-site.xml` 的路径,用于加载 namenodes 的 ha 配置 |
@@ -83,6 +84,26 @@ import ChangeLog from '../changelog/connector-file-hadoop.md';
 | file_filter_modified_start  | string  | 否    | -                   | 按照最后修改时间过滤文件。 要过滤的开始时间(包括改时间),时间格式是:`yyyy-MM-dd HH:mm:ss` |
 | file_filter_modified_end    | string  | 否    | -                   | 按照最后修改时间过滤文件。 要过滤的结束时间(不包括改时间),时间格式是:`yyyy-MM-dd HH:mm:ss` |
 
+### file_format_type [string]
+
+File type, the following file types are supported:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign the file type as `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts various elements, including headings, paragraphs, lists, code blocks, tables, and more.
+Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
+
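+As a minimal illustrative sketch (the cluster address and path below are hypothetical), reading markdown files from HDFS might look like this:
+
+```hocon
+source {
+  HdfsFile {
+    fs.defaultFS = "hdfs://hadoopcluster"  # hypothetical cluster address
+    path = "/data/docs"                    # hypothetical directory of .md files
+    file_format_type = "markdown"
+  }
+}
+```
+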
 ### delimiter/field_delimiter [string]
 
 The **delimiter** parameter will be deprecated after version 2.3.5; please use **field_delimiter** instead.
diff --git a/docs/zh/connector-v2/source/Hive.md b/docs/zh/connector-v2/source/Hive.md
index a17d552946..396585b545 100644
--- a/docs/zh/connector-v2/source/Hive.md
+++ b/docs/zh/connector-v2/source/Hive.md
@@ -8,6 +8,18 @@ import ChangeLog from '../changelog/connector-hive.md';
 
 Read data from Hive.
 
+When using the markdown format, SeaTunnel can parse markdown files stored in Hive tables and extract structured data, including elements such as headings, paragraphs, lists, code blocks, and tables. Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
+
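+As an illustrative sketch only (the table name and metastore URI below are hypothetical, and whether markdown parsing applies depends on the storage format of the Hive table's files), a source block might look like this:
+
+```hocon
+source {
+  Hive {
+    table_name = "default.markdown_docs"      # hypothetical table backed by markdown files
+    metastore_uri = "thrift://metastore:9083" # hypothetical metastore address
+  }
+}
+```
+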
 :::tip Tips
 
 To use this connector, you must ensure that your Spark/Flink cluster has already integrated Hive. The tested Hive versions are 2.3.9 and 3.1.3.
@@ -32,6 +44,7 @@ import ChangeLog from '../changelog/connector-hive.md';
     - [x] Parquet
     - [x] ORC
     - [x] JSON
+    - [x] markdown
 
 ## Options
 
diff --git a/docs/zh/connector-v2/source/LocalFile.md b/docs/zh/connector-v2/source/LocalFile.md
index 4bb6f45b08..cbb14fcb23 100644
--- a/docs/zh/connector-v2/source/LocalFile.md
+++ b/docs/zh/connector-v2/source/LocalFile.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-local.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -88,7 +89,7 @@ import ChangeLog from '../changelog/connector-file-local.md';
 
 File type, the following file types are supported:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 
 If you assign the file type as `json`, you should also specify the schema option to tell the connector how to parse the data into the rows you want.
 
@@ -177,6 +178,20 @@ schema {
 Under this requirement, you need to make sure that the source and the sink use the `binary` format for file synchronization at the same time.
 You can find the specific usage in the examples below.
 
+If you assign the file type as `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts various elements, including headings, paragraphs, lists, code blocks, tables, and more.
+Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
+
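+As a minimal illustrative sketch (the path below is hypothetical), reading local markdown files might look like this:
+
+```hocon
+source {
+  LocalFile {
+    path = "/data/docs"             # hypothetical directory of .md files
+    file_format_type = "markdown"
+  }
+}
+```
+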
 ### read_columns [list]
 
 The read column list of the data source; users can use it to implement field projection.
diff --git a/docs/zh/connector-v2/source/OssFile.md b/docs/zh/connector-v2/source/OssFile.md
index 87ae7c34bb..91657ae1bd 100644
--- a/docs/zh/connector-v2/source/OssFile.md
+++ b/docs/zh/connector-v2/source/OssFile.md
@@ -45,12 +45,13 @@ import ChangeLog from '../changelog/connector-file-oss.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Data Type Mapping
 
 Data type mapping is related to the type of the file being read. We support the following file types:
 
-`text` `csv` `parquet` `orc` `json` `excel` `xml`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `markdown`
 
 ### JSON File Type
 
@@ -185,7 +186,7 @@ schema {
 | Name                      | Type    | Required | Default | Description |
 |---------------------------|---------|----------|---------|-------------|
 | path                      | string  | yes      | -       | The Oss path that needs to be read; it can have sub-paths, but the sub-paths must meet certain format requirements. For the specific requirements, refer to the "parse_partition_from_path" option |
-| file_format_type          | string  | yes      | -       | File type, the following file types are supported: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
+| file_format_type          | string  | yes      | -       | File type, the following file types are supported: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown` |
 | bucket                    | string  | yes      | -       | The bucket address of the oss file system, for example: `oss://seatunnel-test`. |
 | endpoint                  | string  | yes      | -       | The fs oss endpoint |
 | read_columns              | list    | no       | -       | The read column list of the data source; users can use it to implement field projection. The file types that support column projection are: `text` `csv` `parquet` `orc` `json` `excel` `xml`. If users want to use this feature when reading `text` `json` `csv` files, the "schema" option must be configured. |
@@ -241,6 +242,26 @@ schema {
 
 Whether to read the complete file as a single chunk instead of splitting it into chunks. When enabled, the entire file content will be read into memory at once. The default is false.
 
+### file_format_type [string]
+
+File type, the following file types are supported:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign the file type as `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts various elements, including headings, paragraphs, lists, code blocks, tables, and more.
+Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
+
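+As a minimal illustrative sketch (the bucket, endpoint, and credentials below are hypothetical), reading markdown files from OSS might look like this:
+
+```hocon
+source {
+  OssFile {
+    bucket = "oss://seatunnel-test"          # hypothetical bucket
+    endpoint = "oss-cn-beijing.aliyuncs.com" # hypothetical endpoint
+    access_key = "xxxxxxxxxxxx"              # placeholder credentials
+    access_secret = "xxxxxxxxxxxx"
+    path = "/seatunnel/docs"                 # hypothetical directory of .md files
+    file_format_type = "markdown"
+  }
+}
+```
+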
 ### file_filter_pattern [string]
 
 The filter pattern, which is used to filter files.
diff --git a/docs/zh/connector-v2/source/S3File.md b/docs/zh/connector-v2/source/S3File.md
index 627875ff4c..fdd6027882 100644
--- a/docs/zh/connector-v2/source/S3File.md
+++ b/docs/zh/connector-v2/source/S3File.md
@@ -34,6 +34,7 @@ import ChangeLog from '../changelog/connector-file-s3.md';
     - [x] excel
     - [x] xml
     - [x] binary
+    - [x] markdown
 
 ## Description
 
@@ -191,7 +192,7 @@ schema {
 | Name                            | Type    | Required | Default                                               | Description |
 |---------------------------------|---------|----------|-------------------------------------------------------|-------------|
 | path                            | string  | yes      | -                                                     | The s3 path that needs to be read; it can have sub-paths, but the sub-paths must meet certain format requirements. For the specific requirements, refer to the "parse_partition_from_path" option |
-| file_format_type                | string  | yes      | -                                                     | File type, the following file types are supported: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` |
+| file_format_type                | string  | yes      | -                                                     | File type, the following file types are supported: `text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown` |
 | bucket                          | string  | yes      | -                                                     | The bucket address of the s3 file system, for example: `s3n://seatunnel-test`; if you use the `s3a` protocol, this parameter should be `s3a://seatunnel-test`. |
 | fs.s3a.endpoint                 | string  | yes      | -                                                     | The fs s3a endpoint |
 | fs.s3a.aws.credentials.provider | string  | yes      | com.amazonaws.auth.InstanceProfileCredentialsProvider | The authentication method of s3a. We currently only support `org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider` and `com.amazonaws.auth.InstanceProfileCredentialsProvider`. For more information about credential providers, see the [Hadoop AWS documentation](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A) |
@@ -327,6 +328,26 @@ schema {
 
 Whether to read the complete file as a single chunk instead of splitting it into chunks. When enabled, the entire file content will be read into memory at once. The default is false.
 
+### file_format_type [string]
+
+File type, the following file types are supported:
+
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
+
+If you assign the file type as `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts various elements, including headings, paragraphs, lists, code blocks, tables, and more.
+Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
+
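+As a minimal illustrative sketch (the bucket, endpoint, and credentials below are hypothetical), reading markdown files from S3 might look like this:
+
+```hocon
+source {
+  S3File {
+    bucket = "s3a://seatunnel-test"                     # hypothetical bucket
+    fs.s3a.endpoint = "s3.cn-north-1.amazonaws.com.cn"  # hypothetical endpoint
+    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
+    access_key = "xxxxxxxxxxxx"                         # placeholder credentials
+    secret_key = "xxxxxxxxxxxx"
+    path = "/seatunnel/docs"                            # hypothetical directory of .md files
+    file_format_type = "markdown"
+  }
+}
+```
+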
 ## Example
 
 1. In this example, we read data from the s3 path `s3a://seatunnel-test/seatunnel/text`, and the file type in this path is orc.
diff --git a/docs/zh/connector-v2/source/SftpFile.md b/docs/zh/connector-v2/source/SftpFile.md
index a7463f96bd..7dbe5ea375 100644
--- a/docs/zh/connector-v2/source/SftpFile.md
+++ b/docs/zh/connector-v2/source/SftpFile.md
@@ -29,6 +29,7 @@ import ChangeLog from '../changelog/connector-file-sftp.md';
   - [x] excel
   - [x] xml
   - [x] binary
+  - [x] markdown
 
 ## Description
 
@@ -166,7 +167,7 @@ import ChangeLog from '../changelog/connector-file-sftp.md';
 ### file_format_type [string]
 
 File type, the following file types are supported:
-`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`
+`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary` `markdown`
 If you assign the file type as `json`, you should also specify the schema option to tell the connector how to parse the data into the rows you want.
 For example:
 The upstream data is as follows:
@@ -231,6 +232,20 @@ schema {
 
 If you assign the file type as `binary`, SeaTunnel can synchronize files in any format,
 such as compressed archives, pictures, and so on. In short, any file can be synchronized to the target location.
 Under this requirement, you need to make sure that the source and the sink use the `binary` format for file synchronization at the same time.
+
+If you assign the file type as `markdown`, SeaTunnel can parse markdown files and extract structured data.
+The markdown parser extracts various elements, including headings, paragraphs, lists, code blocks, tables, and more.
+Each element is converted into a row with the following schema:
+- `element_id`: the unique identifier of the element
+- `element_type`: the element type (Heading, Paragraph, ListItem, etc.)
+- `heading_level`: the heading level (1-6, null for non-heading elements)
+- `text`: the text content of the element
+- `page_number`: the page number (default: 1)
+- `position_index`: the position index within the document
+- `parent_id`: the ID of the parent element
+- `child_ids`: a comma-separated list of child element IDs
+
+Note: The markdown format only supports reading, not writing.
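+
+As a minimal illustrative sketch (the host, credentials, and path below are hypothetical), a source block that reads markdown files over SFTP might look like this:
+
+```hocon
+source {
+  SftpFile {
+    host = "sftp.example.com"     # hypothetical server address
+    port = 22
+    user = "sftpuser"             # hypothetical credentials
+    password = "sftppassword"
+    path = "/data/docs"           # hypothetical directory of .md files
+    file_format_type = "markdown"
+  }
+}
+```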
 
 ### compress_codec [string]
