This is an automated email from the ASF dual-hosted git repository.
fanjia pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/seatunnel.git
The following commit(s) were added to refs/heads/dev by this push:
new 40378f507d [Doc][Improve] support chinese
[docs/zh/connector-v2/source/FakeSource.md] (#8847)
40378f507d is described below
commit 40378f507dd085f5fb480efa6cd3959088e26eab
Author: Scorpio777888 <[email protected]>
AuthorDate: Thu Feb 27 10:03:37 2025 +0800
[Doc][Improve] support chinese [docs/zh/connector-v2/source/FakeSource.md]
(#8847)
Co-authored-by: Gemini147258 <[email protected]>
---
docs/zh/connector-v2/source/FakeSource.md | 541 ++++++++++++++++++++++++++++++
1 file changed, 541 insertions(+)
diff --git a/docs/zh/connector-v2/source/FakeSource.md
b/docs/zh/connector-v2/source/FakeSource.md
new file mode 100644
index 0000000000..c4515d17f7
--- /dev/null
+++ b/docs/zh/connector-v2/source/FakeSource.md
@@ -0,0 +1,541 @@
+# FakeSource
+
+> FakeSource 连接器
+
+## 支持的引擎
+
+> Spark<br/>
+> Flink<br/>
+> SeaTunnel Zeta<br/>
+
+## 描述
+
+FakeSource 是一个虚拟数据源,它根据用户定义的 schema 数据结构随机生成指定数量的行数据,主要用于类型转换或连接器新功能测试等测试场景。
+
+## 主要特性
+
+- [x] [批处理](../../concept/connector-v2-features.md)
+- [x] [流处理](../../concept/connector-v2-features.md)
+- [ ] [精确一次](../../concept/connector-v2-features.md)
+- [x] [列投影](../../concept/connector-v2-features.md)
+- [ ] [并行度](../../concept/connector-v2-features.md)
+- [ ] [支持用户自定义分片](../../concept/connector-v2-features.md)
+
+## 数据源选项
+
+| 名称 | 类型 | 必填 | 默认值 | 描述
|
+|---------------------------|---------|------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| tables_configs | list | 否 | - | 定义多个
FakeSource,每个项可以包含完整的 FakeSource 配置描述
|
+| schema | config | 是 | - | 定义 Schema 信息
|
+| rows | config | 否 | - | 每个并行度输出的伪数据行列表,详见标题
`Options rows Case`
|
+| row.num | int | 否 | 5 | 每个并行度生成的数据总行数
|
+| split.num | int | 否 | 1 | 枚举器为每个并行度生成的分片数量
|
+| split.read-interval | long | 否 | 1 | 读取器在两个分片读取之间的间隔时间(毫秒)
|
+| map.size | int | 否 | 5 | 连接器生成的 `map` 类型的大小
|
+| array.size | int | 否 | 5 | 连接器生成的 `array` 类型的大小
|
+| bytes.length | int | 否 | 5 | 连接器生成的 `bytes` 类型的长度
|
+| string.length | int | 否 | 5 | 连接器生成的 `string` 类型的长度
|
+| string.fake.mode | string | 否 | range | 生成字符串数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `string.template` 选项
|
+| string.template | list | 否 | - |
连接器生成的字符串类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| tinyint.fake.mode | string | 否 | range | 生成 tinyint 数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `tinyint.template` 选项
|
+| tinyint.min | tinyint | 否 | 0 | 连接器生成的 tinyint 数据的最小值
|
+| tinyint.max | tinyint | 否 | 127 | 连接器生成的 tinyint 数据的最大值
|
+| tinyint.template | list | 否 | - | 连接器生成的 tinyint
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| smallint.fake.mode | string | 否 | range | 生成 smallint
数据的伪数据模式,支持 `range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置
`smallint.template` 选项
|
+| smallint.min | smallint| 否 | 0 | 连接器生成的 smallint 数据的最小值
|
+| smallint.max | smallint| 否 | 32767 | 连接器生成的 smallint 数据的最大值
|
+| smallint.template | list | 否 | - | 连接器生成的 smallint
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| int.fake.template | string | 否 | range | 生成 int 数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `int.template` 选项
|
+| int.min | smallint | 否 | 0 | 连接器生成的 int 数据的最小值
|
+| int.max | smallint | 否 | 0x7fffffff | 连接器生成的 int
数据的最大值
|
+| int.template | list | 否 | - | 连接器生成的 int
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| bigint.fake.mode | string | 否 | range | 生成 bigint 数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `bigint.template` 选项
|
+| bigint.min | bigint | 否 | 0 | 连接器生成的 bigint 数据的最小值
|
+| bigint.max | bigint | 否 | 0x7fffffffffffffff | 连接器生成的
bigint 数据的最大值
|
+| bigint.template | list | 否 | - | 连接器生成的 bigint
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| float.fake.mode | string | 否 | range | 生成 float 数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `float.template` 选项
|
+| float.min | float | 否 | 0 | 连接器生成的 float 数据的最小值
|
+| float.max | float | 否 | 0x1.fffffeP+127 | 连接器生成的 float
数据的最大值
|
+| float.template | list | 否 | - | 连接器生成的 float
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| double.fake.mode | string | 否 | range | 生成 double 数据的伪数据模式,支持
`range` 和 `template`,默认为 `range`,如果配置为 `template`,用户还需配置 `double.template` 选项
|
+| double.min | double | 否 | 0 | 连接器生成的 double 数据的最小值
|
+| double.max | double | 否 | 0x1.fffffffffffffP+1023 | 连接器生成的
double 数据的最大值
|
+| double.template | list | 否 | - | 连接器生成的 double
类型的模板列表,如果用户配置了此选项,连接器将从模板列表中随机选择一个项
|
+| vector.dimension | int | 否 | 4 | 生成的向量的维度,不包括二进制向量
|
+| binary.vector.dimension | int | 否 | 8 | 生成的二进制向量的维度
|
+| vector.float.min | float | 否 | 0 | 连接器生成的向量中 float 数据的最小值
|
+| vector.float.max | float | 否 | 0x1.fffffeP+127 | 连接器生成的向量中
float 数据的最大值
|
+| common-options | | 否 | - | 数据源插件通用参数,详情请参考
[Source Common Options](../source-common-options.md)
|
+
+## 任务示例
+
+### 简单示例:
+
+> 此示例随机生成指定类型的数据。如果您想了解如何声明字段类型,请点击
[这里](../../concept/schema-feature.md#how-to-declare-type-supported)。
+
+```hocon
+schema = {
+ fields {
+ c_map = "map<string, array<int>>"
+ c_map_nest = "map<string, {c_int = int, c_string = string}>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ c_row = {
+ c_map = "map<string, map<string, string>>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+}
+```
+
+### 随机生成
+
+> 随机生成 16 条符合类型的数据
+
+```hocon
+source {
+ # 这是一个示例输入插件,**仅用于测试和演示功能输入插件**
+ FakeSource {
+ row.num = 16
+ schema = {
+ fields {
+ c_map = "map<string, string>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ plugin_output = "fake"
+ }
+}
+```
+
+### 自定义数据内容简单示例:
+
+> 这是一个自定义数据源信息的示例,定义每条数据是添加还是删除修改操作,并定义每个字段存储的内容
+
+```hocon
+source {
+ FakeSource {
+ schema = {
+ fields {
+ c_map = "map<string, string>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ rows = [
+ {
+ kind = INSERT
+ fields = [{"a": "b"}, [101], "c_string", true, 117, 15987, 56387395,
7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null,
"bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
+ }
+ {
+ kind = UPDATE_BEFORE
+ fields = [{"a": "c"}, [102], "c_string", true, 117, 15987, 56387395,
7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null,
"bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
+ }
+ {
+ kind = UPDATE_AFTER
+ fields = [{"a": "e"}, [103], "c_string", true, 117, 15987, 56387395,
7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null,
"bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
+ }
+ {
+ kind = DELETE
+ fields = [{"a": "f"}, [104], "c_string", true, 117, 15987, 56387395,
7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null,
"bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
+ }
+ ]
+ }
+}
+```
+
+> 由于 [HOCON](https://github.com/lightbend/config/blob/main/HOCON.md)
规范的限制,用户无法直接创建字节序列对象。FakeSource 使用字符串来分配 `bytes` 类型的值。在上面的示例中,`bytes` 类型字段被分配了
`"bWlJWmo="`,这是通过 **base64** 编码的 "miIZj"。因此,在为 `bytes` 类型字段赋值时,请使用 **base64**
编码的字符串。
+
+### 指定数据数量简单示例:
+
+> 此案例指定生成数据的数量以及生成值的长度
+
+```hocon
+FakeSource {
+ row.num = 10
+ map.size = 10
+ array.size = 10
+ bytes.length = 10
+ string.length = 10
+ schema = {
+ fields {
+ c_map = "map<string, array<int>>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ c_row = {
+ c_map = "map<string, map<string, string>>"
+ c_array = "array<int>"
+ c_string = string
+ c_boolean = boolean
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ c_decimal = "decimal(30, 8)"
+ c_null = "null"
+ c_bytes = bytes
+ c_date = date
+ c_timestamp = timestamp
+ }
+ }
+ }
+}
+```
+
+### 模板数据简单示例:
+
+> 根据指定模板随机生成
+
+使用模板
+
+```hocon
+FakeSource {
+ row.num = 5
+ string.fake.mode = "template"
+ string.template = ["tyrantlucifer", "hailin", "kris", "fanjia", "zongwen",
"gaojun"]
+ tinyint.fake.mode = "template"
+ tinyint.template = [1, 2, 3, 4, 5, 6, 7, 8, 9]
+ smalling.fake.mode = "template"
+ smallint.template = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
+ int.fake.mode = "template"
+ int.template = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
+ bigint.fake.mode = "template"
+ bigint.template = [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
+ float.fake.mode = "template"
+ float.template = [40.0, 41.0, 42.0, 43.0]
+ double.fake.mode = "template"
+ double.template = [44.0, 45.0, 46.0, 47.0]
+ schema {
+ fields {
+ c_string = string
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ }
+ }
+}
+```
+
+### 范围数据简单示例:
+
+> 在指定的数据生成范围内随机生成
+
+```hocon
+FakeSource {
+ row.num = 5
+ string.template = ["tyrantlucifer", "hailin", "kris", "fanjia", "zongwen",
"gaojun"]
+ tinyint.min = 1
+ tinyint.max = 9
+ smallint.min = 10
+ smallint.max = 19
+ int.min = 20
+ int.max = 29
+ bigint.min = 30
+ bigint.max = 39
+ float.min = 40.0
+ float.max = 43.0
+ double.min = 44.0
+ double.max = 47.0
+ schema {
+ fields {
+ c_string = string
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ }
+ }
+}
+```
+
+
+### 生成多张表
+
+> 这是一个生成多数据源测试表 `test.table1` 和 `test.table2` 的示例
+
+```hocon
+FakeSource {
+ tables_configs = [
+ {
+ row.num = 16
+ schema {
+ table = "test.table1"
+ fields {
+ c_string = string
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ }
+ }
+ },
+ {
+ row.num = 17
+ schema {
+ table = "test.table2"
+ fields {
+ c_string = string
+ c_tinyint = tinyint
+ c_smallint = smallint
+ c_int = int
+ c_bigint = bigint
+ c_float = float
+ c_double = double
+ }
+ }
+ }
+ ]
+}
+```
+
+### `rows` 选项示例
+
+```hocon
+rows = [
+ {
+ kind = INSERT
+ fields = [1, "A", 100]
+ },
+ {
+ kind = UPDATE_BEFORE
+ fields = [1, "A", 100]
+ },
+ {
+ kind = UPDATE_AFTER
+ fields = [1, "A_1", 100]
+ },
+ {
+ kind = DELETE
+ fields = [1, "A_1", 100]
+ }
+]
+```
+
+### `table-names` 选项示例
+
+```hocon
+source {
+ # 这是一个示例源插件,**仅用于测试和演示源插件功能**
+ FakeSource {
+ table-names = ["test.table1", "test.table2", "test.table3"]
+ parallelism = 1
+ schema = {
+ fields {
+ name = "string"
+ age = "int"
+ }
+ }
+ }
+}
+```
+
+### `defaultValue` 选项示例
+
+可以通过 `row` 和 `columns` 生成自定义数据。对于时间类型,可以通过
`CURRENT_TIMESTAMP`、`CURRENT_TIME`、`CURRENT_DATE` 获取当前时间。
+
+```hocon
+ schema = {
+ fields {
+ pk_id = bigint
+ name = string
+ score = int
+ time1 = timestamp
+ time2 = time
+ time3 = date
+ }
+ }
+ # 使用 rows
+ rows = [
+ {
+ kind = INSERT
+ fields = [1, "A", 100, CURRENT_TIMESTAMP, CURRENT_TIME,
CURRENT_DATE]
+ }
+ ]
+```
+
+```hocon
+ schema = {
+ # 使用 columns
+ columns = [
+ {
+ name = book_publication_time
+ type = timestamp
+ defaultValue = "2024-09-12 15:45:30"
+ comment = "书籍出版时间"
+ },
+ {
+ name = book_publication_time2
+ type = timestamp
+ defaultValue = CURRENT_TIMESTAMP
+ comment = "书籍出版时间2"
+ },
+ {
+ name = book_publication_time3
+ type = time
+ defaultValue = "15:45:30"
+ comment = "书籍出版时间3"
+ },
+ {
+ name = book_publication_time4
+ type = time
+ defaultValue = CURRENT_TIME
+ comment = "书籍出版时间4"
+ },
+ {
+ name = book_publication_time5
+ type = date
+ defaultValue = "2024-09-12"
+ comment = "书籍出版时间5"
+ },
+ {
+ name = book_publication_time6
+ type = date
+ defaultValue = CURRENT_DATE
+ comment = "书籍出版时间6"
+ }
+ ]
+ }
+```
+
+### 使用向量示例
+
+```hocon
+source {
+ FakeSource {
+ row.num = 10
+ # 低优先级
+ vector.dimension= 4
+ binary.vector.dimension = 8
+ # 低优先级
+ schema = {
+ table = "simple_example"
+ columns = [
+ {
+ name = book_id
+ type = bigint
+ nullable = false
+ defaultValue = 0
+ comment = "主键 ID"
+ },
+ {
+ name = book_intro_1
+ type = binary_vector
+ columnScale =8
+ comment = "向量"
+ },
+ {
+ name = book_intro_2
+ type = float16_vector
+ columnScale =4
+ comment = "向量"
+ },
+ {
+ name = book_intro_3
+ type = bfloat16_vector
+ columnScale =4
+ comment = "向量"
+ },
+ {
+ name = book_intro_4
+ type = sparse_float_vector
+ columnScale =4
+ comment = "向量"
+ }
+ ]
+ }
+ }
+}
+```
+
+## 更新日志
+
+### 2.2.0-beta 2022-09-26
+
+- 新增 FakeSource 源连接器
+
+### 2.3.0-beta 2022-10-20
+
+- [改进] 支持直接定义数据值(row)([2839](https://github.com/apache/seatunnel/pull/2839))
+- [改进] 改进 FakeSource
连接器:([2944](https://github.com/apache/seatunnel/pull/2944))
+ - 支持用户自定义 Map 大小
+ - 支持用户自定义数组大小
+ - 支持用户自定义字符串长度
+ - 支持用户自定义字节长度
+- [改进] 支持 FakeSource 连接器的多分片
([2974](https://github.com/apache/seatunnel/pull/2974))
+- [改进] 支持设置每个并行度的分片数量以及两个分片之间的读取间隔
([3098](https://github.com/apache/seatunnel/pull/3098))
+
+### 下一个版本
+
+- [功能] 支持配置假数据行 [3865](https://github.com/apache/seatunnel/pull/3865)
+- [功能] 支持为假数据配置模板或范围 [3932](https://github.com/apache/seatunnel/pull/3932)