This is an automated email from the ASF dual-hosted git repository.
dockerzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/inlong-website.git
The following commit(s) were added to refs/heads/master by this push:
new 713d2738aa [INLONG-682][Document] Sort out the format concept and
related attributes (#683)
713d2738aa is described below
commit 713d2738aa364be79b0d1a9c6a844034ed98a308
Author: feat <[email protected]>
AuthorDate: Thu Feb 9 22:02:17 2023 +0800
[INLONG-682][Document] Sort out the format concept and related attributes
(#683)
Co-authored-by: Charles Zhang <[email protected]>
---
docs/design_and_concept/img/format_and_flink.png | Bin 0 -> 86161 bytes
.../img/the_format_in_inlong.png | Bin 0 -> 42853 bytes
docs/design_and_concept/the_format_in_inlong.md | 95 ++++++++++++++++++++
.../design_and_concept/img/format_and_flink.png | Bin 0 -> 86161 bytes
.../img/the_format_in_inlong.png | Bin 0 -> 42853 bytes
.../design_and_concept/the_format_in_inlong.md | 100 +++++++++++++++++++++
6 files changed, 195 insertions(+)
diff --git a/docs/design_and_concept/img/format_and_flink.png
b/docs/design_and_concept/img/format_and_flink.png
new file mode 100644
index 0000000000..7d49580263
Binary files /dev/null and b/docs/design_and_concept/img/format_and_flink.png
differ
diff --git a/docs/design_and_concept/img/the_format_in_inlong.png
b/docs/design_and_concept/img/the_format_in_inlong.png
new file mode 100644
index 0000000000..dac80d2466
Binary files /dev/null and
b/docs/design_and_concept/img/the_format_in_inlong.png differ
diff --git a/docs/design_and_concept/the_format_in_inlong.md
b/docs/design_and_concept/the_format_in_inlong.md
new file mode 100644
index 0000000000..94d107ee62
--- /dev/null
+++ b/docs/design_and_concept/the_format_in_inlong.md
@@ -0,0 +1,95 @@
+---
+title: Format
+sidebar_position: 7
+---
+
+## What is Format?
+
+
+
+As shown in the figure, Flink SQL reads and writes data as a `Row`. Internally, a Row holds an object array (`Object[]`), and each element of the array represents one field of the Flink table. The `Schema` describes each field's type, name, and precision.
+
+Format provides two interfaces, `SerializationSchema` and `DeserializationSchema`:
+- When Flink writes data to MQ, it serializes the `Flink Row` into a format such as `key-value`, `csv`, or `Json` by calling `SerializationSchema#serialize`. The result is a `byte[]` that can be written to MQ.
+- When Flink reads data from MQ, the process runs in reverse: it reads a `byte[]` from MQ, deserializes it according to the format, and finally converts the result into a `Flink Row`.
+
+> For details, see [`inlong-sort/sort-formats`](https://github.com/apache/inlong/tree/release-1.5.0/inlong-sort/sort-formats)
+
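The round trip described above can be sketched in a few lines. The following Python snippet is a minimal illustration, not InLong's or Flink's implementation: `serialize_kv`, `deserialize_kv`, and the `(name, converter)` schema layout are hypothetical names chosen for this example. The `&` and `=` delimiters follow the Key-Value defaults listed later in this document, and escaping, quoting, and null handling are deliberately left out.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical schema layout: one (field name, text-to-value converter) pair per field.
Schema = List[Tuple[str, Callable[[str], Any]]]


def serialize_kv(row: List[Any], schema: Schema, charset: str = "UTF-8") -> bytes:
    """Turn a row (one value per schema field) into key-value bytes."""
    if len(row) != len(schema):
        raise ValueError("row and schema must have the same number of fields")
    entries = [f"{name}={value}" for (name, _), value in zip(schema, row)]
    return "&".join(entries).encode(charset)


def deserialize_kv(data: bytes, schema: Schema, charset: str = "UTF-8") -> List[Any]:
    """Turn key-value bytes back into a row, ordered and typed by the schema."""
    text = data.decode(charset)
    pairs = dict(entry.split("=", 1) for entry in text.split("&"))
    return [convert(pairs[name]) for name, convert in schema]


schema = [("id", int), ("name", str), ("score", float)]
row = [7, "alice", 9.5]

payload = serialize_kv(row, schema)
print(payload)                          # b'id=7&name=alice&score=9.5'
print(deserialize_kv(payload, schema))  # [7, 'alice', 9.5]
```

The schema drives both directions: it fixes the field order when writing and supplies the type conversion when reading, which is why the byte payload itself carries no type information.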
+## The format in InLong
+
+
+
+InLong is a one-stop data integration platform that uses MQ (the Cache part in the picture) as its transmission channel, which decouples DataProxy from Sort and improves scalability. When DataProxy reports data, it serializes the data into the corresponding format (`SerializationSchema#serialize`). When Sort receives data, it deserializes the MQ data (`DeserializationSchema#deserialize`) into a `Flink Row` and then writes it to the corresponding storage using [...]
+
+## What are the formats?
+
+Currently, InLong-sort provides the CSV, KeyValue, and JSON formats, along with the corresponding InLongMsg packaging format.
+
+### CSV
+
+```xml
+<dependency>
+    <groupId>org.apache.inlong</groupId>
+    <artifactId>sort-format-csv</artifactId>
+    <version>${inlong.version}</version>
+</dependency>
+```
+
+`org.apache.inlong.sort.formats.csv.CsvFormatFactory`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|---------------------------|---------|----------|---------------|----------|--------|
+| `format.delimiter` | char | Y | `,` | N | |
+| `format.escape-character` | char | N | disabled | Y | |
+| `format.quote-character` | char | N | disabled | Y | |
+| `format.null-literal` | String | N | disabled | Y | |
+| `format.charset` | String | Y | "UTF-8" | N | |
+| `format.ignore-errors` | Boolean | Y | true | N | |
+| `format.derive_schema` | Boolean | N | Required if no format schema is defined. | Y | Derives the format schema from the table's schema, so that schema information only needs to be defined once. <br/> The names, types, and order of the format's fields are determined by the table's schema. <br/> Time attributes are ignored if their origin is not a field. <br/> A "from" definition is interpreted as a field renaming in the format. |
+
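To make the role of the delimiter, escape, quote, and null-literal options more concrete, here is a small Python sketch of how such options could interact when one line is split into fields. It is an illustration only: `split_csv_line` is a hypothetical helper, not InLong's parser, and the exact semantics (an escape character makes the next character literal, quotes protect delimiters, "disabled" means no special treatment) are assumptions made for this example.

```python
from typing import List, Optional


def split_csv_line(
    line: str,
    delimiter: str = ",",
    escape: Optional[str] = None,        # None models "disabled"
    quote: Optional[str] = None,         # None models "disabled"
    null_literal: Optional[str] = None,  # None models "disabled"
) -> List[Optional[str]]:
    """Split one line into fields, honouring the optional characters."""
    fields: List[str] = []
    current: List[str] = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if escape is not None and ch == escape and i + 1 < len(line):
            # Assumed semantics: the escaped character is taken literally.
            current.append(line[i + 1])
            i += 2
            continue
        if quote is not None and ch == quote:
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    # Map the configured null literal to a real null value.
    return [None if null_literal is not None and f == null_literal else f
            for f in fields]


# With every optional character disabled, quotes are ordinary characters.
print(split_csv_line('1,"c,d"'))
# ['1', '"c', 'd"']

# With the options enabled, escaped and quoted delimiters stay inside a field.
print(split_csv_line('1,a\\,b,"c,d",NULL', escape="\\", quote='"', null_literal="NULL"))
# ['1', 'a,b', 'c,d', None]
```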
+### Key-Value
+
+```xml
+<dependency>
+    <groupId>org.apache.inlong</groupId>
+    <artifactId>sort-format-kv</artifactId>
+    <version>${inlong.version}</version>
+</dependency>
+```
+
+`org.apache.inlong.sort.formats.kv.KvFormatFactory`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|---------------------------|---------|----------|---------------|----------|--------|
+| `format.entry-delimiter` | char | N | '&' | N | |
+| `format.kv-delimiter` | char | N | '=' | N | |
+| `format.escape-character` | char | N | disabled | Y | |
+| `format.quote-character` | char | N | disabled | Y | |
+| `format.null-literal` | char | N | disabled | Y | |
+| `format.charset` | String | Y | "UTF-8" | N | |
+| `format.ignore-errors` | Boolean | Y | true | N | |
+| `format.derive_schema` | Boolean | N | Required if no format schema is defined. | Y | Derives the format schema from the table's schema, so that schema information only needs to be defined once. <br/> The names, types, and order of the format's fields are determined by the table's schema. <br/> Time attributes are ignored if their origin is not a field. <br/> A "from" definition is interpreted as a field renaming in the format. |
+
+### JSON
+
+```xml
+<dependency>
+    <groupId>org.apache.flink</groupId>
+    <artifactId>flink-json</artifactId>
+    <version>${flink.version}</version>
+</dependency>
+```
+
+`org.apache.flink.formats.json.JsonFormatFactory`
+
+`org.apache.flink.formats.json.JsonOptions`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|----------------------------------|---------|----------|---------------|----------|--------|
+| `ignore-parse-errors` | Boolean | N | false | N | Optional flag to skip fields and rows with parse errors instead of failing; <br/> fields are set to null in case of errors. False by default. |
+| `map-null-key.mode` | String | N | "FAIL" | Y | Optional flag to control the handling mode when serializing null keys for map data. <br/> Option DROP drops null-key entries from map data. <br/> Option LITERAL uses 'map-null-key.literal' as the key literal. |
+| `map-null-key.literal` | String | N | "null" | Y | Optional flag to specify the string literal for null keys when 'map-null-key.mode' is LITERAL. |
+| `encode.decimal-as-plain-number` | Boolean | N | false | Y | Optional flag to specify whether to encode all decimals as plain numbers instead of possible scientific notation. False by default. |
+| `timestamp-format.standard` | String | N | "SQL" | Y | Optional flag to specify the timestamp format, SQL by default. <br/> Option ISO-8601 parses input timestamps in "yyyy-MM-ddTHH:mm:ss.s{precision}" format and outputs timestamps in the same format. <br/> Option SQL parses input timestamps in "yyyy-MM-dd HH:mm:ss.s{precision}" format and outputs timestamps in the same format. |
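The three `map-null-key.mode` values are easier to compare side by side. The Python sketch below mimics the behaviour described in the table; it is not Flink's code. `encode_map` is a hypothetical helper, and treating FAIL as "raise an error on a null key" is an assumption based on the mode's name and Flink's documented behaviour.

```python
import json
from typing import Any, Dict, Optional


def encode_map(data: Dict[Optional[str], Any],
               mode: str = "FAIL",
               literal: str = "null") -> str:
    """Serialize a map to JSON, handling a null (None) key according to mode."""
    out: Dict[str, Any] = {}
    for key, value in data.items():
        if key is None:
            if mode == "FAIL":
                raise ValueError("map contains a null key")
            if mode == "DROP":
                continue       # silently skip the entry
            if mode == "LITERAL":
                key = literal  # replace the null key with the configured literal
            else:
                raise ValueError(f"unknown mode: {mode}")
        out[key] = value
    return json.dumps(out)


data = {None: 1, "a": 2}
print(encode_map(data, mode="DROP"))     # {"a": 2}
print(encode_map(data, mode="LITERAL"))  # {"null": 1, "a": 2}
```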
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/format_and_flink.png
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/format_and_flink.png
new file mode 100644
index 0000000000..7d49580263
Binary files /dev/null and
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/format_and_flink.png
differ
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/the_format_in_inlong.png
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/the_format_in_inlong.png
new file mode 100644
index 0000000000..dac80d2466
Binary files /dev/null and
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/img/the_format_in_inlong.png
differ
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/the_format_in_inlong.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/the_format_in_inlong.md
new file mode 100644
index 0000000000..00c8b2c14a
--- /dev/null
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/the_format_in_inlong.md
@@ -0,0 +1,100 @@
+---
+title: Format
+sidebar_position: 7
+---
+
+## What is Format?
+
+
+
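+As shown in the figure above, Flink SQL reads and writes data as a Row. Internally, a Row is an object array (`Object[]`), and each element of the array represents one field of the Flink table.
+The `Schema` describes each field's type, name, precision, and other information.
+
+Flink's Format provides two interfaces: SerializationSchema and DeserializationSchema.
+
+- When Flink writes data to MQ, it serializes the `Flink Row` into a format such as `key-value`, `csv`, or `Json`.
+  This calls the `SerializationSchema#serialize` method, which serializes the data into a `byte[]` that is written to `MQ`.
+- When Flink reads data from MQ, the process runs in reverse: it reads a `byte[]` from MQ, deserializes it according to the `Format`, and then converts the result into a `Flink row`.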
+
+> For details, see the code in [`inlong-sort/sort-formats`](https://github.com/apache/inlong/tree/release-1.5.0/inlong-sort/sort-formats)
+
+## The Format in InLong
+
+
+
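+As a one-stop data integration platform, InLong uses MQ (the Cache part in the figure) as its transmission channel, which also decouples DataProxy from Sort and improves scalability:
+
+- When DataProxy reports data, it serializes the data into the corresponding format (`SerializationSchema#serialize`).
+- When Sort receives data, it deserializes the MQ data (`DeserializationSchema#deserialize`) into a `Flink Row` and writes it to the corresponding storage through Flink SQL.
+
+## What are the formats?
+
+Currently, InLong-Sort provides the CSV, KeyValue, and JSON formats, as well as formats wrapped in InLongMsg.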
+
+### CSV
+
+```xml
+<dependency>
+    <groupId>org.apache.inlong</groupId>
+    <artifactId>sort-format-csv</artifactId>
+    <version>${inlong.version}</version>
+</dependency>
+```
+
+`org.apache.inlong.sort.formats.csv.CsvFormatFactory`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|---------------------------|---------|----------|---------------|----------|--------|
+| `format.delimiter` | char | Y | `,` | N | |
+| `format.escape-character` | char | N | disabled | Y | |
+| `format.quote-character` | char | N | disabled | Y | |
+| `format.null-literal` | String | N | disabled | Y | |
+| `format.charset` | String | Y | "UTF-8" | N | |
+| `format.ignore-errors` | Boolean | Y | true | N | |
+| `format.derive_schema` | Boolean | N | Required if no format schema is defined. | Y | Derives the format schema from the table's schema, so that schema information only needs to be defined once. <br/> The names, types, and order of the format's fields are determined by the table's schema. <br/> Time attributes are ignored if they are not fields. <br/> A "from" definition is interpreted as a field renaming in the format. |
+
+### Key-Value
+
+```xml
+<dependency>
+    <groupId>org.apache.inlong</groupId>
+    <artifactId>sort-format-kv</artifactId>
+    <version>${inlong.version}</version>
+</dependency>
+```
+
+`org.apache.inlong.sort.formats.kv.KvFormatFactory`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|---------------------------|---------|----------|---------------|----------|--------|
+| `format.entry-delimiter` | char | N | '&' | N | |
+| `format.kv-delimiter` | char | N | '=' | N | |
+| `format.escape-character` | char | N | disabled | Y | |
+| `format.quote-character` | char | N | disabled | Y | |
+| `format.null-literal` | char | N | disabled | Y | |
+| `format.charset` | String | Y | "UTF-8" | N | |
+| `format.ignore-errors` | Boolean | Y | true | N | |
+| `format.derive_schema` | Boolean | N | Required if no format schema is defined. | Y | Derives the format schema from the table's schema, so that schema information only needs to be defined once. <br/> The names, types, and order of the format's fields are determined by the table's schema. <br/> Time attributes are ignored if they are not fields. <br/> A "from" definition is interpreted as a field renaming in the format. |
+
+### JSON
+
+```xml
+<dependency>
+    <groupId>org.apache.flink</groupId>
+    <artifactId>flink-json</artifactId>
+    <version>${flink.version}</version>
+</dependency>
+```
+
+`org.apache.flink.formats.json.JsonFormatFactory`
+
+`org.apache.flink.formats.json.JsonOptions`
+
+| Option | Type | Required | Default value | Advanced | Remark |
+|----------------------------------|---------|----------|---------------|----------|--------|
+| `ignore-parse-errors` | Boolean | N | false | N | Optional flag to skip fields and rows with parse errors instead of failing; <br/> fields are set to null in case of errors. False by default. |
+| `map-null-key.mode` | String | N | "FAIL" | Y | Optional flag to control the handling mode when serializing null keys for map data. <br/> Option DROP drops null-key entries from map data. <br/> Option LITERAL uses 'map-null-key.literal' as the key literal. |
+| `map-null-key.literal` | String | N | "null" | Y | Optional flag to specify the string literal for null keys when 'map-null-key.mode' is LITERAL. |
+| `encode.decimal-as-plain-number` | Boolean | N | false | Y | Optional flag to specify whether to encode all decimals as plain numbers instead of possible scientific notation. False by default. |
+| `timestamp-format.standard` | String | N | "SQL" | Y | Optional flag to specify the timestamp format, SQL by default. <br/> Option ISO-8601 parses input timestamps in "yyyy-MM-ddTHH:mm:ss.s{precision}" format and outputs timestamps in the same format. <br/> Option SQL parses input timestamps in "yyyy-MM-dd HH:mm:ss.s{precision}" format and outputs timestamps in the same format. |