This is an automated email from the ASF dual-hosted git repository.

eldenmoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 1ac0f836fac [improve](variant) add varaint doc mode (#3253)
1ac0f836fac is described below

commit 1ac0f836fac17c4e3d372bcdb766e6d56f96b82e
Author: Sun Chenyang <[email protected]>
AuthorDate: Wed Jan 7 22:02:22 2026 +0800

    [improve](variant) add varaint doc mode (#3253)
    
    ## Versions
    
    - [x] dev
    - [ ] 4.x
    - [ ] 3.x
    - [ ] 2.1
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [ ] Checked by AI
    - [ ] Test Cases Built
---
 .../sql-data-types/semi-structured/VARIANT.md      | 79 ++++++++++++++++++++--
 .../sql-data-types/semi-structured/VARIANT.md      | 77 +++++++++++++++++++--
 2 files changed, 148 insertions(+), 8 deletions(-)

diff --git 
a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md 
b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
index ec899292e9e..43752c3cdad 100644
--- a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
+++ b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
@@ -2,7 +2,7 @@
 {
     "title": "VARIANT",
     "language": "en-US",
-    "description": "The VARIANT type stores semi-structured JSON data. It can 
contain different primitive types (integers, strings, booleans, etc.),"
+    "description": "The VARIANT type stores semi-structured JSON data. It can 
contain different primitive types (integers, strings, booleans, etc.), 
one-dimensional arrays, and nested objects. On write, Doris infers the 
structure and type of sub-paths based on JSON paths and materializes frequent 
paths as independent subcolumns, leveraging columnar storage and vectorized 
execution for both flexibility and performance."
 }
 ---
 
@@ -391,10 +391,57 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
 | `VARCHAR`       | ✔        | ✔         |
 | `JSON`          | ✔        | ✔         |
 
+## Wide columns
+
+When ingested data contains many distinct JSON keys, VARIANT materialized 
subcolumns can grow rapidly; at scale this may cause metadata bloat, higher 
write/merge cost, and query slowdowns. To address “wide columns” (too many 
subcolumns), VARIANT provides two mechanisms: **Sparse columns** and **DOC 
encoding**.
+
+Note: these two mechanisms are mutually exclusive—enabling DOC encoding 
disables sparse columns, and vice versa.
+
+### Sparse columns
+
+**How it works**
+
+- The system ranks paths by non-null ratio / sparsity: high-frequency 
(less-sparse) paths are materialized as independent subcolumns; remaining 
low-frequency (sparse) paths are merged and stored in sparse columns. The 
maximum number of materialized subcolumns is controlled by 
`variant_max_subcolumns_count`.
+- If a path is declared in a Schema Template, by default it will not be moved 
into sparse columns; set `variant_enable_typed_paths_to_sparse` to allow typed 
paths to be moved into sparse columns.
+- Sparse columns support sharding: distribute sparse subpaths across multiple 
sparse columns to reduce per-column read overhead and improve read efficiency. 
Use `variant_sparse_hash_shard_count` to specify how many sparse columns are 
physically stored.
+
+**When to use**
+
+- Many JSON keys overall, but queries mostly target a small subset of 
high-frequency fields (hot fields).
+- Highly skewed key distribution (a few keys appear frequently; many keys 
appear occasionally): you want good hot-path performance while keeping 
long-tail keys queryable (typically slower).
+
+**Limitations and configuration notes**
+
+- If most keys have similar non-null ratios (little sparsity contrast), it’s 
hard to identify truly sparse paths and the benefit of sparse columns is 
reduced.
+- `variant_max_subcolumns_count` (materialized subcolumns) is recommended to 
stay ≤ 10000.
+- If you have high query requirements on typed paths (declared via Schema 
Template), prefer `variant_enable_typed_paths_to_sparse = false`.
+- `variant_sparse_hash_shard_count` can be roughly estimated as “number of 
sparse paths / 128”. Example: total JSON keys ≈ 10,000, 
`variant_max_subcolumns_count = 2000`, then sparse paths ≈ 8000, so 
`variant_sparse_hash_shard_count` can start around `8000/128`.
+
+### DOC encoding (DOC mode)
+
+**How it works**
+
+- Paths can still be materialized as independent subcolumns for path-based 
queries, and the original JSON is additionally stored as a stored field to 
return the full JSON document efficiently.
+- DOC encoding supports sharding: the original JSON is split into multiple 
columns for storage and reassembled when querying the full JSON. Use 
`variant_doc_hash_shard_count` to specify the number of DOC shards.
+- For small-batch writes, subcolumns can be skipped and materialized later 
during merges. This is controlled by `variant_doc_materialization_min_rows`. 
For example, if `variant_doc_materialization_min_rows = 10000`, writes below 
10,000 rows will only store the original JSON and won’t materialize subcolumns 
for that batch.
+
+**When to use**
+
+- Many distinct keys overall, but each row contains only a small subset of 
keys (e.g. per-row keys < 5% of total keys): typical sparse wide-column 
workloads.
+- Workloads frequently need the full JSON document (e.g. `SELECT *` / full-row 
return) and you want to avoid assembling JSON from a large number of subcolumns.
+- You want to reduce subcolumn materialization overhead for small batches and 
defer it to later merges.
+- You can accept additional storage cost (because the original JSON is stored 
as a stored field).
+
+**Limitations and configuration notes**
+
+- DOC mode requires `variant_enable_doc_mode = true`.
+- In DOC mode, typed paths declared via Schema Template are limited to 
numeric, string, and array types.
+- `variant_doc_hash_shard_count` can be roughly estimated as “total JSON keys 
/ 128”.
+
+See the “Configuration” section below for the full property list.
+
 ## Limitations
 
-- `variant_max_subcolumns_count`: default 0 (no limit). In production, set to 
2048 (tablet level) to control the number of materialized paths. Above the 
threshold, low-frequency/sparse paths are moved to a shared data structure; 
reading from it may be slower (see “Configuration”).
-- If a path type is specified via Schema Template, that path will be forced to 
be materialized; when `variant_enable_typed_paths_to_sparse = true`, it also 
counts toward the threshold and may be moved to the shared structure.
 - JSON key length ≤ 255.
 - Cannot be a primary key or sort key.
 - Cannot be nested within other types (e.g., `Array<Variant>`, 
`Struct<Variant>`).
@@ -428,7 +475,8 @@ CREATE TABLE example_table (
       'path_2' : STRING,
       properties(
           'variant_max_subcolumns_count' = '2048',
-          'variant_enable_typed_paths_to_sparse' = 'true'
+          'variant_enable_typed_paths_to_sparse' = 'true',
+          'variant_sparse_hash_shard_count' = '64'
       )
   >
 );
@@ -438,6 +486,29 @@ CREATE TABLE example_table (
 <tr><td>Property<br/></td><td>Description<br/></td></tr>
 <tr><td>`variant_max_subcolumns_count`<br/></td><td>Max number of materialized 
paths. Above the threshold, new paths may be stored in a shared data structure. 
Default 0 (unlimited). Recommended 2048; do not exceed 10000.<br/></td></tr>
 <tr><td>`variant_enable_typed_paths_to_sparse`<br/></td><td>By default, typed 
paths are always materialized (and do not count against 
`variant_max_subcolumns_count`). When set to `true`, typed paths also count 
toward the threshold and may be moved to the shared structure.<br/></td></tr>
+<tr><td>`variant_sparse_hash_shard_count`<br/></td><td>Shard count for sparse 
columns. Distributes sparse subpaths across multiple sparse columns to improve 
read performance. Default 1; tune based on the number of sparse 
subpaths.<br/></td></tr>
+</table>
+
+```sql
+CREATE TABLE example_table (
+  id INT,
+  data_variant VARIANT<
+      'path_1' : INT,
+      'path_2' : STRING,
+      properties(
+          'variant_enable_doc_mode' = 'true',
+          'variant_doc_materialization_min_rows' = '10000',
+          'variant_doc_hash_shard_count' = '64'
+      )
+  >
+);
+```
+
+<table>
+<tr><td>Property<br/></td><td>Description<br/></td></tr>
+<tr><td>`variant_enable_doc_mode`<br/></td><td>Enable DOC encoding mode. When 
`true`, the original JSON is stored as a stored field to quickly return the 
whole JSON document. DOC mode is mutually exclusive with sparse columns. 
Default `false`.<br/></td></tr>
+<tr><td>`variant_doc_materialization_min_rows`<br/></td><td>Minimum row 
threshold to materialize subcolumns in DOC mode. When rows are below this 
value, only the original JSON is stored; after compaction merges files to reach 
the threshold, subcolumns are materialized. Helps reduce overhead for 
small-batch writes.<br/></td></tr>
+<tr><td>`variant_doc_hash_shard_count`<br/></td><td>Shard count for DOC 
encoding. The original JSON is split into the specified number of columns for 
storage and reassembled when querying the whole JSON. Default 64; tune based on 
JSON size and concurrency.<br/></td></tr>
 </table>
 
 Behavior at limits and tuning suggestions:
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
index aa89472923f..a187bb63ae5 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
@@ -391,10 +391,55 @@ SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
 | `VARCHAR`       | ✔        | ✔         |
 | `JSON`          | ✔        | ✔         |
 
+## 宽列
+
+当导入数据包含大量不同的 JSON key 时,VARIANT 
的子列会迅速增多;当规模达到一定程度,可能出现元数据膨胀、写入/合并开销增大、查询性能下降等问题。为应对“宽列”(子列过多),VARIANT 
提供两种机制:**稀疏列** 与 **DOC 编码**。
+
+注意:这两种机制**互斥**——启用 DOC 编码后将无法使用稀疏列机制,反之亦然。
+
+### 稀疏列机制
+
+**机制说明**
+
+- 系统会按“非空比例/稀疏度”对路径排序:高频(不稀疏)路径优先物化为独立子列;可物化的最大子列数量由 
`variant_max_subcolumns_count` 指定,其余低频(稀疏)路径会被合并存放到稀疏列中。
+- 如果对子列指定了预定义 Schema,默认情况下该子列不会被放入稀疏列中;可通过 
`variant_enable_typed_paths_to_sparse` 允许“指定了类型”的子列进入稀疏列。
+- 稀疏列支持 sharding:通过将稀疏子路径分散到多个稀疏列中,降低单列读取负担、提升读取效率;可通过 
`variant_sparse_hash_shard_count` 指定稀疏列的实际存储个数。
+
+**适用场景**
+- JSON key 总量大,但查询主要集中在少数高频字段(热点字段)。
+- key 分布高度不均匀(少数 key 高频出现,大量 key 偶发出现):希望兼顾存储效率与热点查询性能。
+- 更关注“常用路径”的查询速度,能接受低频路径通过稀疏列读取(仍可正常查询,但通常更慢)。
+
+**限制与配置**
+- JSON key 总量大,但各个 JSON key 
的“非空比例/稀疏度”都比较接近、缺乏区分度:这种情况下很难区分哪些列是真正稀疏的,稀疏列机制的效果会被降低。
+- 物化为独立子列的数目(`variant_max_subcolumns_count`)建议最大上限为 **10000**。
+- 如果对“指定了预定义 Schema”的子列有较高查询要求,建议 `variant_enable_typed_paths_to_sparse` 设为 
**false**。
+- `variant_sparse_hash_shard_count` 的设置可按“进入稀疏列的总列数 / 128”粗略估算。例如:VARIANT 中所有 
JSON key 为 1 万,设置 `variant_max_subcolumns_count = 2000`,进入稀疏列的总列数约为 8000,则 
`variant_sparse_hash_shard_count` 可参考 `8000/128`。
+
+### DOC 编码机制
+
+**机制说明**
+
+- 子路径仍可物化为独立子列用于按路径查询,同时会额外保存一份“原始 JSON”作为存储字段,以便更快返回整条 JSON 文档。
+- DOC 编码支持 sharding:原始 JSON 会被拆分到多个列中存储,读取整条 JSON 时再组装这些分片;可通过 
`variant_doc_hash_shard_count` 指定 DOC 编码的实际分片数。
+- 小批量写入时可以不物化子列,后续合并时再触发子列物化:该行为由 `variant_doc_materialization_min_rows` 决定。例如 
`variant_doc_materialization_min_rows = 10000`,当写入行数低于 1 万时,该批次只写入原始 
JSON,不会物化子列。
+
+**适用场景**
+
+- JSON 的“总体 key 种类很多”,但“单行 key 数量相对少”,单行 key 数量占总体的 5% 以下(典型稀疏宽列场景)。
+- 业务经常需要返回整条 JSON(例如 `SELECT *`/整行回显/原始文档取回),不希望从大量子列重组 JSON。
+- 希望降低小批量写入时的子列物化开销,并在后续合并时再触发子列物化。
+- 可以接受额外存储开销(因为会保存原始 JSON 存储字段)。
+
+**限制与配置**
+- 需开启 `variant_enable_doc_mode`。
+- 使用 DOC 编码机制时,指定了预定义 Schema 的子列类型只能是数值类型、字符串类型、array 类型。
+- `variant_doc_hash_shard_count` 的设置可按 “JSON key 的总个数 / 128” 粗略估算。
+
+两种机制的详细使用见下方“配置”章节。
+
 ## 限制
 
-- `variant_max_subcolumns_count`:默认 0(不限制 Path 物化列数)。建议在生产设置为 2048(Tablet 
级别)以控制列数。超过阈值后,低频/稀疏路径会被收敛到共享数据结构,从该结构查询可能带来性能下降(详见“配置”)。
-- 若 Schema Template 指定了 Path 类型,则该 Path 会被强制提取;当 
`variant_enable_typed_paths_to_sparse = true` 时,它也会计入阈值,可能被收敛到共享结构。
 - JSON key 长度 ≤ 255。
 - 不支持作为主键或排序键。
 - 不支持与其他类型嵌套(如 `Array<Variant>`、`Struct<Variant>`)。
@@ -428,16 +473,40 @@ CREATE TABLE example_table (
       'path_2' : STRING,
       properties(
           'variant_max_subcolumns_count' = '2048',
-          'variant_enable_typed_paths_to_sparse' = 'true'
+          'variant_enable_typed_paths_to_sparse' = 'true',
+          'variant_sparse_hash_shard_count' = '64'
       )
   >
 );
 ```
 
 <table>
-<tr><td>属性<br/></td><td>描述<br/></td></tr>
+<tr><td>稀疏列属性<br/></td><td>描述<br/></td></tr>
 <tr><td>`variant_max_subcolumns_count`<br/></td><td>控制 Path 
物化列数的上限;超过后新增路径可能存放于共享数据结构。默认 0 表示不限,建议设置为 2048;不推荐超过 10000。<br/></td></tr>
 <tr><td>`variant_enable_typed_paths_to_sparse`<br/></td><td>默认指定了 Path 类型后,该 
Path 一定会被提取(不计入 `variant_max_subcolumns_count`)。设置为 `true` 
后也会计入阈值,可能被收敛到共享结构。<br/></td></tr>
+<tr><td>`variant_sparse_hash_shard_count`<br/></td><td>控制稀疏列的分片数量。将稀疏子列分散存储到多个稀疏列中,以提升查询性能。默认值为
 1,建议根据稀疏子列数量适当调整。<br/></td></tr>
+</table>
+
+```sql
+CREATE TABLE example_table (
+  id INT,
+  data_variant VARIANT<
+      'path_1' : INT,
+      'path_2' : STRING,
+      properties(
+          'variant_enable_doc_mode' = 'true',
+          'variant_doc_materialization_min_rows' = '10000',
+          'variant_doc_hash_shard_count' = '64'
+      )
+  >
+);
+```
+
+<table>
+<tr><td>DOC 编码属性<br/></td><td>描述<br/></td></tr>
+<tr><td>`variant_enable_doc_mode`<br/></td><td>是否启用 DOC 编码模式。设置为 `true` 时,原始 
JSON 会作为存储字段保存,用于快速返回整个 JSON 文档。启用后将无法使用稀疏列机制。默认值为 `false`。<br/></td></tr>
+<tr><td>`variant_doc_materialization_min_rows`<br/></td><td>DOC 
编码模式下触发子列物化的最小行数阈值。当写入行数低于该值时,仅存储原始 
JSON;当文件合并后行数达到该阈值时,才将子列物化为独立子列。用于减少小批量写入时的开销。<br/></td></tr>
+<tr><td>`variant_doc_hash_shard_count`<br/></td><td>控制 DOC 编码的分片数量。原始 JSON 
会被拆散存储到指定数量的列中,查询整个 JSON 时再组装这些分片。默认值为 64,可根据 JSON 大小和并发需求调整。<br/></td></tr>
 </table>
 
 达到上限后的行为与调优建议:


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to