Re: [PR] refactor variant doc and add workload guide [doris-website]

via GitHub Mon, 30 Mar 2026 03:27:57 -0700


Copilot commented on code in PR #3503:
URL: https://github.com/apache/doris-website/pull/3503#discussion_r3008909204



##########
i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,294 @@
+---
+{
+    "title": "VARIANT 使用与配置指南",
+    "language": "zh-CN",
+    "description": "帮助用户判断何时使用 VARIANT、宽列场景如何在默认模式、Sparse 和 DOC mode 
之间做选择，以及如何确定起步配置。"
+}
+---
+
+## 概览
+
+`VARIANT` 用来存储半结构化 JSON，并对常用路径执行子列列式提取（Subcolumnization）。
+
+这篇文档用来帮你给新的 `VARIANT` 场景选方案。适合在设计阶段先回答下面这些问题：
+
+- 这个场景应该用 `VARIANT`，还是直接建静态列？
+- JSON 很宽时，应该先用默认模式、Sparse，还是 DOC mode？
+- 哪些配置先保持默认，哪些配置才值得优先调整？
+
+如果你已经确定要使用 `VARIANT`，只是想查语法或类型规则，请直接看 
[VARIANT](./VARIANT)。如果你只需要一个最小可运行的导入示例，请看 [导入 Variant 
数据](../../../../data-operate/import/complex-types/variant)。
+
+:::tip 为什么使用 VARIANT
+`VARIANT` 保留了 JSON 
的灵活性，同时又能让常用路径通过子列列式提取（Subcolumnization）获得类似普通列的裁剪、聚合和索引能力，不必先把整份文档固化成静态 
schema。对超宽 JSON，存储层优化也让更大规模的 Subcolumnization 保持可控。
+:::
+
+## 什么时候适合用 VARIANT
+
+当以下条件大多成立时，`VARIANT` 往往是合适的选择：
+
+- 输入是 JSON 或其他半结构化载荷，而且字段会持续演进。
+- 查询通常集中在一部分热点路径，而不是每一行都访问全部字段。
+- 你希望保留 schema 灵活性，同时获得列式分析性能。
+- 只有少量关键路径需要索引，其余路径可以保持动态。
+
+当以下条件更占主导时，优先考虑静态列：
+
+- schema 稳定，并且可以提前定义清楚。
+- 核心字段经常作为 Join Key、排序键，或者必须严格类型治理。
+- 主要诉求是原样存档 JSON，而不是按路径做分析。
+
+## 先回答四个问题
+
+在动任何配置之前，先把下面四个问题答清楚。
+
+### 1. 有没有明确的热点路径？
+
+如果查询总是反复落在同一批 JSON Path 上，Doris 就能持续对这些路径执行子列列式提取（Subcolumnization）。这正是 
`VARIANT` 最有价值的场景。
+
+### 2. 是否有少数路径必须固定类型或稳定索引？
+
+如果有，就只给这些路径加 Schema Template。它适合少量关键字段，不适合把整份文档都描述一遍。
+
+### 3. 这是不是已经变成宽 JSON 问题？
+
+当 Path 数量持续增长，并开始带来元数据压力、Compaction 压力或明显的查询开销时，就已经进入宽 JSON 问题域。
+
+### 4. 对宽 JSON 来说，更重要的是热点路径分析，还是整条文档返回？
+
+- 如果核心价值仍然是热点字段上的过滤、聚合和索引，优先往 Sparse 方向想。
+- 如果核心价值更偏向导入效率或整条文档返回，优先往 DOC mode 方向想。
+
+## 关键概念
+
+阅读下面的模板之前，先确认以下术语清晰。每个概念用 2-3 行讲清边界；实现细节请参考 [VARIANT](./VARIANT)。
+
+**子列列式提取（Subcolumnization）。** 写入 `VARIANT` 列时，Doris 会自动发现 JSON 
Path，并对热点路径执行子列列式提取，使其以独立子列的形式参与分析。
+
+![默认 VARIANT：自动子列提取](/images/variant/variant-default-storage.png)
+
+**Schema Template。** 一种在 `VARIANT` 
列上的声明，用来把部分路径固定为稳定类型。它适合少量关键业务字段，让这些路径的类型、索引和行为更可控；不应试图穷举所有可能路径。
+
+**宽 JSON。** 当 JSON Path 的总数持续增长，并开始影响元数据规模、写入成本、Compaction 成本或查询成本时，就进入了宽 JSON 
问题域。
+
+**Sparse columns（稀疏列）。** 当宽 JSON 有明显的冷热分布时，Sparse 
让热点路径继续保留子列列式提取（Subcolumnization）的结果，而冷门（长尾）路径进入共享的稀疏存储。稀疏存储支持分片，将对多个物理列进行分散存储以提升读并行度。
+
+![Sparse Columns：冷热路径分离](/images/variant/variant-sparse-storage.png)
+
+如上图所示，热点路径（如 `user_id`、`page`）继续以独立列式子列的形式保持高性能分析能力，而数千个长尾路径则汇入共享稀疏存储。阈值通过 
`variant_max_subcolumns_count` 控制。
+
+**Sparse sharding（稀疏分片）。** 
当长尾路径数量非常大时，单个稀疏列可能成为读取瓶颈。稀疏分片通过哈希将长尾路径分散到多个物理列（`variant_sparse_hash_shard_count`），从而可以并行扫描。
+
+![Sparse Sharding：长尾路径并行读取](/images/variant/variant-sparse-sharding.png)
+
+**DOC mode。** 写入时延迟子列列式提取（Subcolumnization），并额外存储一份 map 格式的原始 
JSON。这带来了快速导入和高效整条文档返回能力，代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。
+
+![DOC Mode：延迟提取 + 快速文档返回](/images/variant/variant-doc-mode.png)
+
+如上图所示，写入时 JSON 被原样保存到 Doc Store 以实现快速导入。子列在后续 Compaction 过程中提取。读取时，按路径查询（如 
`SELECT v['user_id']`）从物化子列中以列式速度读取；而整条文档查询（`SELECT v`）则直接从 Doc Store 
中读取，无需从大量子列重组文档。
+
+**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用，尤其是宽 
JSON 场景，因为它消除了上千子列同时存在时的元数据瓶颈。
+
+## 推荐决策路径
+
+![VARIANT 模式决策路径](/images/variant/variant-decision-flowchart.png)
+
+## 几个常见起点
+
+- 事件日志或审计日志，查询主要围绕 `event_type`、`user_id`、`app_version` 这类常见路径：先用默认 
`VARIANT`，只有在路径数持续增长时再考虑 Sparse。
+- 广告、遥测或用户画像载荷，属性很多，但稳定查询的只是少量热点字段：先从 Sparse 开始。
+- 模型输出、Trace 快照或文档归档，查询经常需要整条 JSON：先从 DOC mode 开始。
+- 订单、支付或设备载荷里，只有少数路径如 `order_id`、`status`、`device_id` 必须保持稳定类型并建立索引：只给这些路径加 
Schema Template。
+
+对大多数 workload 来说，默认配置已经是合适的起点。只有在访问模式比较特殊时，才需要按场景调优。典型例子包括 AI 
训练特征载荷、车联网遥测、用户标签系统这类需要支撑大规模子列列式提取（Subcolumnization）和大量路径级索引的场景。
+
+## 推荐起步模板
+
+先用下表选一个起点，再看对应模板。
+
+| | 典型场景 | 推荐模式 | 关键配置 |
+|---|---|---|---|
+| **A** | 事件日志、审计日志 | 默认 VARIANT + V3 | 保持默认 |
+| **B** | 广告/遥测/用户画像（宽、热点少） | Sparse + V3 | 
`variant_max_subcolumns_count`、`variant_sparse_hash_shard_count` |
+| **C** | 模型输出/Trace/归档（写入优先或整条返回） | DOC mode + V3 | 
`variant_enable_doc_mode`、`variant_doc_materialization_min_rows` |
+| **D** | 订单/支付/设备（关键路径需稳定类型） | Schema Template + A 或 B | 只定义关键路径 |
+

Review Comment:
   表格头部使用了双竖线开头（`|| ...`），会导致多出空列，Docusaurus/GFM 渲染可能异常。建议改为标准 Markdown 
表格写法（单个前导 `|`），并确保表头/分隔行/数据行的列数一致。



##########
versioned_docs/version-4.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,294 @@
+---
+{
+    "title": "VARIANT Workload Guide",
+    "language": "en",
+    "description": "Decision guide for when to use VARIANT, how to choose 
between default, sparse, DOC mode, and Schema Template, and where to start with 
configuration."
+}
+---
+
+## Overview
+
+`VARIANT` stores semi-structured JSON and uses Subcolumnization on frequently 
used paths.
+
+Use this guide when you are deciding how to model a new `VARIANT` workload. It 
helps answer questions such as:
+
+- Should this workload use `VARIANT` or static columns?
+- If the JSON is very wide, should I start with default behavior, sparse 
columns, or DOC mode?
+- Which settings should I leave at defaults, and which ones should I change 
first?
+
+If you already know you want `VARIANT` and only need syntax or type rules, go 
to [VARIANT](./VARIANT). If you want the smallest runnable import example, go 
to [Import Variant Data](../../../../data-operate/import/complex-types/variant).
+
+:::tip Why choose VARIANT
+`VARIANT` keeps JSON flexible, but Doris can still apply Subcolumnization to 
frequently used paths. That lets common filters, aggregations, and path-level 
indexes work well without freezing the whole document schema in advance. On 
very wide JSON, storage-layer optimizations keep Subcolumnization practical at 
much larger path counts.
+:::
+
+## When VARIANT Fits
+
+`VARIANT` is usually a good fit when most of the following are true:
+
+- The input is JSON or another semi-structured payload whose fields evolve 
over time.
+- Queries usually touch a subset of hot paths instead of every field in every 
row.
+- You want schema flexibility without giving up columnar analytics performance.
+- Some paths need indexing, while many other paths can remain dynamic.
+
+Prefer static columns when these conditions dominate:
+
+- The schema is stable and known in advance.
+- Core fields are regularly used as join keys, sort keys, or tightly 
controlled typed columns.
+- The main requirement is to archive raw JSON, not to analyze by path.
+
+## Four Questions First
+
+Before touching any setting, answer these four questions.
+
+### 1. Are there clear hot paths?
+
+If queries repeatedly touch the same JSON paths, Doris can keep applying 
Subcolumnization to those paths. That is where `VARIANT` helps most.
+
+### 2. Do a few paths need fixed types or stable indexes?
+
+If yes, use Schema Template for those paths only. It is meant for a small set 
of business-critical fields, not for describing the whole document.
+
+### 3. Is this really becoming wide JSON?
+
+You have a wide-JSON problem when path count keeps growing and starts to 
create metadata pressure, compaction pressure, or noticeable query overhead.
+
+### 4. For wide JSON, what matters more: hot-path analytics or whole-document 
return?
+
+- If the main value is still path-based filtering, aggregation, and indexing 
on hot fields, lean toward sparse columns.
+- If the main value is ingest efficiency or returning the whole document, lean 
toward DOC mode.
+
+## Key Concepts
+
+Before reading the templates below, make sure these terms are clear. Each is 
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+
+**Subcolumnization.** When data is written into a `VARIANT` column, Doris 
automatically discovers JSON paths and extracts hot paths as independent 
columnar subcolumns for efficient analytics.
+
+![Default VARIANT: Automatic Subcolumn 
Extraction](/images/variant/variant-default-storage.png)
+
+**Schema Template.** A declaration on a `VARIANT` column that pins selected 
paths to stable types. Use it for key business fields that must stay typed, 
indexable, and predictable. Do not try to enumerate every possible path.
+
+**Wide JSON.** You have a wide-JSON problem when the number of distinct paths 
keeps growing and starts to increase metadata size, write cost, compaction 
cost, or query cost.
+
+**Sparse columns.** When wide JSON has a clear hot/cold split, sparse columns 
keep hot paths in Subcolumnization while pushing cold (long-tail) paths into 
shared sparse storage. Sparse storage supports sharding across multiple 
physical columns for better read parallelism.
+
+![Sparse Columns: Hot/Cold Path 
Separation](/images/variant/variant-sparse-storage.png)
+
+As shown above, hot paths (such as `user_id`, `page`) stay as independent 
columnar subcolumns with full analytics speed, while thousands of long-tail 
paths converge into shared sparse storage. The threshold is controlled by 
`variant_max_subcolumns_count`.
+
+**Sparse sharding.** When the long-tail path count is very large, a single 
sparse column can become a read bottleneck. Sparse sharding distributes 
long-tail paths by hash across multiple physical columns 
(`variant_sparse_hash_shard_count`), so they can be scanned in parallel.
+
+![Sparse Sharding: Parallel Read for Long-Tail 
Paths](/images/variant/variant-sparse-sharding.png)
+
+**DOC mode.** Delays Subcolumnization at write time and additionally stores 
the original JSON as a map-format stored field. This gives fast ingest and 
efficient whole-document return at the cost of extra storage. Subcolumnization 
still happens later during compaction.
+
+![DOC Mode: Deferred Extraction + Fast Document 
Return](/images/variant/variant-doc-mode.png)
+
+As illustrated above, during write the JSON is preserved as-is into a Doc 
Store for fast ingest. Subcolumns are extracted later during compaction. At 
read time, path-based queries (e.g. `SELECT v['user_id']`) read from 
materialized subcolumns at full columnar speed, while whole-document queries 
(`SELECT v`) read directly from the Doc Store without reconstructing from 
subcolumns.
+
+**Storage Format V3.** Decouples column metadata from the segment footer. 
Recommended for any `VARIANT` table, especially wide JSON, because it 
eliminates the metadata bottleneck when thousands of subcolumns exist.
+
+## Recommended Decision Path
+
+![VARIANT Mode Decision Path](/images/variant/variant-decision-flowchart.png)
+
+## Typical starting points
+
+- Event or audit logs where queries mostly filter on `event_type`, `user_id`, 
or `app_version`: start with default `VARIANT`, then move to sparse columns 
only if path count keeps growing.
+- Advertising, telemetry, or profile payloads with thousands of optional 
attributes but analytics on a small hot set: start with sparse columns.
+- Model responses, trace snapshots, or archived JSON documents that are often 
returned as complete payloads: start with DOC mode.
+- Order, payment, or device payloads where a few paths such as `order_id`, 
`status`, or `device_id` must stay typed and indexed: add Schema Template only 
for those paths.
+
+For most workloads, the default configuration is already the right starting 
point. Tune only when the access pattern is clearly unusual. Typical examples 
include AI training feature payloads, connected-vehicle telemetry, and user-tag 
systems that need unusually large-scale Subcolumnization together with many 
path-level indexes.
+
+## Recommended templates
+
+Use the table below to pick a starting point, then read the matching template.
+
+| | Typical scenario | Recommended mode | Key configuration |
+|---|---|---|---|
+| **A** | Event logs, audit logs | Default VARIANT + V3 | Keep defaults |
+| **B** | Advertising / telemetry / user profiles (wide, hot paths few) | 
Sparse + V3 | `variant_max_subcolumns_count`, `variant_sparse_hash_shard_count` 
|
+| **C** | Model output / trace / archives (ingest-first or whole-doc return) | 
DOC mode + V3 | `variant_enable_doc_mode`, 
`variant_doc_materialization_min_rows` |
+| **D** | Orders / payments / devices (key paths need stable types) | Schema 
Template + A or B | Define only key paths |
+

Review Comment:
   The markdown table header uses a double leading pipe (`|| ...`), which 
introduces an extra empty column and can render incorrectly in Docusaurus/GFM. 
Please rewrite this table using standard pipe syntax (single leading `|`) with 
a consistent number of columns for header, separator, and rows.



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,294 @@
+---
+{
+    "title": "VARIANT 使用与配置指南",
+    "language": "zh-CN",
+    "description": "帮助用户判断何时使用 VARIANT、宽列场景如何在默认模式、Sparse 和 DOC mode 
之间做选择，以及如何确定起步配置。"
+}
+---
+
+## 概览
+
+`VARIANT` 用来存储半结构化 JSON，并对常用路径执行子列列式提取（Subcolumnization）。
+
+这篇文档用来帮你给新的 `VARIANT` 场景选方案。适合在设计阶段先回答下面这些问题：
+
+- 这个场景应该用 `VARIANT`，还是直接建静态列？
+- JSON 很宽时，应该先用默认模式、Sparse，还是 DOC mode？
+- 哪些配置先保持默认，哪些配置才值得优先调整？
+
+如果你已经确定要使用 `VARIANT`，只是想查语法或类型规则，请直接看 
[VARIANT](./VARIANT)。如果你只需要一个最小可运行的导入示例，请看 [导入 Variant 
数据](../../../../data-operate/import/complex-types/variant)。
+
+:::tip 为什么使用 VARIANT
+`VARIANT` 保留了 JSON 
的灵活性，同时又能让常用路径通过子列列式提取（Subcolumnization）获得类似普通列的裁剪、聚合和索引能力，不必先把整份文档固化成静态 
schema。对超宽 JSON，存储层优化也让更大规模的 Subcolumnization 保持可控。
+:::
+
+## 什么时候适合用 VARIANT
+
+当以下条件大多成立时，`VARIANT` 往往是合适的选择：
+
+- 输入是 JSON 或其他半结构化载荷，而且字段会持续演进。
+- 查询通常集中在一部分热点路径，而不是每一行都访问全部字段。
+- 你希望保留 schema 灵活性，同时获得列式分析性能。
+- 只有少量关键路径需要索引，其余路径可以保持动态。
+
+当以下条件更占主导时，优先考虑静态列：
+
+- schema 稳定，并且可以提前定义清楚。
+- 核心字段经常作为 Join Key、排序键，或者必须严格类型治理。
+- 主要诉求是原样存档 JSON，而不是按路径做分析。
+
+## 先回答四个问题
+
+在动任何配置之前，先把下面四个问题答清楚。
+
+### 1. 有没有明确的热点路径？
+
+如果查询总是反复落在同一批 JSON Path 上，Doris 就能持续对这些路径执行子列列式提取（Subcolumnization）。这正是 
`VARIANT` 最有价值的场景。
+
+### 2. 是否有少数路径必须固定类型或稳定索引？
+
+如果有，就只给这些路径加 Schema Template。它适合少量关键字段，不适合把整份文档都描述一遍。
+
+### 3. 这是不是已经变成宽 JSON 问题？
+
+当 Path 数量持续增长，并开始带来元数据压力、Compaction 压力或明显的查询开销时，就已经进入宽 JSON 问题域。
+
+### 4. 对宽 JSON 来说，更重要的是热点路径分析，还是整条文档返回？
+
+- 如果核心价值仍然是热点字段上的过滤、聚合和索引，优先往 Sparse 方向想。
+- 如果核心价值更偏向导入效率或整条文档返回，优先往 DOC mode 方向想。
+
+## 关键概念
+
+阅读下面的模板之前，先确认以下术语清晰。每个概念用 2-3 行讲清边界；实现细节请参考 [VARIANT](./VARIANT)。
+
+**子列列式提取（Subcolumnization）。** 写入 `VARIANT` 列时，Doris 会自动发现 JSON 
Path，并对热点路径执行子列列式提取，使其以独立子列的形式参与分析。
+
+![默认 VARIANT：自动子列提取](/images/variant/variant-default-storage.png)
+
+**Schema Template。** 一种在 `VARIANT` 
列上的声明，用来把部分路径固定为稳定类型。它适合少量关键业务字段，让这些路径的类型、索引和行为更可控；不应试图穷举所有可能路径。
+
+**宽 JSON。** 当 JSON Path 的总数持续增长，并开始影响元数据规模、写入成本、Compaction 成本或查询成本时，就进入了宽 JSON 
问题域。
+
+**Sparse columns（稀疏列）。** 当宽 JSON 有明显的冷热分布时，Sparse 
让热点路径继续保留子列列式提取（Subcolumnization）的结果，而冷门（长尾）路径进入共享的稀疏存储。稀疏存储支持分片，将对多个物理列进行分散存储以提升读并行度。
+
+![Sparse Columns：冷热路径分离](/images/variant/variant-sparse-storage.png)
+
+如上图所示，热点路径（如 `user_id`、`page`）继续以独立列式子列的形式保持高性能分析能力，而数千个长尾路径则汇入共享稀疏存储。阈值通过 
`variant_max_subcolumns_count` 控制。
+
+**Sparse sharding（稀疏分片）。** 
当长尾路径数量非常大时，单个稀疏列可能成为读取瓶颈。稀疏分片通过哈希将长尾路径分散到多个物理列（`variant_sparse_hash_shard_count`），从而可以并行扫描。
+
+![Sparse Sharding：长尾路径并行读取](/images/variant/variant-sparse-sharding.png)
+
+**DOC mode。** 写入时延迟子列列式提取（Subcolumnization），并额外存储一份 map 格式的原始 
JSON。这带来了快速导入和高效整条文档返回能力，代价是额外存储。后续 Compaction 时仍会完成 Subcolumnization。
+
+![DOC Mode：延迟提取 + 快速文档返回](/images/variant/variant-doc-mode.png)
+
+如上图所示，写入时 JSON 被原样保存到 Doc Store 以实现快速导入。子列在后续 Compaction 过程中提取。读取时，按路径查询（如 
`SELECT v['user_id']`）从物化子列中以列式速度读取；而整条文档查询（`SELECT v`）则直接从 Doc Store 
中读取，无需从大量子列重组文档。
+
+**Storage Format V3。** 把列元数据从 Segment Footer 中解耦出来。推荐在所有 `VARIANT` 表上使用，尤其是宽 
JSON 场景，因为它消除了上千子列同时存在时的元数据瓶颈。
+
+## 推荐决策路径
+
+![VARIANT 模式决策路径](/images/variant/variant-decision-flowchart.png)
+
+## 几个常见起点
+
+- 事件日志或审计日志，查询主要围绕 `event_type`、`user_id`、`app_version` 这类常见路径：先用默认 
`VARIANT`，只有在路径数持续增长时再考虑 Sparse。
+- 广告、遥测或用户画像载荷，属性很多，但稳定查询的只是少量热点字段：先从 Sparse 开始。
+- 模型输出、Trace 快照或文档归档，查询经常需要整条 JSON：先从 DOC mode 开始。
+- 订单、支付或设备载荷里，只有少数路径如 `order_id`、`status`、`device_id` 必须保持稳定类型并建立索引：只给这些路径加 
Schema Template。
+
+对大多数 workload 来说，默认配置已经是合适的起点。只有在访问模式比较特殊时，才需要按场景调优。典型例子包括 AI 
训练特征载荷、车联网遥测、用户标签系统这类需要支撑大规模子列列式提取（Subcolumnization）和大量路径级索引的场景。
+
+## 推荐起步模板
+
+先用下表选一个起点，再看对应模板。
+
+| | 典型场景 | 推荐模式 | 关键配置 |
+|---|---|---|---|
+| **A** | 事件日志、审计日志 | 默认 VARIANT + V3 | 保持默认 |
+| **B** | 广告/遥测/用户画像（宽、热点少） | Sparse + V3 | 
`variant_max_subcolumns_count`、`variant_sparse_hash_shard_count` |
+| **C** | 模型输出/Trace/归档（写入优先或整条返回） | DOC mode + V3 | 
`variant_enable_doc_mode`、`variant_doc_materialization_min_rows` |
+| **D** | 订单/支付/设备（关键路径需稳定类型） | Schema Template + A 或 B | 只定义关键路径 |
+

Review Comment:
   表格头部使用了双竖线开头（`|| ...`），会导致多出空列，Docusaurus/GFM 渲染可能异常。建议改为标准 Markdown 
表格写法（单个前导 `|`），并确保表头/分隔行/数据行的列数一致。



##########
versioned_docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,217 @@
+---
+{
+    "title": "VARIANT Workload Guide",
+    "language": "en",
+    "description": "Decision guide for when to use VARIANT in Doris 3.x, when 
to enable sparse columns, and when to add Schema Template or path-specific 
indexes."
+}
+---
+
+## Overview
+
+`VARIANT` stores semi-structured JSON and uses Subcolumnization on frequently 
used paths.
+
+Use this guide when you are planning a new `VARIANT` workload in Doris 3.x. It 
helps answer questions such as:
+
+- Should this workload use `VARIANT` or static columns?
+- If the JSON is getting wide, should I stay with default behavior or enable 
sparse columns?
+- When should I add a Schema Template or path-specific indexes?
+
+If you already know you want `VARIANT` and only need syntax or type rules, go 
to [VARIANT](./VARIANT). If you want the smallest runnable import example, go 
to [Import Variant Data](../../../../data-operate/import/complex-types/variant).
+
+:::tip Why choose VARIANT
+`VARIANT` keeps JSON flexible while letting Doris apply Subcolumnization to 
frequently used paths. In Doris 3.1 and later, wide JSON can keep hot paths in 
Subcolumnization while long-tail paths move into sparse storage, so you do not 
need to freeze every field in advance.
+:::
+
+:::note 3.x capability boundary
+This guide only covers capabilities available in Doris 3.x. Sparse columns, 
`variant_max_subcolumns_count`, `variant_enable_typed_paths_to_sparse`, and 
path-specific indexes require Doris 3.1.0 or later. DOC mode guidance from 
newer versions does not apply to Doris 3.x.
+:::
+
+## When VARIANT Fits
+
+Use `VARIANT` when all or most of the following are true:
+
+- The input is JSON or another semi-structured payload whose fields evolve 
over time.
+- Queries usually touch a subset of hot paths instead of every field in every 
row.
+- You want schema flexibility without giving up columnar analytics performance.
+- Some paths need indexing, while many other paths can remain dynamic.
+
+Prefer static columns when these conditions dominate:
+
+- The schema is stable and known in advance.
+- Core fields are regularly used as join keys, sort keys, or tightly 
controlled typed columns.
+- The main requirement is to archive raw JSON or return the whole document 
frequently, not to analyze by path.
+
+## Key Concepts
+
+Before reading the templates below, make sure these terms are clear. Each is 
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+
+**Subcolumnization.** When data is written into a `VARIANT` column, Doris 
automatically discovers JSON paths and extracts hot paths as independent 
columnar subcolumns for efficient analytics.
+
+![Default VARIANT: Automatic Subcolumn 
Extraction](/images/variant/variant-default-storage.png)
+
+**Schema Template (3.1+).** A declaration on a `VARIANT` column that pins 
selected paths to stable types. Use it for key business fields that must stay 
typed, indexable, and predictable. Do not try to enumerate every possible path.
+
+**Wide JSON.** You have a wide-JSON problem when the number of distinct paths 
keeps growing and starts to increase metadata size, write cost, compaction 
cost, or query cost.
+
+**Sparse columns (3.1+).** When wide JSON has a clear hot/cold split, sparse 
columns keep hot paths in Subcolumnization while pushing cold (long-tail) paths 
into shared sparse storage. Use `variant_max_subcolumns_count` to control the 
boundary.
+
+![Sparse Columns: Hot/Cold Path 
Separation](/images/variant/variant-sparse-storage.png)
+
+As shown above, hot paths (such as `user_id`, `page`) stay as independent 
columnar subcolumns with full analytics speed, while thousands of long-tail 
paths converge into shared sparse storage. The threshold is controlled by 
`variant_max_subcolumns_count`.
+
+## Recommended Decision Path
+
+![VARIANT Mode Decision Path (Doris 
3.x)](/images/variant/variant-decision-flowchart-3x.png)
+
+For wide JSON where most queries return the whole document, Doris 3.x 
`VARIANT` is usually not the best fit because there is no DOC mode. Avoid 
making `SELECT variant_col` the main query pattern on very wide columns.
+
+For most workloads, the default configuration is already the right starting 
point. Tune only when the access pattern is clearly unusual. Typical examples 
include AI training feature payloads, connected-vehicle telemetry, and user-tag 
systems that, in Doris 3.1 and later, need unusually large-scale 
Subcolumnization together with many path-level indexes.
+
+## Recommended templates
+
+Use the table below to pick a starting point, then read the matching template.
+
+| | Typical scenario | Recommended mode | Key configuration |
+|---|---|---|---|
+| **A** | Event logs, audit logs | Default VARIANT | Keep defaults |
+| **B** | Advertising / telemetry / user profiles (wide, hot paths few) | 
Sparse (3.1+) | `variant_max_subcolumns_count` |
+| **C** | Orders / payments / devices (key paths need stable types) | Schema 
Template (3.1+) + A or B | Define only key paths |
+

Review Comment:
   The markdown table header uses a double leading pipe (`|| ...`), which 
introduces an extra empty column and can render incorrectly in Docusaurus/GFM. 
Please rewrite this table using standard pipe syntax (single leading `|`) with 
a consistent number of columns for header, separator, and rows.



##########
i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,218 @@
+---
+{
+    "title": "VARIANT 使用与配置指南",
+    "language": "zh-CN",
+    "description": "帮助用户判断在 Doris 3.x 中何时使用 VARIANT、何时开启 Sparse，以及何时增加 Schema 
Template 或路径级索引。"
+}
+---
+
+## 概览
+
+`VARIANT` 用来存储半结构化 JSON，并对常用路径执行子列列式提取（Subcolumnization）。
+
+这篇文档用来帮你给 Doris 3.x 的 `VARIANT` 场景选方案。适合在落地前先回答下面这些问题：
+
+- 这个场景应该用 `VARIANT`，还是直接建静态列？
+- JSON 开始变宽时，应该继续使用默认模式，还是开启 Sparse？
+- 什么时候应该增加 Schema Template 或路径级索引？
+
+如果你已经确定要使用 `VARIANT`，只是想查语法或类型规则，请直接看 
[VARIANT](./VARIANT)。如果你只需要一个最小可运行的导入示例，请看 [导入 Variant 
数据](../../../../data-operate/import/complex-types/variant)。
+
+:::tip 为什么使用 VARIANT
+`VARIANT` 保留了 JSON 的灵活性，同时又能让 Doris 对常用路径执行子列列式提取（Subcolumnization）。在 Doris 
3.1 及以上版本里，宽 JSON 还可以让热点路径继续保留 Subcolumnization 的结果，长尾路径进入共享稀疏存储，不必先把所有字段固定下来。
+:::
+
+:::note 3.x 版本边界
+这篇指南只覆盖 Doris 3.x 
已支持的能力。Sparse、`variant_max_subcolumns_count`、`variant_enable_typed_paths_to_sparse`
 以及路径级索引都要求 Doris 3.1.0 及以上版本；更新版本中的 DOC mode 指南不适用于 Doris 3.x。
+:::
+
+## 什么时候适合用 VARIANT
+
+当以下条件大多成立时，优先考虑 `VARIANT`：
+
+- 输入是 JSON 或其他半结构化载荷，而且字段会持续演进。
+- 查询通常集中在一部分热点路径，而不是每一行都访问全部字段。
+- 你希望保留 schema 灵活性，同时获得列式分析性能。
+- 只有少量关键路径需要索引，其余路径可以保持动态。
+
+当以下条件更占主导时，优先考虑静态列：
+
+- schema 稳定，并且可以提前定义清楚。
+- 核心字段经常作为 Join Key、排序键，或者必须严格类型治理。
+- 主要诉求是原样存档 JSON，或者频繁整条返回完整文档，而不是按路径做分析。
+
+## 关键概念
+
+阅读下面的模板之前，先确认以下术语清晰。每个概念用 2-3 行讲清边界；实现细节请参考 [VARIANT](./VARIANT)。
+
+**子列列式提取（Subcolumnization）。** 写入 `VARIANT` 列时，Doris 会自动发现 JSON 
Path，并对热点路径执行子列列式提取，使其以独立子列的形式参与分析。
+
+![默认 VARIANT：自动子列提取](/images/variant/variant-default-storage.png)
+
+**Schema Template（3.1+）。** 一种在 `VARIANT` 
列上的声明，用来把部分路径固定为稳定类型。它适合少量关键业务字段，让这些路径的类型、索引和行为更可控；不应试图穷举所有可能路径。
+
+**宽 JSON。** 当 JSON Path 的总数持续增长，并开始影响元数据规模、写入成本、Compaction 成本或查询成本时，就进入了宽 JSON 
问题域。
+
+**Sparse columns（稀疏列，3.1+）。** 当宽 JSON 有明显的冷热分布时，Sparse 
让热点路径继续保留子列列式提取（Subcolumnization）的结果，而冷门（长尾）路径进入共享的稀疏存储。使用 
`variant_max_subcolumns_count` 控制边界。
+
+![Sparse Columns：冷热路径分离](/images/variant/variant-sparse-storage.png)
+
+如上图所示，热点路径（如 `user_id`、`page`）继续以独立列式子列的形式保持高性能分析能力，而数千个长尾路径则汇入共享稀疏存储。阈值通过 
`variant_max_subcolumns_count` 控制。
+
+## 推荐决策路径
+
+![VARIANT 模式决策路径 (Doris 
3.x)](/images/variant/variant-decision-flowchart-3x.png)
+
+如果宽 JSON 的主访问模式是整条文档返回，Doris 3.x 的 `VARIANT` 往往不是最佳匹配，因为没有 DOC mode。不建议在超宽列上把 
`SELECT variant_col` 作为主查询模式。
+
+对大多数 workload 来说，默认配置已经是合适的起点。只有在访问模式比较特殊时，才需要按场景调优。典型例子包括 AI 
训练特征载荷、车联网遥测、用户标签系统这类在 Doris 3.1 
及以上版本中需要支撑大规模子列列式提取（Subcolumnization）和大量路径级索引的场景。
+
+## 推荐起步模板
+
+先用下表选一个起点，再看对应模板。
+
+| | 典型场景 | 推荐模式 | 关键配置 |
+|---|---|---|---|
+| **A** | 事件日志、审计日志 | 默认 VARIANT | 保持默认 |
+| **B** | 广告/遥测/用户画像（宽、热点少） | Sparse（3.1+） | `variant_max_subcolumns_count` |
+| **C** | 订单/支付/设备（关键路径需稳定类型） | Schema Template（3.1+） + A 或 B | 只定义关键路径 |
+

Review Comment:
   表格头部使用了双竖线开头（`|| ...`），会导致多出空列，Docusaurus/GFM 渲染可能异常。建议改为标准 Markdown 
表格写法（单个前导 `|`），并确保表头/分隔行/数据行的列数一致。



##########
docs/sql-manual/basic-element/sql-data-types/semi-structured/variant-workload-guide.md:
##########
@@ -0,0 +1,294 @@
+---
+{
+    "title": "VARIANT Workload Guide",
+    "language": "en",
+    "description": "Decision guide for when to use VARIANT, how to choose 
between default, sparse, DOC mode, and Schema Template, and where to start with 
configuration."
+}
+---
+
+## Overview
+
+`VARIANT` stores semi-structured JSON and uses Subcolumnization on frequently 
used paths.
+
+Use this guide when you are deciding how to model a new `VARIANT` workload. It 
helps answer questions such as:
+
+- Should this workload use `VARIANT` or static columns?
+- If the JSON is very wide, should I start with default behavior, sparse 
columns, or DOC mode?
+- Which settings should I leave at defaults, and which ones should I change 
first?
+
+If you already know you want `VARIANT` and only need syntax or type rules, go 
to [VARIANT](./VARIANT). If you want the smallest runnable import example, go 
to [Import Variant Data](../../../../data-operate/import/complex-types/variant).
+
+:::tip Why choose VARIANT
+`VARIANT` keeps JSON flexible, but Doris can still apply Subcolumnization to 
frequently used paths. That lets common filters, aggregations, and path-level 
indexes work well without freezing the whole document schema in advance. On 
very wide JSON, storage-layer optimizations keep Subcolumnization practical at 
much larger path counts.
+:::
+
+## When VARIANT Fits
+
+`VARIANT` is usually a good fit when most of the following are true:
+
+- The input is JSON or another semi-structured payload whose fields evolve 
over time.
+- Queries usually touch a subset of hot paths instead of every field in every 
row.
+- You want schema flexibility without giving up columnar analytics performance.
+- Some paths need indexing, while many other paths can remain dynamic.
+
+Prefer static columns when these conditions dominate:
+
+- The schema is stable and known in advance.
+- Core fields are regularly used as join keys, sort keys, or tightly 
controlled typed columns.
+- The main requirement is to archive raw JSON, not to analyze by path.
+
+## Four Questions First
+
+Before touching any setting, answer these four questions.
+
+### 1. Are there clear hot paths?
+
+If queries repeatedly touch the same JSON paths, Doris can keep applying 
Subcolumnization to those paths. That is where `VARIANT` helps most.
+
+### 2. Do a few paths need fixed types or stable indexes?
+
+If yes, use Schema Template for those paths only. It is meant for a small set 
of business-critical fields, not for describing the whole document.
+
+### 3. Is this really becoming wide JSON?
+
+You have a wide-JSON problem when path count keeps growing and starts to 
create metadata pressure, compaction pressure, or noticeable query overhead.
+
+### 4. For wide JSON, what matters more: hot-path analytics or whole-document 
return?
+
+- If the main value is still path-based filtering, aggregation, and indexing 
on hot fields, lean toward sparse columns.
+- If the main value is ingest efficiency or returning the whole document, lean 
toward DOC mode.
+
+## Key Concepts
+
+Before reading the templates below, make sure these terms are clear. Each is 
explained in 2-3 lines; for implementation details, see [VARIANT](./VARIANT).
+
+**Subcolumnization.** When data is written into a `VARIANT` column, Doris 
automatically discovers JSON paths and extracts hot paths as independent 
columnar subcolumns for efficient analytics.
+
+![Default VARIANT: Automatic Subcolumn 
Extraction](/images/variant/variant-default-storage.png)
+
+**Schema Template.** A declaration on a `VARIANT` column that pins selected 
paths to stable types. Use it for key business fields that must stay typed, 
indexable, and predictable. Do not try to enumerate every possible path.
+
+**Wide JSON.** You have a wide-JSON problem when the number of distinct paths 
keeps growing and starts to increase metadata size, write cost, compaction 
cost, or query cost.
+
+**Sparse columns.** When wide JSON has a clear hot/cold split, sparse columns 
keep hot paths in Subcolumnization while pushing cold (long-tail) paths into 
shared sparse storage. Sparse storage supports sharding across multiple 
physical columns for better read parallelism.
+
+![Sparse Columns: Hot/Cold Path 
Separation](/images/variant/variant-sparse-storage.png)
+
+As shown above, hot paths (such as `user_id`, `page`) stay as independent 
columnar subcolumns with full analytics speed, while thousands of long-tail 
paths converge into shared sparse storage. The threshold is controlled by 
`variant_max_subcolumns_count`.
+
+**Sparse sharding.** When the long-tail path count is very large, a single 
sparse column can become a read bottleneck. Sparse sharding distributes 
long-tail paths by hash across multiple physical columns 
(`variant_sparse_hash_shard_count`), so they can be scanned in parallel.
+
+![Sparse Sharding: Parallel Read for Long-Tail 
Paths](/images/variant/variant-sparse-sharding.png)
+
+**DOC mode.** Delays Subcolumnization at write time and additionally stores 
the original JSON as a map-format stored field. This gives fast ingest and 
efficient whole-document return at the cost of extra storage. Subcolumnization 
still happens later during compaction.
+
+![DOC Mode: Deferred Extraction + Fast Document 
Return](/images/variant/variant-doc-mode.png)
+
+As illustrated above, during write the JSON is preserved as-is into a Doc 
Store for fast ingest. Subcolumns are extracted later during compaction. At 
read time, path-based queries (e.g. `SELECT v['user_id']`) read from 
materialized subcolumns at full columnar speed, while whole-document queries 
(`SELECT v`) read directly from the Doc Store without reconstructing from 
subcolumns.
+
+**Storage Format V3.** Decouples column metadata from the segment footer. 
Recommended for any `VARIANT` table, especially wide JSON, because it 
eliminates the metadata bottleneck when thousands of subcolumns exist.
+
+## Recommended Decision Path
+
+![VARIANT Mode Decision Path](/images/variant/variant-decision-flowchart.png)
+
+## Typical starting points
+
+- Event or audit logs where queries mostly filter on `event_type`, `user_id`, 
or `app_version`: start with default `VARIANT`, then move to sparse columns 
only if path count keeps growing.
+- Advertising, telemetry, or profile payloads with thousands of optional 
attributes but analytics on a small hot set: start with sparse columns.
+- Model responses, trace snapshots, or archived JSON documents that are often 
returned as complete payloads: start with DOC mode.
+- Order, payment, or device payloads where a few paths such as `order_id`, 
`status`, or `device_id` must stay typed and indexed: add Schema Template only 
for those paths.
+
+For most workloads, the default configuration is already the right starting 
point. Tune only when the access pattern is clearly unusual. Typical examples 
include AI training feature payloads, connected-vehicle telemetry, and user-tag 
systems that need unusually large-scale Subcolumnization together with many 
path-level indexes.
+
+## Recommended templates
+
+Use the table below to pick a starting point, then read the matching template.
+
+| | Typical scenario | Recommended mode | Key configuration |
+|---|---|---|---|
+| **A** | Event logs, audit logs | Default VARIANT + V3 | Keep defaults |
+| **B** | Advertising / telemetry / user profiles (wide, hot paths few) | 
Sparse + V3 | `variant_max_subcolumns_count`, `variant_sparse_hash_shard_count` 
|
+| **C** | Model output / trace / archives (ingest-first or whole-doc return) | 
DOC mode + V3 | `variant_enable_doc_mode`, 
`variant_doc_materialization_min_rows` |
+| **D** | Orders / payments / devices (key paths need stable types) | Schema 
Template + A or B | Define only key paths |
+

Review Comment:
   The markdown table header uses a double leading pipe (`|| ...`), which 
introduces an extra empty column and can render incorrectly in Docusaurus/GFM. 
Please rewrite this table using standard pipe syntax (single leading `|`) with 
a consistent number of columns for header, separator, and rows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] refactor variant doc and add workload guide [doris-website]

Reply via email to