This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 3a6fb4305c6 update VARIANT.md for the new features (#2753)
3a6fb4305c6 is described below
commit 3a6fb4305c6120f93e86a33c718c7dc569861b9a
Author: lihangyu <[email protected]>
AuthorDate: Fri Aug 15 16:28:57 2025 +0800
update VARIANT.md for the new features (#2753)
## Versions
- [x] dev
- [ ] 3.0
- [ ] 2.1
- [ ] 2.0
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
.../sql-data-types/semi-structured/VARIANT.md | 794 +++++++++++----------
.../sql-data-types/semi-structured/VARIANT.md | 708 ++++++++++--------
2 files changed, 815 insertions(+), 687 deletions(-)
diff --git
a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
index 3c55900fc00..f9b46d3c7f7 100644
--- a/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
+++ b/docs/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
@@ -1,483 +1,501 @@
---
{
"title": "VARIANT",
- "language": "en"
+ "language": "en-US"
}
---
## VARIANT
-### Description
+## Overview
-Introduced a new data type VARIANT in Doris 2.1, which can store
semi-structured JSON data. It allows storing complex data structures containing
different data types (such as integers, strings, boolean values, etc.) without
the need to define specific columns in the table structure beforehand. The
VARIANT type is particularly useful for handling complex nested structures that
may change at any time. During the writing process, this type can automatically
infer column information based on [...]
+The VARIANT type stores semi-structured JSON data. It can contain different
primitive types (integers, strings, booleans, etc.), one-dimensional arrays,
and nested objects. On write, Doris infers the structure and type of sub-paths
based on JSON paths and materializes frequent paths as independent subcolumns,
leveraging columnar storage and vectorized execution for both flexibility and
performance.
-### Note
+## Using VARIANT
-Advantages over JSON Type:
+### Create table syntax
-1. Different storage methods: The JSON type is stored in binary JSONB format,
and the entire JSON is stored row by row in segment files. In contrast, the
VARIANT type infers types during writing and stores the written JSON columns.
It has a higher compression ratio compared to the JSON type, providing better
storage efficiency.
-2. Query: Querying does not require parsing. VARIANT fully utilizes columnar
storage, vectorized engines, optimizers, and other components in Doris,
providing users with extremely high query performance.
-Below are test results based on clickbench data:
+Declare a VARIANT column when creating a table:
-| | Storage Space |
-|--------------|------------|
-| Predefined Static Columns | 12.618 GB |
-| VARIANT Type | 12.718 GB |
-| JSON Type | 35.711 GB |
-
-**Saves approximately 65% storage capacity**
-
-| Query Counts | Predefined Static Columns | VARIANT Type | JSON Type
|
-|---------------------|---------------------------|--------------|-----------------|
-| First Query (cold) | 233.79s | 248.66s | **Most
queries timeout** |
-| Second Query (hot) | 86.02s | 94.82s | 789.24s
|
-| Third Query (hot) | 83.03s | 92.29s | 743.69s
|
-
-[test
case](https://github.com/ClickHouse/ClickBench/blob/main/doris/queries.sql)
contains 43 queries
-
-**8x faster query, query performance comparable to static columns**
-
-### Example
-
-Demonstrate the functionality and usage of VARIANT with an example covering
table creation, data import, and query cycle.
-
-**Table Creation Syntax**
-Create a table, using the `VARIANT` keyword in the syntax.
-
-``` sql
--- Without index
+```sql
CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT
)
-table_properties;
+PROPERTIES("replication_num" = "1");
+```
--- Create an index on the v column, optionally specify the tokenize method,
default is untokenized
-CREATE TABLE IF NOT EXISTS ${table_name} (
- k BIGINT,
- v VARIANT,
- INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" =
"english|unicode|chinese")] [COMMENT 'your comment']
-)
-table_properties;
+Constrain certain paths with a Schema Template (see “Extended types”):
--- Create an bloom filter on v column, to enhance query seed on sub columns
+```sql
CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
- v VARIANT
+ v VARIANT <
+ 'id' : INT, -- restrict path id to INT
+ 'message*' : STRING, -- restrict message* prefix to STRING
+ 'tags*' : ARRAY<TEXT> -- restrict tags* prefix to ARRAY<TEXT>
+ >
)
-...
-properties("replication_num" = "1", "bloom_filter_columns" = "v");
-
+PROPERTIES("replication_num" = "1");
```
-**Query Syntax**
+### Query syntax
+
+```sql
+-- Access nested fields (returns VARIANT; explicit or implicit CAST is
required for aggregation/comparison)
+SELECT v['properties']['title'] FROM ${table_name};
-``` sql
--- use v['a']['b'] format for example, v['properties']['title'] type is VARIANT
-SELECT v['properties']['title'] from ${table_name}
+-- CAST to a concrete type before aggregation
+SELECT CAST(v['properties']['title'] AS STRING) AS title
+FROM ${table_name}
+GROUP BY title;
+-- Query arrays
+SELECT *
+FROM ${table_name}
+WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');
```
-**Example based on the GitHub events dataset**
+## Primitive types
+
+VARIANT infers subcolumn types automatically. Supported types include:
+
+<table>
+<tr><td>Supported types<br/></td></tr>
+<tr><td>TinyInt<br/></td></tr>
+<tr><td>NULL (equivalent to JSON null)<br/></td></tr>
+<tr><td>BigInt (64 bit)<br/>Double<br/></td></tr>
+<tr><td>String (Text)<br/></td></tr>
+<tr><td>Jsonb<br/></td></tr>
+<tr><td>Variant (nested object)<br/></td></tr>
+<tr><td>Array<T> (one-dimensional only)<br/></td></tr>
+</table>
+
+Simple INSERT example:
+
+```sql
+INSERT INTO vartab VALUES
+ (1, 'null'),
+ (2, NULL),
+ (3, 'true'),
+ (4, '-17'),
+ (5, '123.12'),
+ (6, '1.912'),
+ (7, '"A quote"'),
+ (8, '[-1, 12, false]'),
+ (9, '{ "x": "abc", "y": false, "z": 10 }'),
+ (10, '"2021-01-01"');
+```
-Here, github events data is used to demonstrate the table creation, data
import, and query using VARIANT.
-The below is a formatted line of data:
+Tip: Non-standard JSON types such as date/time will be stored as strings
unless a Schema Template is provided. For better computation efficiency,
consider extracting them to static columns or declaring their types via a
Schema Template.
-``` json
-{
- "id": "14186154924",
- "type": "PushEvent",
- "actor": {
- "id": 282080,
- "login": "brianchandotcom",
- "display_login": "brianchandotcom",
- "gravatar_id": "",
- "url": "https://api.github.com/users/brianchandotcom",
- "avatar_url": "https://avatars.githubusercontent.com/u/282080?"
- },
- "repo": {
- "id": 1920851,
- "name": "brianchandotcom/liferay-portal",
- "url": "https://api.github.com/repos/brianchandotcom/liferay-portal"
- },
- "payload": {
- "push_id": 6027092734,
- "size": 4,
- "distinct_size": 4,
- "ref": "refs/heads/master",
- "head": "91edd3c8c98c214155191feb852831ec535580ba",
- "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d",
- "commits": [""]
- },
- "public": true,
- "created_at": "2020-11-13T18:00:00Z"
-}
-```
+## Extended types (Schema Template)
+
+Besides primitive types, VARIANT supports the following extended types via
Schema Template:
-**Table Creation**
+- Number (extended)
+ - Decimal: Decimal32 / Decimal64 / Decimal128 / Decimal256
+ - LargeInt
+- Datetime
+- Date
+- IPV4 / IPV6
+- Boolean
+- ARRAY<T> (T can be any of the above, one-dimensional only)
-- Created three columns of VARIANT type: `actor`, `repo`, and `payload`.
-- Simultaneously created an inverted index, `idx_payload`, for the `payload`
column while creating the table.
-- Specified the index type as inverted using `USING INVERTED`, aimed at
accelerating conditional filtering of sub-columns.
-- `PROPERTIES("parser" = "english")` specified the adoption of English
tokenization.
+Note: Predefined Schema can only be specified at table creation. ALTER is
currently not supported (future versions may support adding new subcolumn
definitions, but changing an existing subcolumn type is not supported).
-``` sql
-CREATE DATABASE test_variant;
-USE test_variant;
-CREATE TABLE IF NOT EXISTS github_events (
+Example:
+
+```sql
+CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
- type VARCHAR(30) NULL,
- actor VARIANT NULL,
- repo VARIANT NULL,
- payload VARIANT NULL,
- public BOOLEAN NULL,
- created_at DATETIME NULL,
- INDEX idx_payload (`payload`) USING INVERTED PROPERTIES("parser" =
"english") COMMENT 'inverted index for payload'
+ v1 VARIANT<
+ 'large_int_val': LARGEINT,
+ 'string_val': STRING,
+ 'decimal_val': DECIMAL(38, 9),
+ 'datetime_val': DATETIME,
+ 'ip_val': IPV4
+ > NULL
)
-DUPLICATE KEY(`id`)
-DISTRIBUTED BY HASH(id) BUCKETS 10
-properties("replication_num" = "1");
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_var_schema VALUES (1, '{
+ "large_int_val" : "123222222222222222222222",
+ "string_val" : "Hello World",
+ "decimal_val" : 1.11111111,
+ "datetime_val" : "2025-05-16 11:11:11",
+ "ip_val" : "127.0.0.1"
+}');
+
+SELECT variant_type(v1) FROM test_var_schema;
+
++----------------------------------------------------------------------------------------------------------------------------+
+| variant_type(v1)
|
++----------------------------------------------------------------------------------------------------------------------------+
+|
{"datetime_val":"datetimev2","decimal_val":"decimal128i","ip_val":"ipv4","large_int_val":"largeint","string_val":"string"}
|
++----------------------------------------------------------------------------------------------------------------------------+
```
-:::tip
+`{"date": 2020-01-01}` and `{"ip": 127.0.0.1}` are invalid JSON texts; the
correct format is `{"date": "2020-01-01"}` and `{"ip": "127.0.0.1"}`.
-1. Creating an index on VARIANT columns, such as when there are numerous
sub-columns in payload, might lead to an excessive number of index columns,
impacting write performance.
-2. The tokenization properties for the same VARIANT column are uniform. If you
have varied tokenization requirements, consider creating multiple VARIANT
columns and specifying index properties separately for each.
+Once a Schema Template is specified, if a JSON value conflicts with the
declared type and cannot be converted, it will be stored as NULL. For example:
-:::
+```sql
+INSERT INTO test_var_schema VALUES (1, '{
+ "decimal_val" : "1.11111111",
+ "ip_val" : "127.xxxxxx.xxxx",
+ "large_int_val" : "aaabbccc"
+}');
+-- Only decimal_val remains
+SELECT * FROM test_var_schema;
-**Using Streamload for Import**
++------+-----------------------------+
+| id | v1 |
++------+-----------------------------+
+| 1 | {"decimal_val":1.111111110} |
++------+-----------------------------+
+```
-Importing gh_2022-11-07-3.json, which contains one hour's worth of GitHub
events data.
+Schema only guides the persisted storage type. During query execution, the
effective type depends on actual data at runtime:
-``` shell
-wget
https://qa-build.oss-cn-beijing.aliyuncs.com/regression/variant/gh_2022-11-07-3.json
+```sql
+-- At runtime v['a'] may still be STRING
+SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);
+```
-curl --location-trusted -u root: -T gh_2022-11-07-3.json -H
"read_json_by_line:true" -H "format:json"
http://127.0.0.1:18148/api/test_variant/github_events/_strea
-m_load
+Wildcard matching and order:
-{
- "TxnId": 2,
- "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf",
- "Comment": "",
- "TwoPhaseCommit": "false",
- "Status": "Success",
- "Message": "OK",
- "NumberTotalRows": 139325,
- "NumberLoadedRows": 139325,
- "NumberFilteredRows": 0,
- "NumberUnselectedRows": 0,
- "LoadBytes": 633782875,
- "LoadTimeMs": 7870,
- "BeginTxnTimeMs": 19,
- "StreamLoadPutTimeMs": 162,
- "ReadDataTimeMs": 2416,
- "WriteDataTimeMs": 7634,
- "CommitAndPublishTimeMs": 55
-}
+```sql
+CREATE TABLE test_var_schema (
+ id BIGINT NOT NULL,
+ v1 VARIANT<
+ 'enumString*' : STRING,
+ 'enum*' : ARRAY<TEXT>,
+ 'ip*' : IPV6
+ > NULL
+)
+PROPERTIES ("replication_num" = "1");
+
+-- If enumString1 matches both patterns, the first matching pattern in
definition order (STRING) is used
```
-Confirm the successful import.
-
-``` sql
--- View the number of rows.
-mysql> select count() from github_events;
-+----------+
-| count(*) |
-+----------+
-| 139325 |
-+----------+
-1 row in set (0.25 sec)
-
--- Random select one row
-mysql> select * from github_events limit 1;
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-| id | type | actor
| repo
| payload
[...]
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-| 25061821748 | PushEvent |
{"gravatar_id":"","display_login":"jfrog-pipelie-intg","url":"https://api.github.com/users/jfrog-pipelie-intg","id":98024358,"login":"jfrog-pipelie-intg","avatar_url":"https://avatars.githubusercontent.com/u/98024358?"}
|
{"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16","id":562683829,"name":"jfrog-pipelie-intg/jfinte2e_1667789956723_16"}
|
{"commits":[{"sha":"334433de436baa198024ef9f55f0647721bcd750","author":{"email":"980243
[...]
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-1 row in set (0.23 sec)
+If a column name contains `*` and you want to match it by its literal name
(not as a prefix wildcard), use:
+
+```sql
+v1 VARIANT<
+ MATCH_NAME 'enumString*' : STRING
+> NULL
```
-Running desc command to view schema information, sub-columns will
automatically expand at the storage layer and undergo type inference.
-
-``` sql
-mysql> desc github_events;
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| Field | Type |
Null | Key | Default | Extra |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| id | BIGINT | No
| true | NULL | |
-| type | VARCHAR(*) |
Yes | false | NULL | NONE |
-| actor | VARIANT |
Yes | false | NULL | NONE |
-| created_at | DATETIME |
Yes | false | NULL | NONE |
-| payload | VARIANT |
Yes | false | NULL | NONE |
-| public | BOOLEAN |
Yes | false | NULL | NONE |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-6 rows in set (0.07 sec)
-
-mysql> set describe_extend_variant_column = true;
-Query OK, 0 rows affected (0.01 sec)
-
-mysql> desc github_events;
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| Field | Type |
Null | Key | Default | Extra |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| id | BIGINT | No
| true | NULL | |
-| type | VARCHAR(*) |
Yes | false | NULL | NONE |
-| actor | VARIANT |
Yes | false | NULL | NONE |
-| actor.avatar_url | TEXT |
Yes | false | NULL | NONE |
-| actor.display_login | TEXT |
Yes | false | NULL | NONE |
-| actor.id | INT |
Yes | false | NULL | NONE |
-| actor.login | TEXT |
Yes | false | NULL | NONE |
-| actor.url | TEXT |
Yes | false | NULL | NONE |
-| created_at | DATETIME |
Yes | false | NULL | NONE |
-| payload | VARIANT |
Yes | false | NULL | NONE |
-| payload.action | TEXT |
Yes | false | NULL | NONE |
-| payload.before | TEXT |
Yes | false | NULL | NONE |
-| payload.comment.author_association | TEXT |
Yes | false | NULL | NONE |
-| payload.comment.body | TEXT |
Yes | false | NULL | NONE |
-....
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-406 rows in set (0.07 sec)
+
+Matched subpaths are materialized as columns by default. If too many paths
match and generate excessive columns, consider enabling
`variant_enable_typed_paths_to_sparse` (see “Configuration”).
+
+## Type conflicts and promotion rules
+
+When incompatible types appear on the same path (e.g., the same field shows up
as both integer and string), the type is promoted to JSONB to avoid information
loss:
+
+```sql
+{"a" : 12345678}
+{"a" : "HelloWorld"}
+-- a will be promoted to JSONB
```
-DESC can be used to specify partition and view the schema of a particular
partition. The syntax is as follows:
-``` sql
-DESCRIBE ${table_name} PARTITION ($partition_name);
+Promotion rules:
+
+| Source type | Current type | Final type |
+| -------------- | ------------- | ------------ |
+| `TinyInt` | `BigInt` | `BigInt` |
+| `TinyInt` | `Double` | `Double` |
+| `TinyInt` | `String` | `JSONB` |
+| `TinyInt` | `Array` | `JSONB` |
+| `BigInt` | `Double` | `JSONB` |
+| `BigInt` | `String` | `JSONB` |
+| `BigInt` | `Array` | `JSONB` |
+| `Double` | `String` | `JSONB` |
+| `Double` | `Array` | `JSONB` |
+| `Array<Double>`| `Array<String>`| `Array<Jsonb>` |
+
+If you need strict types (for stable indexing and storage), declare them via
Schema Template.
+
+## Variant indexes
+
+### Choosing indexes
+
+VARIANT supports BloomFilter and Inverted Index on subpaths.
+- High-cardinality equality/IN filters: prefer BloomFilter (sparser index,
better write performance).
+- Tokenization/phrase/range search: use Inverted Index and set proper
`parser`/`analyzer` properties.
+
+```sql
+...
+PROPERTIES("replication_num" = "1", "bloom_filter_columns" = "v");
+
+-- Use BloomFilter for equality/IN filters
+SELECT * FROM tbl WHERE v['id'] = 12345678;
+SELECT * FROM tbl WHERE v['id'] IN (1, 2, 3);
```
-**Querying**
-
-:::tip
-
-When utilizing filtering and aggregation functionalities to query sub-columns,
additional casting operations need to be performed on sub-columns (because the
storage types are not necessarily fixed and require a unified SQL type).
-For instance, `SELECT * FROM tbl where CAST(var['titile'] as text) MATCH
"hello world"`
-The simplified examples below illustrate how to use VARIANT for querying:
-The following are three typical query scenarios
-
-:::
-
-1. Retrieve the top 5 repositories based on star count from the
`github_events` table.
-
-``` sql
-mysql> SELECT
- -> cast(repo['name'] as text) as repo_name, count() AS stars
- -> FROM github_events
- -> WHERE type = 'WatchEvent'
- -> GROUP BY repo_name
- -> ORDER BY stars DESC LIMIT 5;
-+--------------------------+-------+
-| repo_name | stars |
-+--------------------------+-------+
-| aplus-framework/app | 78 |
-| lensterxyz/lenster | 77 |
-| aplus-framework/database | 46 |
-| stashapp/stash | 42 |
-| aplus-framework/image | 34 |
-+--------------------------+-------+
-5 rows in set (0.03 sec)
+Once an inverted index is created on a VARIANT column, all subpaths inherit
the same index properties (e.g., parser):
+
+```sql
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT,
+ INDEX idx_v(v) USING INVERTED PROPERTIES("parser" = "english")
+);
+
+-- All subpaths inherit the english parser
+SELECT * FROM tbl WHERE v['id_1'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['id_2'] MATCH 'Apache';
```
-2. Retrieve the count of comments containing "doris".
-
-``` sql
--- implicit cast `payload['comment']['body']` to string type
-mysql> SELECT
- -> count() FROM github_events
- -> WHERE payload['comment']['body'] MATCH 'doris';
-+---------+
-| count() |
-+---------+
-| 3 |
-+---------+
-1 row in set (0.04 sec)
+### Index by subpath
+
+In 3.1.x/4.0 and later, you can specify index properties for certain VARIANT
subpaths, and even configure both tokenized and non-tokenized inverted indexes
for the same path. Path-specific indexes require the path type to be declared
via Schema Template.
+
+```sql
+-- Common properties: field_pattern (target path), analyzer, parser,
support_phrase, etc.
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT<'content' : STRING>,
+ INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english",
"field_pattern" = "content"),
+ INDEX idx_v(v) USING INVERTED PROPERTIES("field_pattern" = "content")
+);
+
+-- v.content has both tokenized and non-tokenized inverted indexes
+SELECT * FROM tbl WHERE v['content'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['content'] = 'Doris';
```
-3. Query the issue number with the highest number of comments along with its
corresponding repository.
-
-``` sql
-mysql> SELECT
- -> cast(repo['name'] as string) as repo_name,
- -> cast(payload['issue']['number'] as int) as issue_number,
- -> count() AS comments,
- -> count(
- -> distinct cast(actor['login'] as string)
- -> ) AS authors
- -> FROM github_events
- -> WHERE type = 'IssueCommentEvent' AND (cast(payload["action"] as string)
= 'created') AND (cast(payload["issue"]["number"] as int) > 10)
- -> GROUP BY repo_name, issue_number
- -> HAVING authors >= 4
- -> ORDER BY comments DESC, repo_name
- -> LIMIT 50;
-+--------------------------------------+--------------+----------+---------+
-| repo_name | issue_number | comments | authors |
-+--------------------------------------+--------------+----------+---------+
-| facebook/react-native | 35228 | 5 | 4 |
-| swsnu/swppfall2022-team4 | 27 | 5 | 4 |
-| belgattitude/nextjs-monorepo-example | 2865 | 4 | 4 |
-+--------------------------------------+--------------+----------+---------+
-3 rows in set (0.03 sec)
+Wildcard path indexing:
+
+```sql
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT<'pattern_*' : STRING>,
+ INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english",
"field_pattern" = "pattern_*"),
+ INDEX idx_v(v) USING INVERTED -- global non-tokenized inverted index
+);
+
+SELECT * FROM tbl WHERE v['pattern_1'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['pattern_1'] = 'Doris';
```
-### Nested Array
-```json
-{
- "nested" : [{"field1" : 123, "field11" : "123"}, {"field2" : 456, "field22"
: "456"}]
-}
+Note: 2.1.7+ supports only InvertedIndex V2 properties (fewer files, lower
write IOPS; suitable for disaggregated storage/compute). 2.1.8+ removes offline
Build Index.
+
+### When indexes don’t work
+
+1. Type changes cause index loss: if a subpath changes to an incompatible type
(e.g., INT → JSONB), the index is lost. Fix by pinning types and indexes via
Schema Template.
+2. Query type mismatch:
+ ```sql
+ -- v['id'] is actually STRING; using INT equality causes index not to be
used
+ SELECT * FROM tbl WHERE v['id'] = 123456;
+ ```
+3. Misconfigured index: indexes apply to subpaths, not the entire VARIANT
column.
+ ```sql
+ -- VARIANT itself cannot be indexed as a whole
+ SELECT * FROM tbl WHERE v MATCH 'Doris';
+
+ -- If whole-JSON search is needed, store a duplicate STRING column and
index it
+ CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT,
+ v_str STRING,
+ INDEX idx_v_str(v_str) USING INVERTED PROPERTIES("parser" = "english")
+ );
+ SELECT * FROM tbl WHERE v_str MATCH 'Doris';
+ ```
+
+## INSERT and load
+
+### INSERT INTO VALUES
+
+```sql
+CREATE TABLE IF NOT EXISTS variant_tbl (
+ k BIGINT,
+ v VARIANT
+) PROPERTIES("replication_num" = "1");
+
+INSERT INTO variant_tbl VALUES (1, '{"a" : 123}');
+
+select * from variant_tbl;
++------+-----------+
+| k | v |
++------+-----------+
+| 1 | {"a":123} |
++------+-----------+
+
+-- v['a'] is a VARIANT
+select v['a'] from variant_tbl;
++--------+
+| v['a'] |
++--------+
+| 123 |
++--------+
+
+-- Accessing a non-existent key returns NULL
+select v['a']['no_such_key'] from variant_tbl;;
++-----------------------+
+| v['a']['no_such_key'] |
++-----------------------+
+| NULL |
++-----------------------+
+
```
-In the JSON example above, the array nested contains objects (or nested data
types). It’s important to note that **only one level of array expansion is
currently supported, and the array must be a subfield of an object**. Here is
an example:
-``` sql
--- Note: Set variant_enable_flatten_nested to true
--- This setting enables nested array expansion, allowing array<object>
elements to be stored in columnar format.
--- If set to false, nested arrays will be stored as JSON types.
-CREATE TABLE `simple_nested_test` (
- `k` bigint NULL,
- `v` variant NULL
-) ENGINE=OLAP
-DUPLICATE KEY(`k`)
-DISTRIBUTED BY HASH(`k`) BUCKETS 8
-PROPERTIES (
-"file_cache_ttl_seconds" = "0",
-"is_being_synced" = "false",
-"storage_medium" = "hdd",
-"storage_format" = "V2",
-"inverted_index_storage_format" = "V2",
-"light_schema_change" = "true",
-"disable_auto_compaction" = "false",
-"variant_enable_flatten_nested" = "true",
-"enable_single_replica_compaction" = "false",
-"group_commit_interval_ms" = "10000",
-"group_commit_data_bytes" = "134217728"
-);
-insert into simple_nested_test values(1, '{
- "eventId": 1,
- "firstName": "Name1",
- "lastName": "Eric",
- "body": {
- "phoneNumbers": [
- {
- "number": "1111111111",
- "type": "GSM",
- "callLimit": 5
- },
- {
- "number": "222222222",
- "type": "HOME",
- "callLimit": 3
- },
- {
- "number": "33333333",
- "callLimit": 2,
- "type": "WORK"
- }
- ]
- }
-}');
+### Load (Stream Load)
--- Enable extended column descriptions
-set describe_extend_variant_column = true;
-
--- The DESC command will display expanded columns such as
v.body.phoneNumbers.callLimit, v.body.phoneNumbers.number, and
v.body.phoneNumbers.type
--- These fields are expanded from v.body.phoneNumbers
-mysql> desc simple_nested_test;
-+-------------------------------+----------------+------+-------+---------+-------+
-| Field | Type | Null | Key | Default |
Extra |
-+-------------------------------+----------------+------+-------+---------+-------+
-| k | bigint | Yes | true | NULL |
|
-| v | variant | Yes | false | NULL |
NONE |
-| v.body.phoneNumbers.callLimit | array<tinyint> | Yes | false | NULL |
NONE |
-| v.body.phoneNumbers.number | array<text> | Yes | false | NULL |
NONE |
-| v.body.phoneNumbers.type | array<text> | Yes | false | NULL |
NONE |
-| v.eventId | tinyint | Yes | false | NULL |
NONE |
-| v.firstName | text | Yes | false | NULL |
NONE |
-| v.lastName | text | Yes | false | NULL |
NONE |
-+-------------------------------+----------------+------+-------+---------+-------+
-8 rows in set (0.00 sec)
-
--- Use lateral view (explode_variant_array) to expand arrays and query phone
numbers and event IDs that meet specific criteria
-mysql> select v['eventId'], phone_numbers
- from simple_nested_test lateral view
explode_variant_array(v['body']['phoneNumbers']) tmp1 as phone_numbers
- where phone_numbers['type'] = 'GSM' OR phone_numbers['type'] = 'HOME' and
phone_numbers['callLimit'] > 2;
-+--------------------------+----------------------------------------------------+
-| element_at(v, 'eventId') | phone_numbers
|
-+--------------------------+----------------------------------------------------+
-| 1 |
{"callLimit":5,"number":"1111111111","type":"GSM"} |
-| 1 |
{"callLimit":3,"number":"222222222","type":"HOME"} |
-+--------------------------+----------------------------------------------------+
-2 rows in set (0.02 sec)
+```bash
+# Line-delimited JSON (one JSON record per line)
+curl --location-trusted -u root: -T gh_2022-11-07-3.json \
+ -H "read_json_by_line:true" -H "format:json" \
+ http://127.0.0.1:8030/api/test_variant/github_events/_stream_load
```
-### Usage Restrictions and Best Practices
+See also:
`https://doris.apache.org/docs/dev/data-operate/import/complex-types/variant`
-**There are several limitations when using the VARIANT type:**
-Dynamic columns of VARIANT are nearly as efficient as predefined static
columns. When dealing with data like logs, where fields are often added
dynamically (such as container labels in Kubernetes), parsing JSON and
inferring types can generate additional costs during write operations.
Therefore, it's recommended to keep the number of columns for a single import
below 1000.
+After loading, verify with `SELECT count(*)` or sample with `SELECT * ...
LIMIT 1`. For high-throughput ingestion, prefer RANDOM bucketing and enable
Group Commit.
-Ensure consistency in types whenever possible. Doris automatically performs
compatible type conversions. When a field cannot undergo compatible type
conversion, it is uniformly converted to JSONB type. The performance of JSONB
columns may degrade compared to columns like int or text.
+## Supported operations and CAST rules
-1. tinyint -> smallint -> int -> bigint, integer types can be promoted
following the direction of the arrows.
-2. float -> double, floating-point numbers can be promoted following the
direction of the arrow.
-3. text, string type.
-4. JSON, binary JSON type.
+- VARIANT cannot be compared/operated directly with other types; comparisons
between two VARIANTs are not supported either.
+- For comparison, filtering, aggregation, and ordering, CAST subpaths to
concrete types (explicitly or implicitly).
-When the above types cannot be compatible, they will be transformed into JSON
type to prevent loss of type information. If you need to set a strict schema in
VARIANT, the VARIANT MAPPING mechanism will be introduced soon.
+```sql
+-- Explicit CAST
+SELECT CAST(v['arr'] AS ARRAY<TEXT>) FROM tbl;
+SELECT * FROM tbl WHERE CAST(v['decimal'] AS DECIMAL(27, 9)) = 1.111111111;
+SELECT * FROM tbl WHERE CAST(v['date'] AS DATE) = '2021-01-02';
-**Other limitations include:**
+-- Implicit CAST
+SELECT * FROM tbl WHERE v['bool'];
+SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
+```
+
+- VARIANT itself cannot be used directly in ORDER BY, GROUP BY, as a JOIN KEY,
or as an aggregate argument; CAST subpaths instead.
+- Strings can be implicitly converted to VARIANT.
+
+| VARIANT | Castable | Coercible | Conversion Function |
+| --------------- | -------- | --------- | ------------------- |
+| `ARRAY` | ✔ | ❌ | |
+| `BOOLEAN` | ✔ | ✔ | |
+| `DATE/DATETIME` | ✔ | ✔ | |
+| `FLOAT` | ✔ | ✔ | |
+| `IPV4/IPV6` | ✔ | ✔ | |
+| `DECIMAL` | ✔ | ✔ | |
+| `MAP` | ❌ | ❌ | |
+| `TIMESTAMP` | ✔ | ✔ | |
+| `VARCHAR` | ✔ | ✔ | `PARSE_TO_JSON` |
+| `JSON` | ✔ | ✔ | |
+
+## Limitations
+
+- `variant_max_subcolumns_count`: default 0 (no limit). In production, set to
2048 (tablet level) to control the number of materialized paths. Above the
threshold, low-frequency/sparse paths are moved to a shared data structure;
reading from it may be slower (see “Configuration”).
+- If a path type is specified via Schema Template, that path will be forced to
be materialized; when `variant_enable_typed_paths_to_sparse = true`, it also
counts toward the threshold and may be moved to the shared structure.
+- JSON key length ≤ 255.
+- Cannot be a primary key or sort key.
+- Cannot be nested within other types (e.g., `Array<Variant>`,
`Struct<Variant>`).
+- Reading the entire VARIANT column scans all subpaths. If a column has many
subpaths, consider storing the original JSON string in an extra STRING/JSONB
column for whole-object searches like `LIKE`:
+
+```sql
+CREATE TABLE example_table (
+ id INT,
+ data_variant VARIANT
+);
+SELECT * FROM example_table WHERE data_variant LIKE '%doris%';
-- VARIANT columns can only create inverted indexes or bloom filter to speed up
query.
-- Using the **RANDOM** mode or [group
commit](/docs/data-operate/import/group-commit-manual.md) mode is recommended
for higher write performance.
-- Non-standard JSON types such as date and decimal should ideally use static
types for better performance, since these types are infered to text type.
-- Arrays with dimensions of 2 or higher will be stored as JSONB encoding,
which might perform less efficiently than native arrays.
-- Not supported as primary or sort keys.
-- Queries with filters or aggregations require casting. The storage layer
eliminates cast operations based on storage type and the target type of the
cast, speeding up queries.
-- Reading a VARIANT column inherently involves scanning all its subfields. If
the column contains numerous subfields, this can lead to substantial scan
overhead and negatively impact query performance. To optimize performance when
you need to retrieve the entire column, consider adding an additional column of
type STRING or JSONB to store the raw JSON string. Example:
-``` sql
--- Lead to scan all subfields of data_variant
+-- Better: keep the original JSON string for whole-object matching
CREATE TABLE example_table (
id INT,
+ data_string STRING,
data_variant VARIANT
);
-SELECT * FROM example_table WHERE data_variant LIKE '%doris%'
+SELECT * FROM example_table WHERE data_string LIKE '%doris%';
+```
+
+## Configuration
--- Better performance for `LIKE`
+Starting from 3.1+, VARIANT supports type-level properties on columns:
+
+```sql
CREATE TABLE example_table (
id INT,
- data_variant VARIANT,
- data_string STRING
+ data_variant VARIANT<
+ 'path_1' : INT,
+ 'path_2' : STRING,
+ properties(
+ 'variant_max_subcolumns_count' = '2048',
+ 'variant_enable_typed_paths_to_sparse' = 'true'
+ )
+ >
);
-SELECT * FROM example_table WHERE data_string LIKE '%doris%'
```
-**Tuning Techniques for Column-Count Limits:**
+<table>
+<tr><td>Property<br/></td><td>Description<br/></td></tr>
+<tr><td>`variant_max_subcolumns_count`<br/></td><td>Max number of materialized
paths. Above the threshold, new paths may be stored in a shared data structure.
Default 0 (unlimited). Recommended 2048; do not exceed 10000.<br/></td></tr>
+<tr><td>`variant_enable_typed_paths_to_sparse`<br/></td><td>By default, typed
paths are always materialized (and do not count against
`variant_max_subcolumns_count`). When set to `true`, typed paths also count
toward the threshold and may be moved to the shared structure.<br/></td></tr>
+</table>
+
+Behavior at limits and tuning suggestions:
-Note: If the number of sub-columns exceeds 5,000, higher requirements for
memory and configuration apply. On a single machine, aim for at least 128 GB of
RAM and 32 CPU cores.
+1. After exceeding the threshold, new paths are written into the shared
structure; Rowset merges may also recycle some paths into the shared structure.
+2. The system prefers to keep paths with higher non-null ratios and higher
access frequencies materialized.
+3. Close to 10,000 materialized paths requires strong hardware (≥128G RAM,
≥32C per node recommended).
+4. Ingestion tuning: increase client `batch_size` appropriately, or use Group
Commit (increase `group_commit_interval_ms`/`group_commit_data_bytes` as
needed).
+5. If partition pruning is not needed, consider RANDOM bucketing and enabling
single-tablet loading to reduce compaction write amplification.
+6. BE tuning knobs: `max_cumu_compaction_threads` (≥8),
`vertical_compaction_num_columns_per_group=500` (improves vertical compaction
but increases memory), `segment_cache_memory_percentage=20` (improves metadata
cache efficiency).
+7. Watch Compaction Score; if it keeps rising, compaction is lagging—reduce
ingestion pressure.
+8. Avoid large `SELECT *` on VARIANT; prefer specific projections like `SELECT
v['path']`.
-1. In BE configuration, adjust `variant_max_merged_tablet_schema_size=n`,
where n should be greater than the actual number of columns (not recommended to
exceed 10,000).
+Note: If you see Stream Load error `[DATA_QUALITY_ERROR]Reached max column
size limit 2048` (only on 2.1.x and 3.0.x), it means the merged tablet schema
reached its column limit. You may increase
`variant_max_merged_tablet_schema_size` (not recommended beyond 4096; requires
strong hardware).
-2. Be aware that extracting too many columns will put heavy pressure on
compaction (import throughput must be throttled accordingly). Increasing the
client-side import `batch_size`—based on memory usage—can reduce write
amplification during compaction. Alternatively, enable `group_commit` (a table
property) and appropriately increase `group_commit_interval_ms` and
`group_commit_data_bytes`.
+## Inspect number of columns and types
-3. If your queries do not require bucket pruning, use random bucketing and
enable the
[load_to_single_tablet](../../../../table-design/data-partitioning/data-bucketing#bucketing)
import setting (an import configuration) to reduce compaction write
amplification.
+Approach 1: use `variant_type` to inspect per-row schema (more precise, higher
cost):
+
+```sql
+SELECT variant_type(v) FROM variant_tbl;
+```
+
+Approach 2: extended `DESC` to show materialized subpaths (only those
extracted):
+
+```sql
+SET describe_extend_variant_column = true;
+DESC variant_tbl;
+```
-4. In BE configuration, adjust `max_cumu_compaction_threads` according to
import pressure; ensure at least 8 threads.
+Use both: Approach 1 is precise; Approach 2 is efficient.
-5. In BE configuration, set `vertical_compaction_num_columns_per_group=500` to
improve grouped-compaction efficiency, although this increases memory overhead.
+## Compared with JSON type
-6. In BE configuration, set `segment_cache_memory_percentage=20` to increase
segment cache capacity and improve metadata caching efficiency.
+- Storage: JSON is stored as JSONB (row-oriented). VARIANT is inferred and
materialized into columns on write (higher compression, smaller size).
+- Query: JSON requires parsing. VARIANT scans columns directly and is usually
much faster.
-7. Monitor the Compaction Score closely. A continuously rising score indicates
that compaction cannot keep up (import pressure should be reduced accordingly).
+ClickBench (43 queries):
+- Storage: VARIANT saves ~65% vs JSON.
+- Query: VARIANT is 8x+ faster than JSON, close to predefined static columns.
-8. Using `SELECT *` or `SELECT variant` can significantly increase
cluster-wide pressure, potentially causing timeouts or out-of-memory errors. It
is recommended to include path information in queries—for example, `SELECT
variant['path_1']`.
+**Storage space**
-### FAQ
+| Type | Size |
+| ------------------- | ---------- |
+| Predefined columns | 12.618 GB |
+| VARIANT | 12.718 GB |
+| JSON | 35.711 GB |
-1. Streamload Error: [CANCELLED][INTERNAL_ERROR] tablet error:
[DATA_QUALITY_ERROR] Reached max column size limit 2048.
+**~65% space savings**
-Due to compaction and metadata storage limitations, the VARIANT type imposes a
limit on the number of columns, with the default being 2048 columns. You can
adjust the BE configuration `variant_max_merged_tablet_schema_size`
accordingly, but it is not recommended to exceed 4096 columns(Requires
higher-spec hardware).
+| Run | Predefined | VARIANT | JSON |
+| ----------------| ---------- | ------- | --------------- |
+| First (cold) | 233.79s | 248.66s | Most timed out |
+| Second (hot) | 86.02s | 94.82s | 789.24s |
+| Third (hot) | 83.03s | 92.29s | 743.69s |
-2. Is there a difference between null in the VARIANT type (e.g., `{"key":
null}`) and SQL NULL (i.e., IS NULL)?
+## FAQ
-No, there is no difference — in the VARIANT type, they are considered
equivalent.
+1. Are `null` in VARIANT and SQL `NULL` different?
+ - No. They are equivalent.
+2. Why doesn’t my query/index work?
+ - Check whether you CAST paths to the correct types; whether the type was
promoted to JSONB due to conflicts; or whether you mistakenly expect an index
on the whole VARIANT instead of on subpaths.
-### Keywords
- VARIANT
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
index e9cef16d195..cd16f7079d0 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/VARIANT.md
@@ -9,352 +9,401 @@
## 描述
-在 Doris 2.1 中引入一种新的数据类型 VARIANT,它可以存储半结构化 JSON
数据。它允许存储包含不同数据类型(如整数、字符串、布尔值等)的复杂数据结构,而无需在表结构中提前定义具体的列。VARIANT
类型特别适用于处理复杂的嵌套结构,而这些结构可能随时会发生变化。在写入过程中,该类型可以自动根据列的结构、类型推断列信息,动态合并写入的
schema,并通过将 JSON 键及其对应的值存储为列和动态子列。
+VARIANT 类型用于存储半结构化 JSON 数据,可包含不同基础类型(整数、字符串、布尔等)以及一层数组与嵌套对象。写入时会自动基于 JSON Path
推断子列结构与类型,并将高频路径物化为独立子列,充分利用列式存储和向量化执行,兼顾灵活性与性能。
-### Note
+## 使用 VARIANT 类型
-相比 JSON 类型有以下优势:
+### 建表语法
-1. 存储方式不同,JSON 类型是以二进制 JSONB 格式进行存储,整行 JSON 以行存的形式存储到 segment 文件中。而 VARIANT
类型在写入的时候进行类型推断,将写入的 JSON 列存化。比 JSON 类型有更高的压缩比,存储空间更小。
-2. 查询方式不同,查询不需要进行解析。VARIANT 充分利用 Doris 中列式存储、向量化引擎、优化器等组件给用户带来极高的查询性能。
-下面是基于 clickbench 数据测试的结果:
+建表时将列类型声明为 VARIANT:
-| | 存储空间 |
-|--------------|------------|
-| 预定义静态列 | 12.618 GB |
-| VARIANT 类型 | 12.718 GB |
-| JSON 类型 | 35.711 GB |
+```sql
+CREATE TABLE IF NOT EXISTS ${table_name} (
+ k BIGINT,
+ v VARIANT
+)
+PROPERTIES("replication_num" = "1");
+```
-**节省约 65% 存储容量**
+通过 Schema Template 约束部分 Path 的类型(更多见“扩展类型”):
-| 查询次数 | 预定义静态列 | VARIANT 类型 | JSON 类型 |
-|----------------|--------------|--------------|-----------------|
-| 第一次查询 (cold) | 233.79s | 248.66s | **大部分查询超时** |
-| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s |
-| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s |
+```sql
+CREATE TABLE IF NOT EXISTS ${table_name} (
+ k BIGINT,
+ v VARIANT <
+ 'id' : INT, -- path 为 id 的子列被限制为 INT 类型
+ 'message*' : STRING, -- 前缀匹配 message* 的子列被限制为 STRING 类型
+ 'tags*' : ARRAY<TEXT> -- 前缀匹配 tags* 的子列被限制为 ARRAY<TEXT> 类型
+ >
+)
+PROPERTIES("replication_num" = "1");
+```
-[测试集](https://github.com/ClickHouse/ClickBench/blob/main/doris/queries.sql) 一共
43 个查询语句
+### 查询语法
-**查询提速 8+ 倍,查询性能与静态列相当**
+```sql
+-- 访问嵌套字段(返回类型为 VARIANT,需要显式或隐式 CAST 才能聚合/比较)
+SELECT v['properties']['title'] FROM ${table_name};
-## 举例
+-- 聚合前显式 CAST 为确定类型
+SELECT CAST(v['properties']['title'] AS STRING) AS title
+FROM ${table_name}
+GROUP BY title;
-用一个从建表、导数据、查询全周期的例子说明 VARIANT 的功能和用法。
+-- 数组查询示例
+SELECT *
+FROM ${table_name}
+WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');
+```
-**建表语法**
+## 基本类型
+
+VARIANT 自动推断的子列基础类型包括:
+
+<table>
+<tr><td>支持的类型<br/></td></tr>
+<tr><td>TinyInt<br/></td></tr>
+<tr><td>NULL(等价于 JSON 的 null)<br/></td></tr>
+<tr><td>BigInt(64 bit)<br/>Double<br/></td></tr>
+<tr><td>String(Text)<br/></td></tr>
+<tr><td>Jsonb<br/></td></tr>
+<tr><td>Variant(嵌套对象)<br/></td></tr>
+<tr><td>Array<T>(仅支持一维)<br/></td></tr>
+</table>
+
+简单的 INSERT 示例:
+
+```sql
+INSERT INTO vartab VALUES
+ (1, 'null'),
+ (2, NULL),
+ (3, 'true'),
+ (4, '-17'),
+ (5, '123.12'),
+ (6, '1.912'),
+ (7, '"A quote"'),
+ (8, '[-1, 12, false]'),
+ (9, '{ "x": "abc", "y": false, "z": 10 }'),
+ (10, '"2021-01-01"');
+```
-建表语法关键字 VARIANT
+提示:日期/时间戳等非标准 JSON 类型在未指定 Schema 时会以字符串形式存储;如需较高计算效率,建议将其提取为静态列或在 Schema
Template 中明确声明类型。
-``` sql
--- 无索引
-CREATE TABLE IF NOT EXISTS ${table_name} (
- k BIGINT,
- v VARIANT
-)
-table_properties;
+## 扩展类型(Schema Template)
--- 在v列创建索引,可选指定分词方式,默认不分词
-CREATE TABLE IF NOT EXISTS ${table_name} (
- k BIGINT,
- v VARIANT,
- INDEX idx_var(v) USING INVERTED [PROPERTIES("parser" =
"english|unicode|chinese")] [COMMENT 'your comment']
-)
-table_properties;
+除基本类型外,VARIANT 还可通过 Schema Template 声明以下扩展类型:
--- 在v列创建bloom filter
-CREATE TABLE IF NOT EXISTS ${table_name} (
- k BIGINT,
- v VARIANT
+- Number(扩展)
+ - Decimal:Decimal32 / Decimal64 / Decimal128 / Decimal256
+ - LargeInt
+- Datetime
+- Date
+- IPV4 / IPV6
+- Boolean
+- ARRAY<T>(T 为以上任意类型,仅支持一维)
+
+注意:预定义的 Schema 只能在建表时指定,当前不支持通过 ALTER 修改(后续可能支持“新增”子列定义,但不支持修改既有子列类型)。
+
+示例:
+
+```sql
+CREATE TABLE test_var_schema (
+ id BIGINT NOT NULL,
+ v1 VARIANT<
+ 'large_int_val': LARGEINT,
+ 'string_val': STRING,
+ 'decimal_val': DECIMAL(38, 9),
+ 'datetime_val': DATETIME,
+ 'ip_val': IPV4
+ > NULL
)
-...
-properties("replication_num" = "1", "bloom_filter_columns" = "v");
+PROPERTIES ("replication_num" = "1");
+
+INSERT INTO test_var_schema VALUES (1, '{
+ "large_int_val" : "123222222222222222222222",
+ "string_val" : "Hello World",
+ "decimal_val" : 1.11111111,
+ "datetime_val" : "2025-05-16 11:11:11",
+ "ip_val" : "127.0.0.1"
+}');
+
+SELECT variant_type(v1) FROM test_var_schema;
+
++----------------------------------------------------------------------------------------------------------------------------+
+| variant_type(v1)
|
++----------------------------------------------------------------------------------------------------------------------------+
+|
{"datetime_val":"datetimev2","decimal_val":"decimal128i","ip_val":"ipv4","large_int_val":"largeint","string_val":"string"}
|
++----------------------------------------------------------------------------------------------------------------------------+
```
-**查询语法**
+`{"date": 2020-01-01}` 与 `{"ip": 127.0.0.1}` 均为非法 JSON 文本,正确格式应为 `{"date":
"2020-01-01"}` 与 `{"ip": "127.0.0.1"}`。
-``` sql
--- 使用 v['a']['b'] 形式如下,v['properties']['title']类型是VARIANT
-SELECT v['properties']['title'] from ${table_name}
-```
+一旦指定 Schema,若 JSON 实际类型与 Schema 冲突且无法转换,将保存为 NULL。例如:
-### 基于 github events 数据集示例
+```sql
+INSERT INTO test_var_schema VALUES (1, '{
+ "decimal_val" : "1.11111111",
+ "ip_val" : "127.xxxxxx.xxxx",
+ "large_int_val" : "aaabbccc"
+}');
-这里用 github events 数据展示 VARIANT 的建表、导入、查询。
-下面是格式化后的一行数据
+-- 仅 decimal_val 保留
+SELECT * FROM test_var_schema;
-``` json
-{
- "id": "14186154924",
- "type": "PushEvent",
- "actor": {
- "id": 282080,
- "login": "brianchandotcom",
- "display_login": "brianchandotcom",
- "gravatar_id": "",
- "url": "https://api.github.com/users/brianchandotcom",
- "avatar_url": "https://avatars.githubusercontent.com/u/282080?"
- },
- "repo": {
- "id": 1920851,
- "name": "brianchandotcom/liferay-portal",
- "url": "https://api.github.com/repos/brianchandotcom/liferay-portal"
- },
- "payload": {
- "push_id": 6027092734,
- "size": 4,
- "distinct_size": 4,
- "ref": "refs/heads/master",
- "head": "91edd3c8c98c214155191feb852831ec535580ba",
- "before": "abb58cc0db673a0bd5190000d2ff9c53bb51d04d",
- "commits": [""]
- },
- "public": true,
- "created_at": "2020-11-13T18:00:00Z"
-}
++------+-----------------------------+
+| id | v1 |
++------+-----------------------------+
+| 1 | {"decimal_val":1.111111110} |
++------+-----------------------------+
```
-**建表**
+Schema 仅指导“存储层”的持久化类型,计算逻辑仍以实际数据的动态类型为准:
+
+```sql
+-- 实际 v['a'] 的运行时类型仍可能是 STRING
+SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);
+```
-- 创建了三个 VARIANT 类型的列, `actor`,`repo` 和 `payload`
-- 创建表的同时创建了 `payload` 列的倒排索引 `idx_payload`
-- USING INVERTED 指定索引类型是倒排索引,用于加速子列的条件过滤
-- `PROPERTIES("parser" = "english")` 指定采用 english 分词
+通配符与匹配顺序:
-``` sql
-CREATE DATABASE test_variant;
-USE test_variant;
-CREATE TABLE IF NOT EXISTS github_events (
+```sql
+CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
- type VARCHAR(30) NULL,
- actor VARIANT NULL,
- repo VARIANT NULL,
- payload VARIANT NULL,
- public BOOLEAN NULL,
- created_at DATETIME NULL,
- INDEX idx_payload (`payload`) USING INVERTED PROPERTIES("parser" =
"english") COMMENT 'inverted index for payload'
+ v1 VARIANT<
+ 'enumString*' : STRING,
+ 'enum*' : ARRAY<TEXT>,
+ 'ip*' : IPV6
+ > NULL
)
-DUPLICATE KEY(`id`)
-DISTRIBUTED BY HASH(id) BUCKETS 10
-properties("replication_num" = "1");
+PROPERTIES ("replication_num" = "1");
+
+-- 若 enumString1 同时匹配上述两个 pattern,则采用定义顺序中第一个匹配到的类型(STRING)
```
-**需要注意的是:**
+如列名中包含 `*` 且希望按名称精确匹配,可使用:
-:::tip
+```sql
+v1 VARIANT<
+ MATCH_NAME 'enumString*' : STRING
+> NULL
+```
-1. 在 VARIANT 列上创建索引,比如 payload 的子列很多时,可能会造成索引列过多,影响写入性能
-2. 同一个 VARIANT 列的分词属性是相同的,如果您有不同的分词需求,那么可以创建多个 VARIANT 然后分别指定索引属性
+匹配成功的子路径默认会展开为独立列。若匹配子列过多导致列数暴增,建议开启
`variant_enable_typed_paths_to_sparse`(见“配置”)。
-:::
+## 类型冲突与提升规则
-**使用 streamload 导入**
+当同一路径出现不兼容类型(如同一字段既出现整数又出现字符串)时,将提升为 JSONB 类型以避免信息丢失:
-导入 gh_2022-11-07-3.json,这是 github events 一个小时的数据
+```sql
+{"a" : 12345678}
+{"a" : "HelloWorld"}
+-- a 将被提升为 JSONB
+```
-``` shell
-wget
https://qa-build.oss-cn-beijing.aliyuncs.com/regression/variant/gh_2022-11-07-3.json
+转换规则如下表格:
-curl --location-trusted -u root: -T gh_2022-11-07-3.json -H
"read_json_by_line:true" -H "format:json"
http://127.0.0.1:18148/api/test_variant/github_events/_strea
-m_load
+| 源类型 | 当前类型 | 最终类型 |
+| -------------- | -------------- | -------------- |
+| `TinyInt` | `BigInt` | `BigInt` |
+| `TinyInt` | `Double` | `Double` |
+| `TinyInt` | `String` | `JSONB` |
+| `TinyInt` | `Array` | `JSONB` |
+| `BigInt` | `Double` | `JSONB` |
+| `BigInt` | `String` | `JSONB` |
+| `BigInt` | `Array` | `JSONB` |
+| `Double` | `String` | `JSONB` |
+| `Double` | `Array` | `JSONB` |
+| `Array<Double>`| `Array<String>`| `Array<Jsonb>` |
-{
- "TxnId": 2,
- "Label": "086fd46a-20e6-4487-becc-9b6ca80281bf",
- "Comment": "",
- "TwoPhaseCommit": "false",
- "Status": "Success",
- "Message": "OK",
- "NumberTotalRows": 139325,
- "NumberLoadedRows": 139325,
- "NumberFilteredRows": 0,
- "NumberUnselectedRows": 0,
- "LoadBytes": 633782875,
- "LoadTimeMs": 7870,
- "BeginTxnTimeMs": 19,
- "StreamLoadPutTimeMs": 162,
- "ReadDataTimeMs": 2416,
- "WriteDataTimeMs": 7634,
- "CommitAndPublishTimeMs": 55
-}
-```
+若需严格限制子列类型(以稳定索引和存储),请结合 Schema Template 明确声明类型。
-确认导入成功
-
-``` sql
--- 查看行数
-mysql> select count() from github_events;
-+----------+
-| count(*) |
-+----------+
-| 139325 |
-+----------+
-1 row in set (0.25 sec)
-
--- 随机看一条数据
-mysql> select * from github_events limit 1;
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-| id | type | actor
| repo
| payload
[...]
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-| 25061821748 | PushEvent |
{"gravatar_id":"","display_login":"jfrog-pipelie-intg","url":"https://api.github.com/users/jfrog-pipelie-intg","id":98024358,"login":"jfrog-pipelie-intg","avatar_url":"https://avatars.githubusercontent.com/u/98024358?"}
|
{"url":"https://api.github.com/repos/jfrog-pipelie-intg/jfinte2e_1667789956723_16","id":562683829,"name":"jfrog-pipelie-intg/jfinte2e_1667789956723_16"}
|
{"commits":[{"sha":"334433de436baa198024ef9f55f0647721bcd750","author":{"email":"980243
[...]
-+-------------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------
[...]
-1 row in set (0.23 sec)
-```
+## Variant 索引
+
+### 索引选择
+
+VARIANT 支持对子列建立 BloomFilter 与 Inverted Index 两类索引。
+- 高基数等值/IN 过滤:优先使用 BloomFilter(更省存储、写入更高效)。
+- 需要分词、短语、范围检索:使用 Inverted Index,并根据需求设置 `parser`/`analyzer` 等属性。
-desc 查看 schema 信息,子列会在存储层自动扩展、并进行类型推导
-
-``` sql
-mysql> desc github_events;
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| Field | Type |
Null | Key | Default | Extra |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| id | BIGINT | No
| true | NULL | |
-| type | VARCHAR(*) |
Yes | false | NULL | NONE |
-| actor | VARIANT |
Yes | false | NULL | NONE |
-| created_at | DATETIME |
Yes | false | NULL | NONE |
-| payload | VARIANT |
Yes | false | NULL | NONE |
-| public | BOOLEAN |
Yes | false | NULL | NONE |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-6 rows in set (0.07 sec)
-
-mysql> set describe_extend_variant_column = true;
-Query OK, 0 rows affected (0.01 sec)
-
-mysql> desc github_events;
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| Field | Type |
Null | Key | Default | Extra |
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-| id | BIGINT | No
| true | NULL | |
-| type | VARCHAR(*) |
Yes | false | NULL | NONE |
-| actor | VARIANT |
Yes | false | NULL | NONE |
-| actor.avatar_url | TEXT |
Yes | false | NULL | NONE |
-| actor.display_login | TEXT |
Yes | false | NULL | NONE |
-| actor.id | INT |
Yes | false | NULL | NONE |
-| actor.login | TEXT |
Yes | false | NULL | NONE |
-| actor.url | TEXT |
Yes | false | NULL | NONE |
-| created_at | DATETIME |
Yes | false | NULL | NONE |
-| payload | VARIANT |
Yes | false | NULL | NONE |
-| payload.action | TEXT |
Yes | false | NULL | NONE |
-| payload.before | TEXT |
Yes | false | NULL | NONE |
-| payload.comment.author_association | TEXT |
Yes | false | NULL | NONE |
-| payload.comment.body | TEXT |
Yes | false | NULL | NONE |
-....
-+------------------------------------------------------------+------------+------+-------+---------+-------+
-406 rows in set (0.07 sec)
+```sql
+...
+PROPERTIES("replication_num" = "1", "bloom_filter_columns" = "v");
+
+-- 利用 BloomFilter 做等值/IN 过滤
+SELECT * FROM tbl WHERE v['id'] = 12345678;
+SELECT * FROM tbl WHERE v['id'] IN (1, 2, 3);
```
-desc 可以指定 partition 查看某个 partition 的 schema,语法如下
+给 VARIANT 列创建 Inverted Index 后,所有子列将继承相同的索引属性(如分词方式)。
+
+```sql
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT,
+ INDEX idx_v(v) USING INVERTED PROPERTIES("parser" = "english")
+);
+-- 全部子列继承 english 分词属性
+SELECT * FROM tbl WHERE v['id_1'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['id_2'] MATCH 'Apache';
```
-DESCRIBE ${table_name} PARTITION ($partition_name);
+
+### 根据子路径指定索引
+
+在 3.1.x/4.0 及之后的版本中,可为 VARIANT 的部分子列单独指定索引属性,甚至在同一路径上同时配置“分词与不分词”的两种倒排索引。指定
Path 索引需配合 Path 类型(Schema Template)使用。
+
+```sql
+-- 常用属性:field_pattern(目标子路径)、analyzer、parser、support_phrase 等
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT<'content' : STRING>,
+ INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english",
"field_pattern" = "content"),
+ INDEX idx_v(v) USING INVERTED PROPERTIES("field_pattern" = "content")
+);
+
+-- v.content 同时具备分词与不分词的倒排索引
+SELECT * FROM tbl WHERE v['content'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['content'] = 'Doris';
```
-**查询**
-
-:::tip
-
-**注意**
-如使用过滤和聚合等功能来查询子列,需要对子列进行额外的 cast 操作(因为存储类型不一定是固定的,需要有一个 SQL 统一的类型)。
-例如 SELECT * FROM tbl where CAST(var['titile'] as text) MATCH "hello world"
-以下简化的示例说明了如何使用 VARIANT 进行查询
-
-:::
-
-下面是典型的三个查询场景:
-
-1. 从 github_events 表中获取 top 5 star 数的代码库
-
-``` sql
-mysql> SELECT
- -> cast(repo['name'] as text) as repo_name, count() AS stars
- -> FROM github_events
- -> WHERE type = 'WatchEvent'
- -> GROUP BY repo_name
- -> ORDER BY stars DESC LIMIT 5;
-+--------------------------+-------+
-| repo_name | stars |
-+--------------------------+-------+
-| aplus-framework/app | 78 |
-| lensterxyz/lenster | 77 |
-| aplus-framework/database | 46 |
-| stashapp/stash | 42 |
-| aplus-framework/image | 34 |
-+--------------------------+-------+
-5 rows in set (0.03 sec)
+支持通配符的 Path 索引:
+
+```sql
+CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT<'pattern_*' : STRING>,
+ INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english",
"field_pattern" = "pattern_*"),
+ INDEX idx_v(v) USING INVERTED -- 全局指定非分词索引
+);
+
+SELECT * FROM tbl WHERE v['pattern_1'] MATCH 'Doris';
+SELECT * FROM tbl WHERE v['pattern_1'] = 'Doris';
```
-2. 获取评论中包含 doris 的数量
-
-``` sql
--- implicit cast `payload['comment']['body']` to string type
-mysql> SELECT
- -> count() FROM github_events
- -> WHERE payload['comment']['body'] MATCH 'doris';
-+---------+
-| count() |
-+---------+
-| 3 |
-+---------+
-1 row in set (0.04 sec)
+注意:2.1.7+ 仅支持 InvertedIndex V2 属性(文件更少、写入 IOPS 更低,适配存算分离)。2.1.8+ 不再支持离线 Build
Index 构建。
+
+### 索引失效问题
+
+1. 类型变更导致索引丢失:子列类型发生不兼容变更(如 INT→JSONB)会丢失索引。可通过 Schema Template 固定类型与索引。
+2. 查询类型不匹配:
+ ```sql
+ -- v['id'] 实际为 STRING,按 INT 进行等值会导致索引失效
+ SELECT * FROM tbl WHERE v['id'] = 123456;
+ ```
+3. 索引配置错误:索引作用于“子列”,对 VARIANT 整体无效。
+ ```sql
+ -- v 本身不具备索引能力
+ SELECT * FROM tbl WHERE v MATCH 'Doris';
+
+ -- 若需对整体 JSON 文本建索引,可额外存字符串列并建索引
+ CREATE TABLE IF NOT EXISTS tbl (
+ k BIGINT,
+ v VARIANT,
+ v_str STRING,
+ INDEX idx_v_str(v_str) USING INVERTED PROPERTIES("parser" = "english")
+ );
+ SELECT * FROM tbl WHERE v_str MATCH 'Doris';
+ ```
+
+## insert 与导入
+
+### INSERT INTO VALUES
+
+```sql
+CREATE TABLE IF NOT EXISTS variant_tbl (
+ k BIGINT,
+ v VARIANT
+) PROPERTIES("replication_num" = "1");
+
+INSERT INTO variant_tbl VALUES (1, '{"a" : 123}');
+
+select * from variant_tbl;
++------+-----------+
+| k | v |
++------+-----------+
+| 1 | {"a":123} |
++------+-----------+
+
+-- 其中 v['a'] 是 Variant 类型
+select v['a'] from variant_tbl;
++--------+
+| v['a'] |
++--------+
+| 123 |
++--------+
+
+-- v['a']['no_such_key'] 对于不存在的 JSON 键,将返回 NULL
+select v['a']['no_such_key'] from variant_tbl;;
++-----------------------+
+| v['a']['no_such_key'] |
++-----------------------+
+| NULL |
++-----------------------+
+
```
-3. 查询 comments 最多的 issue 号以及对应的库
-
-``` sql
-mysql> SELECT
- -> cast(repo['name'] as string) as repo_name,
- -> cast(payload['issue']['number'] as int) as issue_number,
- -> count() AS comments,
- -> count(
- -> distinct cast(actor['login'] as string)
- -> ) AS authors
- -> FROM github_events
- -> WHERE type = 'IssueCommentEvent' AND (cast(payload['action'] as string)
= 'created') AND (cast(payload['issue']['number'] as int) > 10)
- -> GROUP BY repo_name, issue_number
- -> HAVING authors >= 4
- -> ORDER BY comments DESC, repo_name
- -> LIMIT 50;
-+--------------------------------------+--------------+----------+---------+
-| repo_name | issue_number | comments | authors |
-+--------------------------------------+--------------+----------+---------+
-| facebook/react-native | 35228 | 5 | 4 |
-| swsnu/swppfall2022-team4 | 27 | 5 | 4 |
-| belgattitude/nextjs-monorepo-example | 2865 | 4 | 4 |
-+--------------------------------------+--------------+----------+---------+
-3 rows in set (0.03 sec)
+### 导入(Stream Load)
+
+```bash
+# 以按行 JSON 为例(每行一条 JSON 记录)
+curl --location-trusted -u root: -T gh_2022-11-07-3.json \
+ -H "read_json_by_line:true" -H "format:json" \
+ http://127.0.0.1:8030/api/test_variant/github_events/_stream_load
```
-### 使用限制和最佳实践
+参考 [variant](../../../../data-operate/import/complex-types/variant.md)
-**VARIANT 类型的使用有以下限制:**
-VARIANT 动态列与预定义静态列几乎一样高效。处理诸如日志之类的数据,在这类数据中,经常通过动态属性添加字段(例如 Kubernetes
中的容器标签)。但是解析 JSON 和推断类型会在写入时产生额外开销。因此,我们建议保持单次导入列数在 1000 以下。
+导入完成后可用 `SELECT count(*)` 或 `SELECT * ... LIMIT 1` 验证。为提升高并发导入性能,推荐建表选择 RANDOM
分桶并开启 Group Commit(参见官方“Group Commit”文档)。
-尽可能保证类型一致,Doris 会自动进行如下兼容类型转换,当字段无法进行兼容类型转换时会统一转换成 JSONB 类型。JSONB 列的性能与
int、text 等列性能会有所退化。
+## 支持的运算与 CAST 规则
-1. tinyint->smallint->int->bigint,整形可以按照箭头做类型提升
-2. float->double,浮点数按照箭头做类型提升
-3. text,字符串类型
-4. JSON,二进制 JSON 类型
+- VARIANT 本身不支持与其他类型直接比较/运算,两个 VARIANT 之间也不支持直接比较。
+- 如需比较、过滤、聚合、排序,请对子列显式或隐式 CAST 到确定类型。
-上诉类型无法兼容时,会变成 JSON 类型防止类型信息丢失,如果您需要在 VARIANT 中设置严格的 schema,即将推出 VARIANT
MAPPING 机制
+```sql
+-- 显式 CAST
+SELECT CAST(v['arr'] AS ARRAY<TEXT>) FROM tbl;
+SELECT * FROM tbl WHERE CAST(v['decimal'] AS DECIMAL(27, 9)) = 1.111111111;
+SELECT * FROM tbl WHERE CAST(v['date'] AS DATE) = '2021-01-02';
-其它限制如下:
+-- 隐式 CAST
+SELECT * FROM tbl WHERE v['bool'];
+SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
+```
-- VARIANT 列只能创建倒排索引或者 bloom filter 来加速过滤
-- **推荐使用 RANDOM 模式和[Group
Commit](../../../../data-operate/import/group-commit-manual) 模式,写入性能更高效**
-- 日期、decimal 等非标准 JSON 类型会被默认推断成字符串类型,所以尽可能从 VARIANT 中提取出来,用静态类型,性能更好
-- 2 维及其以上的数组列存化会被存成 JSONB 编码,性能不如原生数组
-- 不支持作为主键或者排序键
-- 查询过滤、聚合需要带 cast,存储层会根据存储类型和 cast 目标类型来消除 cast 操作,加速查询。
-- 读取整个 VARIANT
列时,会扫描其所有子字段。如果该列包含大量子字段,这可能导致显著的扫描开销,从而影响查询性能。为了优化在需要检索整个列时的性能,建议添加一个 STRING 或
JSONB 类型的附加列,用于存储原始的 JSON 字符串。这种方法允许您直接查询整个 JSON 对象,从而减少扫描成本。
-``` sql
--- 导致扫描 data_variant 的所有子字段
+- VARIANT 本身不可直接用于 ORDER BY、GROUP BY、JOIN KEY 或聚合参数;对子列 CAST 后可正常使用。
+- 字符串类型可隐式转换为 VARIANT。
+
+| VARIANT | Castable | Coercible | Conversion Function |
+| --------------- | -------- | --------- | ------------------- |
+| `ARRAY` | ✔ | ❌ | |
+| `BOOLEAN` | ✔ | ✔ | |
+| `DATE/DATETIME` | ✔ | ✔ | |
+| `FLOAT` | ✔ | ✔ | |
+| `IPV4/IPV6` | ✔ | ✔ | |
+| `DECIMAL` | ✔ | ✔ | |
+| `MAP` | ❌ | ❌ | |
+| `TIMESTAMP` | ✔ | ✔ | |
+| `VARCHAR` | ✔ | ✔ | `PARSE_TO_JSON` |
+| `JSON` | ✔ | ✔ | |
+
+## 限制
+
+- `variant_max_subcolumns_count`:默认 0(不限制 Path 物化列数)。建议在生产设置为 2048(Tablet
级别)以控制列数。超过阈值后,低频/稀疏路径会被收敛到共享数据结构,从该结构查询可能带来性能下降(详见“配置”)。
+- 若 Schema Template 指定了 Path 类型,则该 Path 会被强制提取;当
`variant_enable_typed_paths_to_sparse = true` 时,它也会计入阈值,可能被收敛到共享结构。
+- JSON key 长度 ≤ 255。
+- 不支持作为主键或排序键。
+- 不支持与其他类型嵌套(如 `Array<Variant>`、`Struct<Variant>`)。
+- 读取整个 VARIANT 列会扫描所有子字段。若列包含大量子字段,建议额外存储原始 JSON 的 STRING/JSONB 列,以优化如 `LIKE`
等整体匹配:
+
+```sql
CREATE TABLE example_table (
id INT,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_variant LIKE '%doris%';
--- 对于 `LIKE` 查询,性能更佳
+-- 更优做法:额外保留原始 JSON 字符串列用于整体检索
CREATE TABLE example_table (
id INT,
data_string STRING,
@@ -363,27 +412,88 @@ CREATE TABLE example_table (
SELECT * FROM example_table WHERE data_string LIKE '%doris%';
```
-**针对列数限制的调优手段:**
+## 配置
-注意如果是 超过 5000 子列,对内存和配置有比较高的要求,单机尽可能 128G 以上内存,核数 32C 以上
-1. BE 配置`variant_max_merged_tablet_schema_size=n` n 大于实际的列数(不推荐超过 10000)
-2. 需要注意的是,提取的列数过多会导致 compaction 的压力过大(需要控制导入的吞吐)。根据内存使用情况增大客户端导入的 batch_size
可以降低 compaction 的写放大(或者推荐使用 group_commit,表 properties
配置,适当增加`group_commit_interval_ms` 和 `group_commit_data_bytes`)
-3. 如果查询没有分桶裁剪的需求,可以使用 random 分桶,开启
[load_to_single_tablet](../../../../table-design/data-partitioning/data-bucketing#bucketing)
导入(导入的配置),可以减少 compaction 写放大
-4. BE 配置 根据导入压力调整 `max_cumu_compaction_threads`,至少保证 8 个线程
-5. BE 配置`vertical_compaction_num_columns_per_group=500`提升分组 compaction
效率,但是会增加内存开销销
-6. BE 配置`segment_cache_memory_percentage=20`增加 segment 缓存的容量,提升元数据缓存效率
-7. 注意关注 Compaction Score,Score 过高反应 Compaction 做不过来(需要适当降低导入压力)
-8. `SELECT *` 或者 `SELECT variant` 会导致集群整体压力明显上升,甚至出现超时或者内存超限。建议查询带上 Path 信息例如
`SELECT variant['path_1']`。
+在 3.1+ 支持在 VARIANT 类型上声明列级别属性:
-### FAQ
-1. Stream Load 报错: [CANCELLED][INTERNAL_ERROR]tablet error:
[DATA_QUALITY_ERROR]Reached max column size limit 2048。
+```sql
+CREATE TABLE example_table (
+ id INT,
+ data_variant VARIANT<
+ 'path_1' : INT,
+ 'path_2' : STRING,
+ properties(
+ 'variant_max_subcolumns_count' = '2048',
+ 'variant_enable_typed_paths_to_sparse' = 'true'
+ )
+ >
+);
+```
+
+<table>
+<tr><td>属性<br/></td><td>描述<br/></td></tr>
+<tr><td>`variant_max_subcolumns_count`<br/></td><td>控制 Path
物化列数的上限;超过后新增路径可能存放于共享数据结构。默认 0 表示不限,建议设置为 2048;不推荐超过 10000。<br/></td></tr>
+<tr><td>`variant_enable_typed_paths_to_sparse`<br/></td><td>默认指定了 Path 类型后,该
Path 一定会被提取(不计入 `variant_max_subcolumns_count`)。设置为 `true`
后也会计入阈值,可能被收敛到共享结构。<br/></td></tr>
+</table>
+
+达到上限后的行为与调优建议:
+
+1. 超过上限后,新路径写入共享结构;Rowset 合并后也可能触发部分路径回收为共享结构。
+2. 系统会优先保留非空比例高、访问频率高的路径为物化列。
+3. 若接近 10000 物化列,对硬件要求较高(建议单机 ≥128G 内存、≥32C)。
+4. 写入侧调优:适度增大客户端 batch_size,或使用 Group Commit(按需增大
`group_commit_interval_ms`/`group_commit_data_bytes`)。
+5. 若无分桶裁剪需求,建议采用 RANDOM 分桶,并开启 single tablet 导入以降低 compaction 写放大。
+6. BE 配置可按导入压力调整 `max_cumu_compaction_threads`(建议
≥8)、`vertical_compaction_num_columns_per_group=500`(提升纵向合并效率,增加内存占用)、`segment_cache_memory_percentage=20`(提升元数据缓存命中)。
+7. 关注 Compaction Score;若持续上升说明 Compaction 跟不动,需要降低导入压力。
+8. 避免大范围 `SELECT *` 或直接扫描 VARIANT;尽量使用具体路径投影 `SELECT v['path']`。
+
+另:当出现 Stream Load 报错 `[DATA_QUALITY_ERROR]Reached max column size limit 2048`
时(只有 2.1.x 和 3.0.x 版本会出现该报错),说明合并后的 Tablet Schema 达到列数上限。可按需调整 BE 配置
`variant_max_merged_tablet_schema_size`(不建议超过 4096,需较高配置机器)。
+
+## 查看列数、列类型
+
+方案一:使用 `variant_type` 查看行级 Schema(开销更大但更精确):
+
+```sql
+SELECT variant_type(v) FROM variant_tbl;
+```
+
+方案二:扩展 `DESC` 展示已物化的子列(仅展示被提取的路径):
+
+```sql
+SET describe_extend_variant_column = true;
+DESC variant_tbl;
+```
+
+两种方式可结合使用:方案一精确、方案二高效。
- 由于 Compaction 和元信息存储限制,VARIANT 类型会限制列数,默认 2048 列,可以适当调整 BE 配置
`variant_max_merged_tablet_schema_size`,但是不建议超过 4096(依赖较高配置机型)
+## 对比 JSON 类型
-2. VARIANT 类型中的 null(例如 `{"key": null}` )和 SQL 中的 NULL(即 IS NULL)有区别吗?
+- 存储:JSON 类型以 JSONB(行存)写入;VARIANT 写入时类型推断并列存化,压缩率更高、存储更小。
+- 查询:JSON 需解析;VARIANT 直接列式扫描,通常显著更快。
+
+改造的 ClickBench 测试结果(43 条查询):
+- 存储:VARIANT 相比 JSON 约节省 65% 存储空间。
+- 查询:VARIANT 较 JSON 提速 8 倍以上,性能接近静态列。
+
+**存储空间**
+
+| 类型 | 存储空间 |
+| -------------- | ---------- |
+| 预定义静态列 | 12.618 GB |
+| VARIANT 类型 | 12.718 GB |
+| JSON 类型 | 35.711 GB |
+
+**节省约 65% 存储容量**
- 没有区别,在 VARIANT 类型中,它们被视为等价的。
+| 查询次数 | 预定义静态列 | VARIANT 类型 | JSON 类型 |
+| -------------- | ------------ | ------------ | -------------- |
+| 第一次查询 (cold) | 233.79s | 248.66s | 大部分查询超时 |
+| 第二次查询 (hot) | 86.02s | 94.82s | 789.24s |
+| 第三次查询 (hot) | 83.03s | 92.29s | 743.69s |
-### Keywords
+## FAQ
- VARIANT
+1. VARIANT 中的 `null` 与 SQL `NULL` 有区别吗?
+ - 没有区别,两者等价。
+2. 为什么我的查询/索引没有生效?
+ - 请检查是否对路径做了正确的 CAST、是否因为类型冲突被提升为 JSONB、或是否误以为给 VARIANT“整体”建的索引可用于子列。
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]