eldenmoon opened a new issue, #26225: URL: https://github.com/apache/doris/issues/26225
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Description

## What is the VARIANT type?

To address the challenges posed by semi-structured data, Doris has introduced a new data type called VARIANT. This type is designed to store semi-structured data such as JSON. It can hold complex data structures with different data types (e.g., integers, strings, booleans) without specific columns being defined in the table structure beforehand. It is expected to be the preferred data type for semi-structured data, providing users with a more efficient mechanism for data processing. This is a fundamental change in storage and querying compared to Doris' traditional types such as String and JSONB.

The VARIANT type is particularly useful for handling complex nested structures that may change dynamically. During the write process, this type automatically infers column information from the structure and types of the data and merges it into the existing table schema. By storing JSON keys and their corresponding values as columns and dynamic subcolumns, VARIANT fully leverages Doris' columnar storage, vectorized engine, optimizer, and other components to deliver exceptional query performance and storage cost-effectiveness.

## Design details

### Write process

Memtable Flush Column Splitting Logic

Type conflict handling rule: find the least common type.

- Numeric types: Tinyint -> Smallint -> Int -> Bigint. If a float type is encountered, look for the higher bit type.
- String type
- Array type
- JSON type (stored in actual format as JSONB)

The JSON type is the common type for all types: any other type can be promoted to it.

The storage layer is as below:

The above logic ensures that during a memtable flush, columns with the same name have identical types. Different flushes may produce different types for the same column. Each segment stores the data and the types of its columns.
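The least-common-type rule can be sketched in a few lines of Python. This is only an illustration of the rule as described in this issue, not the Doris implementation: the function names `least_common_type` and `merge_schemas` are hypothetical, types are modelled as plain lowercase strings, and two points are assumptions: an integer mixed with a floating type is promoted to `double` (which matches the `b(float)` / `b(int)` case in the two-segment example of this issue), and any other mismatch, including two different array types, falls back to `json`.

```python
# Hypothetical sketch of the "least common type" rule; not Doris source code.
INT_RANK = {"tinyint": 0, "smallint": 1, "int": 2, "bigint": 3}
FLOAT_TYPES = {"float", "double"}
NUMERIC_TYPES = set(INT_RANK) | FLOAT_TYPES


def least_common_type(a: str, b: str) -> str:
    """Return the narrowest type that can hold values of both a and b."""
    if a == b:
        return a
    # Integers widen along Tinyint -> Smallint -> Int -> Bigint.
    if a in INT_RANK and b in INT_RANK:
        return a if INT_RANK[a] >= INT_RANK[b] else b
    # Assumption: any other numeric mix (int/float, float/double) becomes double.
    if a in NUMERIC_TYPES and b in NUMERIC_TYPES:
        return "double"
    # JSON is the common type for everything else.
    return "json"


def merge_schemas(*schemas: dict) -> dict:
    """Union the columns of several schemas, resolving type conflicts."""
    merged = {}
    for schema in schemas:
        for name, typ in schema.items():
            if name in merged:
                merged[name] = least_common_type(merged[name], typ)
            else:
                merged[name] = typ
    return merged


segment1 = {"a": "int", "b": "float", "c": "string"}
segment2 = {"a": "bigint", "b": "int", "c": "int", "d": "array<int>"}
rowset_schema = merge_schemas(segment1, segment2)
print(rowset_schema)
# {'a': 'bigint', 'b': 'double', 'c': 'json', 'd': 'array<int>'}
```

The same `merge_schemas` call could in principle be applied to rowset schemas before compaction, since that step is described as the same per-subcolumn merge at a coarser level.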
The same column in the same rowset may have different types. The rowset meta stores the deduced common type and the union of columns. For JSON types, the common type covers all types.

Example:

- Segment 1: a(int), b(float), c(string)
- Segment 2: a(bigint), b(int), c(int), d(array<int>)
- Rowset Schema: a(bigint), b(double), c(variant/json), d(array<int>)

Before compaction, the variant types of the input rowsets need to be merged (find the least common type of each subcolumn and merge them all) to produce the output schema, which becomes the final rowset schema. E.g.:

After merging, the output schema becomes:

### Read process

### How to use

### Use case

_No response_

### Related issues

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
