eldenmoon opened a new issue, #26225:
URL: https://github.com/apache/doris/issues/26225

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   ## What is the VARIANT type?
   To address the challenges posed by semi-structured data, Doris has 
introduced a new data type called VARIANT. This type is designed to store 
semi-structured data such as JSON. It can hold complex data structures 
containing different data types (e.g., integers, strings, booleans) without 
the need to define specific columns in the table structure beforehand. 
It is expected to become the preferred data type for semi-structured data. 
Compared with Doris' traditional types like String and JSONB, VARIANT 
fundamentally changes how such data is stored and queried.
   
   The VARIANT type is particularly useful for handling complex nested 
structures that may change dynamically. During the write process, it 
automatically infers column information from the structure and types of the 
data and merges that information into the existing table schema. By storing 
JSON keys and their corresponding values as columns and dynamic subcolumns, 
VARIANT takes full advantage of Doris' columnar storage, vectorized engine, 
optimizer, and other components to deliver good query performance and 
cost-effective storage.
   ## Design details
   
   ### write process
   Memtable Flush Column Splitting Logic
   
   
![image](https://github.com/apache/doris/assets/64513324/a0a99bb1-c26c-438a-a231-850d7294823c)
   
   Type conflict handling rule: find the least common type.
   
   - Numeric types: Tinyint -> Smallint -> Int -> Bigint. If a float type is 
encountered, promote to the higher-bit type.
   - String type
   - Array type
   - JSON type (physically stored as JSONB)
   
   JSON is the common type of all types: a conflict that cannot be resolved 
within the categories above falls back to JSON.
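   
   The rule above can be sketched as a small pairwise function. This is only 
an illustration, not the Doris implementation: the name `least_common_type`, 
the lowercase type strings, and the choice that any integer mixed with a float 
type resolves to `double` are assumptions (the last one is inferred from the 
`b(float)` / `b(int)` example later in this issue).

```python
# Hypothetical sketch of the "least common type" rule; not Doris code.
INT_RANK = {"tinyint": 0, "smallint": 1, "int": 2, "bigint": 3}
FLOAT_TYPES = {"float", "double"}
NUMERIC_TYPES = set(INT_RANK) | FLOAT_TYPES


def least_common_type(a: str, b: str) -> str:
    """Return a type that can hold values of both `a` and `b`."""
    if a == b:
        return a
    # Two integers: keep the wider one (Tinyint -> Smallint -> Int -> Bigint).
    if a in INT_RANK and b in INT_RANK:
        return a if INT_RANK[a] >= INT_RANK[b] else b
    # Any other numeric mix involves a float type.
    # Assumption: promote to the widest floating-point type, double.
    if a in NUMERIC_TYPES and b in NUMERIC_TYPES:
        return "double"
    # Everything else (string vs. numeric, array vs. scalar, ...) falls back
    # to JSON, which is the common type of all types.
    return "json"
```

   In this sketch, differing array types also fall back to `json`; resolving 
the element types recursively would be a possible refinement.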
   
   The storage layer works as follows:
   The above logic ensures that, within a single memtable flush, columns with 
the same name have identical types. Different flushes may produce different 
types for the same column.
   Each segment stores the data and type of each column, so the same column 
may have different types in different segments of the same rowset. The rowset 
meta stores the deduced common type of each column and the union of all 
columns. JSON acts as the common type that covers all types.
   Example:
   Segment 1: a(int), b(float), c(string)
   Segment 2: a(bigint), b(int), c(int), d(array<int>)
   Rowset Schema: a(bigint), b(double), c(variant/json), d(array<int>)
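   
   As a rough illustration of how the rowset schema in this example could be 
derived, the sketch below takes the union of the segment columns and resolves 
each conflict to a common type. The names `common_type` and `merge_schemas`, 
and the representation of a schema as a dict of lowercase type strings, are 
hypothetical and not taken from the Doris code base; `json` stands for the 
`variant/json` entry above.

```python
from functools import reduce

# Hypothetical type names; integers are listed from narrow to wide.
INTS = ["tinyint", "smallint", "int", "bigint"]
NUMERICS = INTS + ["float", "double"]


def common_type(a: str, b: str) -> str:
    """Resolve a type conflict between two versions of the same column."""
    if a == b:
        return a
    if a in INTS and b in INTS:
        return max(a, b, key=INTS.index)  # keep the wider integer
    if a in NUMERICS and b in NUMERICS:
        return "double"  # assumption: a mix involving a float type widens to double
    return "json"  # JSON covers all types


def merge_schemas(left: dict, right: dict) -> dict:
    """Union of columns, with conflicting types resolved to a common type."""
    merged = dict(left)
    for name, col_type in right.items():
        merged[name] = common_type(merged[name], col_type) if name in merged else col_type
    return merged


segment_1 = {"a": "int", "b": "float", "c": "string"}
segment_2 = {"a": "bigint", "b": "int", "c": "int", "d": "array<int>"}

rowset_schema = reduce(merge_schemas, [segment_1, segment_2])
# rowset_schema == {"a": "bigint", "b": "double", "c": "json", "d": "array<int>"}
```

   The same fold could presumably be applied across rowset schemas before 
compaction, since that step is described as the same kind of merge.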
   
   Before compaction, the variant types of the input rowsets need to be 
merged: find the least common type of each subcolumn and take the union of all 
subcolumns. The result is the output schema, which becomes the final rowset 
schema. E.g.:
   
![image](https://github.com/apache/doris/assets/64513324/18a03266-5401-4661-9d76-160349f6adf6)
   After merging, the output schema becomes:
   
![image](https://github.com/apache/doris/assets/64513324/507d47f9-01ca-48ad-af92-5004cefcf943)
   
   
   
   
   ### read process
   
   ### How to use 
   
   ### Use case
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

