Calvin Kirs created GSOC-252:
--------------------------------

             Summary: [GSoC][Doris]Dictionary encoding optimization
                 Key: GSOC-252
                 URL: https://issues.apache.org/jira/browse/GSOC-252
             Project: Comdev GSOC
          Issue Type: New Feature
            Reporter: Calvin Kirs


h2. Background

Apache Doris is a modern data warehouse for real-time analytics.
It delivers lightning-fast analytics on real-time data at scale.

h2. Objectives

Dictionary encoding optimization
To save storage space, Doris dictionary-encodes string-type columns in the 
storage layer when their cardinality is relatively low. Dictionary encoding 
maps each distinct string value to an integer code. The column data is then 
stored as integers, and the dictionary itself is stored separately. On read, 
the integer codes are converted back to their corresponding string values by 
looking them up in the dictionary.
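
The mapping described above can be illustrated with a minimal C++ sketch. It 
is not Doris code: the type DictEncoded and the functions dict_encode and 
dict_decode are hypothetical names chosen for this example, and the real 
storage layer works on pages and binary formats rather than in-memory vectors.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type; not a Doris class.
struct DictEncoded {
    std::vector<std::string> dictionary;  // code -> string value
    std::vector<uint32_t> codes;          // one integer code per input row
};

// Assigns codes in order of first appearance and stores rows as integers.
DictEncoded dict_encode(const std::vector<std::string>& values) {
    DictEncoded out;
    std::unordered_map<std::string, uint32_t> index;  // string value -> code
    for (const auto& value : values) {
        auto it = index.find(value);
        if (it == index.end()) {
            const auto code = static_cast<uint32_t>(out.dictionary.size());
            it = index.emplace(value, code).first;
            out.dictionary.push_back(value);
        }
        out.codes.push_back(it->second);
    }
    return out;
}

// Converts the integer codes back to strings via the dictionary.
// at() throws std::out_of_range if a code is not in the dictionary.
std::vector<std::string> dict_decode(const DictEncoded& enc) {
    std::vector<std::string> values;
    values.reserve(enc.codes.size());
    for (uint32_t code : enc.codes) {
        values.push_back(enc.dictionary.at(code));
    }
    return values;
}
```

With a low-cardinality input such as six rows drawn from three country codes, 
the dictionary holds three strings and each row costs one small integer. When 
nearly every value is distinct, the dictionary grows about as large as the 
data itself, which is the case the rest of this issue is concerned with.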

The storage layer does not know whether a column has low or high cardinality 
when the data arrives. The current implementation encodes the first page with 
dictionary encoding; if the dictionary grows too large, the column is treated 
as high-cardinality and subsequent pages do not use dictionary encoding. 
However, even for high-cardinality columns a dictionary page is still 
retained, which saves no storage space and adds memory overhead during reading 
as well as extra CPU overhead during decoding.

Optimizations can be made to reduce the memory and CPU overhead caused by 
dictionary encoding.
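
The fallback behavior described above can be modeled with a small C++ sketch. 
It is a simplified assumption, not the Doris implementation: the class 
ColumnWriterSketch, the enum PageEncoding, and the max_dict_bytes threshold 
are hypothetical, and the dictionary size is approximated by the summed length 
of its distinct strings.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

enum class PageEncoding { kDict, kPlain };

// Hypothetical model of a column writer; names and threshold are illustrative.
class ColumnWriterSketch {
public:
    explicit ColumnWriterSketch(std::size_t max_dict_bytes)
            : max_dict_bytes_(max_dict_bytes) {}

    // Records how one page would be encoded under the described policy.
    void add_page(const std::vector<std::string>& page) {
        if (fallen_back_) {
            pages_.push_back(PageEncoding::kPlain);
            return;
        }
        for (const auto& value : page) {
            if (dict_.insert(value).second) {
                dict_bytes_ += value.size();  // only new entries grow the dictionary
            }
        }
        pages_.push_back(PageEncoding::kDict);
        // Once the dictionary is too large, later pages skip dictionary encoding,
        // but the dictionary built so far is still kept.
        if (dict_bytes_ > max_dict_bytes_) {
            fallen_back_ = true;
        }
    }

    std::size_t dict_pages() const { return count(PageEncoding::kDict); }
    std::size_t plain_pages() const { return count(PageEncoding::kPlain); }
    std::size_t dict_bytes() const { return dict_bytes_; }
    bool fallen_back() const { return fallen_back_; }

private:
    std::size_t count(PageEncoding encoding) const {
        std::size_t n = 0;
        for (PageEncoding p : pages_) {
            if (p == encoding) ++n;
        }
        return n;
    }

    std::size_t max_dict_bytes_;
    std::size_t dict_bytes_ = 0;
    bool fallen_back_ = false;
    std::unordered_set<std::string> dict_;
    std::vector<PageEncoding> pages_;
};
```

In this model, a high-cardinality column ends up with one dictionary-encoded 
page, many plain pages, and a retained dictionary that only the first page 
refers to. Comparing dict_pages() against dict_bytes() gives a rough picture 
of the overhead this issue asks to reduce.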

h2. Recommended Skills
 
Familiar with C++ programming
Familiar with the storage layer of Doris
 
h2. Mentor
 
Mentor: Xin Liao, Apache Doris Committer, liaoxin...@gmail.com
Mentor: YongQiang Yang, Apache Doris PMC Member, dataroar...@gmail.com
Mailing List: d...@doris.apache.org
Website: https://doris.apache.org
Source Code: https://github.com/apache/doris
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
