Calvin Kirs created GSOC-252:
--------------------------------

Summary: [GSoC][Doris]Dictionary encoding optimization
Key: GSOC-252
URL: https://issues.apache.org/jira/browse/GSOC-252
Project: Comdev GSOC
Issue Type: New Feature
Reporter: Calvin Kirs

h2. Background

Apache Doris is a modern data warehouse for real-time analytics. It delivers lightning-fast analytics on real-time data at scale.

h2. Objectives

Dictionary encoding optimization

To save storage space, Doris uses dictionary encoding for string-type data in the storage layer when the column's cardinality is relatively low. Dictionary encoding maps each distinct string value to an integer. The column data is then stored as integers, and the dictionary is stored separately. On read, the integers are converted back to their string values by looking them up in the dictionary.

The storage layer does not know whether a column has low or high cardinality when the data arrives. The current implementation therefore encodes the first page with dictionary encoding. If the dictionary grows too large, the column is treated as high-cardinality and subsequent pages do not use dictionary encoding.

However, even for high-cardinality columns the dictionary page is still retained. It saves no storage space, adds memory overhead during reading, and adds CPU overhead during decoding. The objective is to reduce the memory and CPU overhead that dictionary encoding causes in this case.

h2. Recommended Skills

Familiar with C++ programming
Familiar with the storage layer of Doris

h2. Mentor

Mentor: Xin Liao, Apache Doris Committer, liaoxin...@gmail.com
Mentor: YongQiang Yang, Apache Doris PMC Member, dataroar...@gmail.com
Mailing List: d...@doris.apache.org
Website: https://doris.apache.org
Source Code: https://github.com/apache/doris
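
For readers new to the topic, the behavior described under Objectives can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Doris code: `EncodedPage`, `DictColumnWriter`, `decode_page`, and the `max_dict_entries` limit are hypothetical names, and the real storage layer works on binary pages rather than in-memory vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory stand-in for an on-disk data page.
struct EncodedPage {
    bool dict_encoded = false;         // true: `codes` holds dictionary indices
    std::vector<std::uint32_t> codes;  // filled only when dict_encoded is true
    std::vector<std::string> plain;    // filled only when dict_encoded is false
};

// Hypothetical writer: dictionary-encodes pages until the dictionary
// exceeds a size limit, then writes all later pages as plain strings.
class DictColumnWriter {
public:
    explicit DictColumnWriter(std::size_t max_dict_entries)
            : max_dict_entries_(max_dict_entries) {}

    EncodedPage add_page(const std::vector<std::string>& values) {
        EncodedPage page;
        if (high_cardinality_) {
            page.plain = values;  // fallback: no dictionary lookup for this page
            return page;
        }
        page.dict_encoded = true;
        page.codes.reserve(values.size());
        for (const std::string& v : values) {
            auto it = index_.find(v);
            if (it == index_.end()) {
                // First occurrence: assign the next free integer code.
                it = index_.emplace(v, static_cast<std::uint32_t>(dict_.size())).first;
                dict_.push_back(v);
            }
            page.codes.push_back(it->second);
        }
        // Assumption: a dictionary past the limit signals a high-cardinality column.
        if (dict_.size() > max_dict_entries_) {
            high_cardinality_ = true;
        }
        return page;
    }

    // The dictionary is kept even after the fallback, because earlier
    // pages still reference it. This is the retained overhead.
    const std::vector<std::string>& dictionary() const { return dict_; }
    bool high_cardinality() const { return high_cardinality_; }

private:
    std::size_t max_dict_entries_;
    bool high_cardinality_ = false;
    std::vector<std::string> dict_;
    std::unordered_map<std::string, std::uint32_t> index_;
};

// Converts a page back to strings; dictionary pages need a lookup per value.
std::vector<std::string> decode_page(const EncodedPage& page,
                                     const std::vector<std::string>& dict) {
    if (!page.dict_encoded) {
        return page.plain;
    }
    std::vector<std::string> out;
    out.reserve(page.codes.size());
    for (std::uint32_t code : page.codes) {
        assert(code < dict.size());
        out.push_back(dict[code]);
    }
    return out;
}
```

In this sketch, a writer created with a limit of 2 that receives a first page containing three distinct strings still dictionary-encodes that page, then switches to plain encoding for every later page. The dictionary built for the first page stays in memory and must be consulted whenever that page is decoded, which mirrors the overhead the proposal wants to reduce.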