[ https://issues.apache.org/jira/browse/GSOC-301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Calvin Kirs updated GSOC-301:
-----------------------------
Description: 
h2. *Synopsis*

Apache Doris is a real-time data warehouse that utilizes columnar storage. Currently, Doris applies default encoding methods based on column data types. This project aims to evaluate the efficiency of these default encodings (e.g., encoding/decoding time and compression ratios) using benchmark datasets like TPC-DS, HTTP logs, and TPC-H. The findings will guide optimizations to improve performance.

h2. *Key Objectives*
* *A.* Develop a tool to evaluate encoding efficiency. The tool will take a column of data and an encoding method as input and output metrics such as compression ratio and processing speed.
* *B.* Optimize dictionary encoding for string columns. Current implementations apply dictionary encoding by default without evaluating data suitability, leading to inefficiencies for non-dictionary-friendly data.
* *C.* Assess the effectiveness of BitShuffle encoding for enhancing downstream compression.

h2. *Benefits to the Community*
* Improve data compression efficiency in Apache Doris.
* Enhance query performance through optimized encoding/decoding.

h2. *Technical Details*
* {*}Languages/Tools{*}: C++ for encoding logic, GitHub for version control.
* {*}Methodology{*}:
** Benchmark existing encoding methods (e.g., dictionary, BitShuffle).
** Develop an evaluation framework to measure compression ratios and processing overhead.
** Implement optimizations for specific data types and use cases.

h2. *Timeline (12+ Weeks, Full-Time Commitment - 30 hrs/week)*
# *Community Bonding (Weeks 1-2)*
## Engage with mentors and the Doris community.
## Set up the development environment and study the codebase.
## Document current column encoding strategies for all data types.
# *Phase 1: Planning & Initial Development (Weeks 3-6)*
## Build a tool to evaluate encoding schemes across data types.
## Run benchmarks using TPC-DS, HTTP logs, and TPC-H datasets.
# *Phase 2: Analysis & Optimization (Weeks 7-10)*
## {*}Optimize Dictionary Encoding{*}: Automatically detect and skip non-dictionary-friendly data (e.g., high-cardinality strings).
## {*}BitShuffle Evaluation{*}: Quantify its impact on compression ratios and processing speed.
## Address additional optimization opportunities identified during analysis.
# *Phase 3: Finalization & Refinement (Weeks 11-12+)*
## Refine code and documentation based on community feedback.
## Submit PRs and ensure their merge into the Doris master branch.

🔹 {*}Total Effort{*}: 350+ hours

h2. *Expected Outcomes*
# A tool to evaluate encoding efficiency for all Doris column types.
# Optimized dictionary encoding logic with automated suitability checks.
# Improved BitShuffle integration for enhanced compression.
# Additional optimizations identified during the project.

This project will strengthen Apache Doris’s performance in real-time analytics scenarios while fostering collaboration within the open-source community.

> Apache Doris:Evaluating Column Encoding and Optimization
> ---------------------------------------------------------
>
>                 Key: GSOC-301
>                 URL: https://issues.apache.org/jira/browse/GSOC-301
>             Project: Comdev GSOC
>          Issue Type: Wish
>            Reporter: Calvin Kirs
>            Priority: Major
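
For illustration only, the "automated suitability check" for dictionary encoding described in Objective B could look roughly like the C++ sketch below. It is not taken from the Doris codebase: the names {{EncodingEstimate}}, {{estimate_string_encodings}} and {{is_dictionary_friendly}}, the 4-byte length prefix, the 4-byte dictionary code width, and the 0.5 cardinality threshold are all assumptions chosen to keep the example small. The idea is to compare an estimated plain-encoded size against an estimated dictionary-encoded size and to skip dictionary encoding for high-cardinality data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical size estimates for one string column; not a Doris API.
struct EncodingEstimate {
    std::size_t plain_bytes;   // every value stored with a length prefix
    std::size_t dict_bytes;    // distinct values once + one code per row
    double cardinality_ratio;  // distinct values / total values
};

// Estimate plain vs. dictionary-encoded size.
// Assumption: 4-byte length prefix per stored string, 4-byte dictionary codes.
inline EncodingEstimate estimate_string_encodings(
        const std::vector<std::string>& column) {
    constexpr std::size_t kLengthPrefix = 4;
    constexpr std::size_t kCodeWidth = 4;

    std::unordered_set<std::string> distinct;
    std::size_t plain = 0;
    std::size_t dict_entries = 0;
    for (const std::string& value : column) {
        plain += kLengthPrefix + value.size();
        // Only the first occurrence of a value costs dictionary space.
        if (distinct.insert(value).second) {
            dict_entries += kLengthPrefix + value.size();
        }
    }
    const std::size_t dict = dict_entries + kCodeWidth * column.size();
    const double ratio =
            column.empty() ? 0.0
                           : static_cast<double>(distinct.size()) /
                                     static_cast<double>(column.size());
    return {plain, dict, ratio};
}

// Heuristic: dictionary encoding is considered worthwhile only if the column
// has few distinct values relative to its length AND the estimated
// dictionary-encoded size is smaller than the plain size.
inline bool is_dictionary_friendly(const std::vector<std::string>& column,
                                   double max_cardinality_ratio = 0.5) {
    assert(max_cardinality_ratio > 0.0 && max_cardinality_ratio <= 1.0);
    if (column.empty()) {
        return false;
    }
    const EncodingEstimate e = estimate_string_encodings(column);
    return e.cardinality_ratio <= max_cardinality_ratio &&
           e.dict_bytes < e.plain_bytes;
}
```

Under these assumptions, a column of repeated HTTP methods such as "GET", "POST" and "PUT" would be classified as dictionary-friendly, while a column of unique identifiers would not. A real check would need to use the actual page layout and code width of the encoder being evaluated, and would likely sample the column rather than scan it in full.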