+1. Not keeping the partition values as a column (the folder name already has them) is a great way to reduce the store size. We might also have to handle compatibility and support refresh table.
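To make the read side concrete, here is a minimal sketch of recovering partition values from the folder name and filling them into a row that was stored without them. It assumes Hive-style `name=value` folder segments; the function names, the example path and the row layout are hypothetical and are not taken from the CarbonData code or the PR.

```python
from urllib.parse import unquote


def partition_values_from_path(path):
    """Parse Hive-style 'name=value' folder segments into a dict.

    Assumption: partition folders look like 'country=IN/year=2020'.
    Values come back as strings; a real reader would cast them to the
    column's data type.
    """
    values = {}
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = unquote(value)
    return values


def fill_partition_columns(row, path, projected):
    """Return a copy of `row` with the projected partition columns added."""
    parts = partition_values_from_path(path)
    filled = dict(row)
    for col in projected:
        filled[col] = parts[col]
    return filled


# Hypothetical example: the data file holds only the non-partition columns.
stored_row = {"id": 1, "amount": 9.5}
path = "store/sales/country=IN/year=2020/part-0.carbondata"
row = fill_partition_columns(stored_row, path, ["country", "year"])
```

Only the partition columns that the query projects are filled, so a query that does not select them pays nothing extra.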
Apache Iceberg has a more mature concept called *hidden partitioning*, in which it also maintains the relationship between columns and supports dynamic rollup of partitions based on the query. You can take a look at it (https://iceberg.apache.org/partitioning/).

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju <[email protected]> wrote:

> Dear Community,
>
> This mail is regarding partition optimization.
>
> *Current behaviour:* Currently, partition column information is stored in
> the data files after load/insert. When we query partition data, we fetch
> it from the data files and fill the row.
>
> *Proposed optimization:* The idea of this enhancement is to exclude
> partition column information while loading/inserting [writing], so the
> data files do not contain any partition column information. When we query
> partition data [readers], we fill in the partition information with the
> help of the projected partition columns [passed to BlockExecutionInfo and
> read from it] and the blockId [which has the partition column name and
> value], then fill the row and return it.
>
> *Benefits*:
> 1) Query performance should be faster.
> 2) Store size should be smaller compared to the old behavior.
>
> Please have a look: a *WIP PR [#1]* has been raised for this, and we are
> currently working on the CI failures.
>
> #1 https://github.com/apache/carbondata/pull/3695/
>
> Please provide your valuable inputs and suggestions. Thank you in advance!
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
>
