Re: [Discussion] Partition Optimization

Ajantha Bhat Wed, 28 Oct 2020 23:03:04 -0700

+1,

Not keeping the partition values as a column (as the folder name already
has them) is a great way to reduce the store size.
We might also have to handle compatibility and support refresh table.
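
To make the idea concrete, here is a minimal Python sketch of how a reader could fill partition columns back into rows from the folder name alone. It is only an illustration, not CarbonData code: the function names, the Hive-style `key=value` folder layout, and the example path are all assumptions.

```python
from urllib.parse import unquote


def partition_values_from_path(path):
    """Collect key=value folder segments from a path (assumed Hive-style layout)."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            # Folder names may be URL-encoded, so decode the value.
            values[key] = unquote(value)
    return values


def fill_partition_columns(rows, path):
    """Return rows with the partition columns added from the path."""
    parts = partition_values_from_path(path)
    return [{**row, **parts} for row in rows]


# Rows as stored in the data file, without the partition columns.
rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
# Hypothetical file location; the partition values live only in the folders.
path = "store/sales/country=US/year=2020/part-0-000.carbondata"
filled = fill_partition_columns(rows, path)
```

Note that values recovered this way are strings, so a real reader would still have to convert them to the column's data type. Old stores that keep the partition column in the data files would also need a separate read path, which is the compatibility concern above.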


Apache Iceberg has a more mature concept called *hidden partitioning*, where
it also maintains the relationship between columns and supports dynamic
rollup of partitions based on the query. You can analyze it here:
https://iceberg.apache.org/partitioning/
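
As a rough illustration of the concept (not Iceberg's actual API; the function names here are made up), the partition value is derived from a regular column through a transform, so a filter on that column can be turned into a partition filter without the user ever naming a partition column:

```python
from datetime import date, datetime


def day_transform(ts):
    """Derive the partition value (a day) from a timestamp column value."""
    return ts.date()


def prune_partitions(partitions, ts_from, ts_to):
    """Keep only the day partitions that can contain rows in [ts_from, ts_to]."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [p for p in partitions if lo <= p <= hi]


# Day partitions that exist in a hypothetical table.
partitions = [date(2020, 10, 27), date(2020, 10, 28), date(2020, 10, 29)]
# The query filters on the timestamp column only.
selected = prune_partitions(
    partitions,
    datetime(2020, 10, 28, 1, 0),
    datetime(2020, 10, 28, 23, 0),
)
```

Here only the 2020-10-28 partition is selected, even though the query mentions just the timestamp range.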

Thanks,
Ajantha

On Thu, Oct 15, 2020 at 2:22 AM Mahesh Raju Somalaraju <
[email protected]> wrote:

> Dear Community,
>
> This mail is regarding partition optimization.
>
> *Current behaviour:* Currently, partition column information is stored in
> the data files after load/insert. When we query partition data, we
> fetch it from the data files and fill the row.
>
> *Proposed optimization:* The idea of this enhancement is to exclude
> partition column information while loading/inserting [writing], so the data
> files do not contain any partition column information. When we query
> partition data, the readers fill in the partition information using the
> projected partition columns [passed to BlockExecutionInfo and read from it]
> and the blockId [which has the partition column name and value], then fill
> the row and return it.
>
> *Benefits*:
> 1) Queries should be faster.
> 2) Store size should be smaller compared to the old behaviour.
>
> Please have a look: a *WIP PR[#1]* has been raised for this, and we are
> currently working on the CI failures.
>
> #1 https://github.com/apache/carbondata/pull/3695/
>
> Please provide your valuable inputs and suggestions. Thank you in advance!
>
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
>
