Thanks for the feedback, Jacky. As of now we have min/max at each block and blocklet level, and while loading the metadata cache we compute the task-level min/max. Segment-level min/max is not considered at present, but this solution can certainly be enhanced to take it into account.
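To illustrate the rollup described above, here is a minimal sketch of how per-blocklet min/max statistics could be aggregated into a task-level min/max while loading the metadata cache. The function name and data layout are hypothetical, not actual CarbonData code:

```python
# Hypothetical sketch: rolling per-blocklet min/max up to the task level.
# blocklet_stats: list of (min_values, max_values) tuples, one per blocklet,
# each a list with one entry per cached column.

def aggregate_min_max(blocklet_stats):
    task_min = list(blocklet_stats[0][0])
    task_max = list(blocklet_stats[0][1])
    for mins, maxs in blocklet_stats[1:]:
        # Task-level min is the column-wise minimum of all blocklet mins,
        # and likewise for the max.
        task_min = [min(a, b) for a, b in zip(task_min, mins)]
        task_max = [max(a, b) for a, b in zip(task_max, maxs)]
    return task_min, task_max

# Two blocklets, two cached columns each:
# aggregate_min_max([([1, 5], [10, 9]), ([0, 7], [3, 12])])
# -> ([0, 5], [10, 12])
```

The same column-wise fold would extend naturally to a segment-level min/max, which is what the segment-level enhancement would compute.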
We can discuss this further in detail and decide whether to consider it now or enhance it in the near future.

Regards
Manish Gupta

On Fri, Jun 22, 2018 at 8:34 PM, Jacky Li <jacky.li...@qq.com> wrote:

> Hi Manish,
>
> +1 for solution 1 for the next Carbon version. Solution 2 should also be
> considered, but for a future version after the next one.
>
> In my previous observation, in many scenarios users will filter on a time
> range, and since Carbon's segments correspond to incremental loads they
> are normally related to time. So if we can have min/max for sort_columns
> at the segment level, I think it will further help keep the driver index
> minimal. Will you also consider this?
>
> Regards,
> Jacky
>
> > On 21 Jun 2018, at 5:24 PM, manish gupta <tomanishgupt...@gmail.com> wrote:
> >
> > Hi Dev,
> >
> > The current implementation of Blocklet dataMap caching in the driver
> > caches the min and max values of all the columns in the schema by
> > default.
> >
> > The problem with this implementation is that as the number of loads
> > increases, the memory required to hold the min and max values also
> > increases considerably. We know that in most scenarios there is a single
> > driver, and the memory configured for the driver is small compared to
> > that of the executors. With a continuous increase in memory requirement,
> > the driver can even go out of memory, which makes the situation worse.
> >
> > *Proposed solution to the above problem:*
> >
> > CarbonData uses min and max values for blocklet-level pruning. The user
> > may not have filters on all the columns specified in the schema; a query
> > may apply filters on only a few of them.
> >
> > 1. Provide the user an option to cache the min and max values of only
> > the required columns. Caching only the required columns can optimize the
> > blocklet dataMap memory usage and solve the driver memory problem to a
> > great extent.
> >
> > 2. Use an external storage/DB to cache the min and max values. We can
> > implement a solution that creates a table in the external DB and stores
> > the min and max values of all the columns in that table. This will not
> > use any driver memory, so driver memory usage will be optimized further
> > compared to solution 1.
> >
> > *Solution 1* will not have any performance impact, as the user will
> > cache the required filter columns and query execution will have no
> > external dependency.
> > *Solution 2* will degrade query performance, as it involves querying
> > the external DB for the min and max values required for blocklet
> > pruning.
> >
> > *So from my point of view we should go with solution 1 and, in the near
> > future, propose a design for solution 2. The user can then have an
> > option to select between the two.* Kindly share your suggestions.
> >
> > Regards
> > Manish Gupta
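To make the trade-off in solution 1 concrete, here is a small sketch of blocklet pruning that only has min/max cached for the columns the user opted in to; the column names, dict layout, and function are hypothetical illustrations, not CarbonData APIs:

```python
# Illustrative sketch of solution 1: the dataMap caches min/max only for
# user-selected columns, and pruning falls back to "keep the blocklet"
# when a filter column has no cached stats.

def prune_blocklets(blocklets, column, value):
    """Keep blocklets whose [min, max] range for `column` may contain value.
    Blocklets without cached stats for the column cannot be pruned."""
    survivors = []
    for b in blocklets:
        stats = b["minmax"].get(column)
        if stats is None:
            # Column not cached -> must keep the blocklet (no false pruning).
            survivors.append(b)
        elif stats[0] <= value <= stats[1]:
            survivors.append(b)
    return survivors

# Only "event_time" was configured for caching, so the dataMap holds one
# (min, max) pair per blocklet instead of one per schema column.
blocklets = [
    {"id": 0, "minmax": {"event_time": (100, 200)}},
    {"id": 1, "minmax": {"event_time": (300, 400)}},
]
hits = prune_blocklets(blocklets, "event_time", 150)  # keeps blocklet 0 only
```

Filtering on an uncached column simply keeps every blocklet, so correctness is preserved; the cost is reduced pruning, which is the trade the user accepts in exchange for a smaller driver cache.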