Hi Manish,
Thanks for proposing solutions to the driver memory problem.

+1 for solution 1, but it may not be the complete solution. We should also
have solution 2 to solve the driver memory issue completely. I think we
should have solution 2 in the very near future as well.

I have a few doubts and suggestions related to solution 1.
1. What if a query filters on non-cached columns? Will the driver start
reading min/max values from disk?
2. Are we planning to cache blocklet-level or block-level information on
the driver side for cached columns?
3. What is the impact if we automatically choose the cached columns from
user queries instead of letting the user configure them?
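
To make question 1 concrete, here is a rough sketch of one possible
behaviour. It assumes that a filter on a non-cached column simply skips
pruning and keeps every blocklet; the alternative would be reading min/max
from disk. All names here (BlockletEntry, prune) are hypothetical and are
not CarbonData APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BlockletEntry:
    """Hypothetical driver-side cache entry for one blocklet."""
    blocklet_id: str
    # Column name -> (min, max). Only cached columns appear here.
    min_max: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def prune(entries: List[BlockletEntry], column: str, value: int) -> List[str]:
    """Return ids of blocklets that may contain rows where column == value."""
    selected = []
    for entry in entries:
        bounds = entry.min_max.get(column)
        if bounds is None:
            # Assumption: no cached min/max for this column, so the
            # blocklet cannot be pruned and must be kept for scanning.
            selected.append(entry.blocklet_id)
        elif bounds[0] <= value <= bounds[1]:
            # The value falls inside the cached range, so keep the blocklet.
            selected.append(entry.blocklet_id)
    return selected


# Only the "id" column is cached in this example.
entries = [
    BlockletEntry("b0", {"id": (1, 100)}),
    BlockletEntry("b1", {"id": (101, 200)}),
]
```

With this behaviour, a filter on "id" prunes as usual, while a filter on
any other column falls back to scanning every blocklet.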

Regards,
Ravindra.

On Thu, 21 Jun 2018 at 14:54, manish gupta <tomanishgupt...@gmail.com>
wrote:

> Hi Dev,
>
> The current implementation of Blocklet dataMap caching in the driver
> caches the min and max values of all the columns in the schema by default.
>
> The problem with this implementation is that as the number of loads
> increases, the memory required to hold the min and max values also
> increases considerably. In most scenarios there is a single driver, and
> the memory configured for the driver is less than that of an executor.
> With the continuous increase in memory requirement, the driver can even
> run out of memory, which makes the situation even worse.
>
> *Proposed Solution to solve the above problem:*
>
> CarbonData uses min and max values for blocklet-level pruning. A user
> does not necessarily filter on all the columns specified in the schema;
> often only a few columns have filters applied to them in queries.
>
> 1. Provide the user an option to cache the min and max values of only the
> required columns. Caching only the required columns can reduce the
> blocklet dataMap memory usage and solve the driver memory problem to a
> large extent.
>
> 2. Use an external storage/DB to cache min and max values. We can also
> implement a solution that creates a table in the external DB and stores
> the min and max values for all the columns in that table. This will not
> use any driver memory, so driver memory usage will be reduced further
> compared to solution 1.
>
> *Solution 1* will not have any performance impact, as the user will cache
> the required filter columns, and it will not have any external dependency
> for query execution.
> *Solution 2* will degrade query performance, as it will involve querying
> the external DB for the min and max values required for blocklet pruning.
>
> *So from my point of view we should go with solution 1 and in the near
> future propose a design for solution 2. The user can then choose between
> the two options*. Kindly share your suggestions.
>
> Regards
> Manish Gupta
>


-- 
Thanks & Regards,
Ravi
