Thanks for the feedback, Jacky. As of now we have min/max at each block and blocklet level, and while loading the metadata cache we compute the task-level min/max. Segment-level min/max is not considered at present, but this solution can certainly be enhanced to take it into account.
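To illustrate the rollup described above, here is a minimal sketch of how per-blocklet min/max statistics could be aggregated into a task-level min/max while loading the metadata cache. The function name and data layout are hypothetical, not actual CarbonData code:

```python
# Hypothetical sketch: rolling per-blocklet min/max up to the task level.
# blocklet_stats: list of (min_values, max_values) tuples, one per blocklet,
# each a list with one entry per cached column.

def aggregate_min_max(blocklet_stats):
    task_min = list(blocklet_stats[0][0])
    task_max = list(blocklet_stats[0][1])
    for mins, maxs in blocklet_stats[1:]:
        # Task-level min is the column-wise minimum of all blocklet mins,
        # and likewise for the max.
        task_min = [min(a, b) for a, b in zip(task_min, mins)]
        task_max = [max(a, b) for a, b in zip(task_max, maxs)]
    return task_min, task_max

# Two blocklets, two cached columns each:
# aggregate_min_max([([1, 5], [10, 9]), ([0, 7], [3, 12])])
# -> ([0, 5], [10, 12])
```

The same column-wise fold would extend naturally to a segment-level min/max, which is what the segment-level enhancement would compute.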
We can discuss this further in detail and decide whether to consider it now or enhance it in the near future.

Regards
Manish Gupta

On Fri, Jun 22, 2018 at 8:34 PM, Jacky Li <jacky.li...@qq.com> wrote:

> Hi Manish,
>
> +1 for solution 1 for the next Carbon version. Solution 2 should also be
> considered, but for a future version after the next one.
>
> In my previous observation, in many scenarios users will filter on a time
> range, and since Carbon's segments correspond to incremental loads they
> are normally related to time. So if we can have min/max for sort_columns
> at the segment level, I think it will further help keep the driver index
> minimal. Will you also consider this?
>
> Regards,
> Jacky
>
> > On 21 Jun 2018, at 5:24 PM, manish gupta <tomanishgupt...@gmail.com> wrote:
> >
> > Hi Dev,
> >
> > The current implementation of Blocklet dataMap caching in the driver
> > caches the min and max values of all the columns in the schema by
> > default.
> >
> > The problem with this implementation is that as the number of loads
> > increases, the memory required to hold the min and max values also
> > increases considerably. We know that in most scenarios there is a single
> > driver, and the memory configured for the driver is small compared to
> > that of the executors. With a continuous increase in memory requirement,
> > the driver can even go out of memory, which makes the situation worse.
> >
> > *Proposed solution to the above problem:*
> >
> > CarbonData uses min and max values for blocklet-level pruning. The user
> > may not have filters on all the columns specified in the schema; a query
> > may apply filters on only a few of them.
> >
> > 1. Provide the user an option to cache the min and max values of only
> > the required columns. Caching only the required columns can optimize the
> > blocklet dataMap memory usage and solve the driver memory problem to a
> > great extent.
> >
> > 2. Use an external storage/DB to cache the min and max values. We can
> > implement a solution that creates a table in the external DB and stores
> > the min and max values of all the columns in that table. This will not
> > use any driver memory, so driver memory usage will be optimized further
> > compared to solution 1.
> >
> > *Solution 1* will not have any performance impact, as the user will
> > cache the required filter columns and query execution will have no
> > external dependency.
> > *Solution 2* will degrade query performance, as it involves querying
> > the external DB for the min and max values required for blocklet
> > pruning.
> >
> > *So from my point of view we should go with solution 1 and, in the near
> > future, propose a design for solution 2. The user can then have an
> > option to select between the two.* Kindly share your suggestions.
> >
> > Regards
> > Manish Gupta
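To make the trade-off in solution 1 concrete, here is a small sketch of blocklet pruning that only has min/max cached for the columns the user opted in to; the column names, dict layout, and function are hypothetical illustrations, not CarbonData APIs:

```python
# Illustrative sketch of solution 1: the dataMap caches min/max only for
# user-selected columns, and pruning falls back to "keep the blocklet"
# when a filter column has no cached stats.

def prune_blocklets(blocklets, column, value):
    """Keep blocklets whose [min, max] range for `column` may contain value.
    Blocklets without cached stats for the column cannot be pruned."""
    survivors = []
    for b in blocklets:
        stats = b["minmax"].get(column)
        if stats is None:
            # Column not cached -> must keep the blocklet (no false pruning).
            survivors.append(b)
        elif stats[0] <= value <= stats[1]:
            survivors.append(b)
    return survivors

# Only "event_time" was configured for caching, so the dataMap holds one
# (min, max) pair per blocklet instead of one per schema column.
blocklets = [
    {"id": 0, "minmax": {"event_time": (100, 200)}},
    {"id": 1, "minmax": {"event_time": (300, 400)}},
]
hits = prune_blocklets(blocklets, "event_time", 150)  # keeps blocklet 0 only
```

Filtering on an uncached column simply keeps every blocklet, so correctness is preserved; the cost is reduced pruning, which is the trade the user accepts in exchange for a smaller driver cache.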