Re: Improving Non-dictionary storage & performance.

bill.zhou Tue, 07 Mar 2017 09:04:12 -0800

Hi Jacky,
    I think this is not easy for the user to control when CarbonData is
running online. For one table, two different loads may have different
cardinality for the same column, but the user cannot specify different
dictionary columns for the same table.


Regards


Jacky Li wrote
> Hi Ravindra,
> 
> Another suggestion: to avoid causing trouble for the user while loading
> in single-pass mode, if the number of dictionary keys generated for a
> column exceeds the configured value, the loading process should stop
> and log the error, explicitly reporting the cardinality of all columns.
> That way the user knows why the data load failed.
> How about this idea?
> 
> Regards,
> Jacky
> 
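A minimal sketch of the check Jacky describes, assuming an in-memory dictionary per column and a single threshold for all columns. The names `single_pass_encode`, `CardinalityExceeded` and `max_cardinality` are hypothetical and are not CarbonData APIs.

```python
class CardinalityExceeded(Exception):
    """Raised when a dictionary column grows past the configured limit."""


def single_pass_encode(rows, dict_columns, max_cardinality):
    """Assign dictionary keys row by row; abort once any column exceeds the limit."""
    dictionaries = {col: {} for col in dict_columns}
    encoded = []
    for row in rows:
        out = dict(row)
        for col in dict_columns:
            mapping = dictionaries[col]
            # Surrogate keys start at 1; a new distinct value gets the next key.
            out[col] = mapping.setdefault(row[col], len(mapping) + 1)
            if len(mapping) > max_cardinality:
                # Report the cardinality seen so far for every dictionary column.
                report = {c: len(m) for c, m in dictionaries.items()}
                raise CardinalityExceeded(
                    f"column {col!r} exceeded {max_cardinality}; "
                    f"cardinalities so far: {report}"
                )
        encoded.append(out)
    return encoded, dictionaries
```

The error message carries the cardinality of every dictionary column, not only the one that failed, so the user can adjust the load options in one step.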
>> On 3 March 2017, at 1:26 AM, Ravindra Pesala <ravi.pesala@> wrote:
>> 
>> Hi Likun,
>> 
>> Yes, Likun, we had better keep dictionary as the default until we optimize
>> no-dictionary columns.
>> As you mentioned, we can suggest 2-pass for the first load, and subsequent
>> loads will use single-pass to improve performance.
>> 
>> Regards,
>> Ravindra.
>> 
>> On 2 March 2017 at 06:48, Jacky Li <jacky.likun@> wrote:
>> 
>>> Hi Ravindra & Vishal,
>>> 
>>> Yes, I think this work needs to be done before switching to no-dictionary
>>> as the default. So as of now, we should use dictionary as the default.
>>> I think we can suggest that users load as follows:
>>> 1. First load: use 2-pass mode. The first scan should discover the
>>> cardinality and check it against the user-specified option. We should
>>> define rules to pass or fail the validation, and finalize the load
>>> options for subsequent loads.
>>> 2. Subsequent loads: use single-pass mode with the options defined by
>>> the first load.
>>> 
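The two-step flow above could look roughly like the sketch below. The validation rule is not defined in this thread, so a plain per-column threshold is assumed here; `discover_cardinality`, `finalize_load_options` and the returned option keys are hypothetical names.

```python
def discover_cardinality(rows, columns):
    """First scan of a 2-pass load: count distinct values per column."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            seen[col].add(row[col])
    return {col: len(values) for col, values in seen.items()}


def finalize_load_options(cardinality, requested_dict_columns, max_cardinality):
    """Validate the user's dictionary columns and fix the options for later loads."""
    # Assumed rule: a dictionary column fails validation above the threshold.
    rejected = [c for c in requested_dict_columns if cardinality[c] > max_cardinality]
    if rejected:
        raise ValueError(
            f"cardinality too high for dictionary encoding: {rejected}; "
            f"discovered: {cardinality}"
        )
    # Subsequent loads reuse these options in single-pass mode.
    return {"dictionary_columns": list(requested_dict_columns), "mode": "single-pass"}
```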
>>> What is your idea?
>>> 
>>> Regards,
>>> Jacky
>>> 
>>>> On 1 March 2017, at 11:31 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>>> 
>>>> Hi Vishal,
>>>> 
>>>> You are right, that's why we can do no-dictionary only for the String
>>>> datatype. Please look at my first point: we can always use direct
>>>> dictionary for possible data types like short, int, long, double &
>>>> float for sort_columns.
>>>> 
>>>> Regards,
>>>> Ravindra.
>>>> 
>>>> On 1 March 2017 at 18:18, Kumar Vishal <kumarvishal1802@> wrote:
>>>> 
>>>>> Hi Ravi,
>>>>> Sorting of data for no-dictionary columns should be based on the data
>>>>> type, and the same applies to filters. Please add this point.
>>>>> 
>>>>> -Regards
>>>>> Kumar Vishal
>>>>> 
>>>>> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <ravi.pesala@> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> In order to make non-dictionary column storage and performance more
>>>>>> efficient, I am suggesting the following improvements.
>>>>>> 
>>>>>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>>>>>> Right now only date and timestamp are direct dictionary columns. We can
>>>>>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
>>>>>> columns are included in SORT_COLUMNS.
>>>>>> 
>>>>>> 2. Consider delta/value compression while storing direct dictionary
>>>>>> values.
>>>>>> Right now it always uses the INT datatype to store direct dictionary
>>>>>> values, so we can consider value/delta compression to compact the
>>>>>> storage.
>>>>>> 
>>>>>> 3. Use a separator instead of LV format to store String values in
>>>>>> no-dictionary format.
>>>>>> Currently String datatypes for non-dictionary columns are stored in
>>>>>> LV (length-value) format, where we always use a Short (2 bytes) as the
>>>>>> length. To keep storage compact we can use a separator (a 0 byte),
>>>>>> which takes just a single byte. While reading, we can traverse the
>>>>>> data and get the offsets like we are doing now.
>>>>>> 
>>>>>> 4. Add range filters for no-dictionary columns.
>>>>>> Currently range filters like greater-than/less-than filters are not
>>>>>> implemented for no-dictionary columns. We should implement them to
>>>>>> avoid row-level filtering and improve performance.
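To illustrate point 2, here is a rough sketch of delta compression for integer values: store the minimum once, then store each value as an offset from it in the narrowest unsigned width that fits. It assumes a non-empty list of integers whose spread fits in 8 bytes; the function names and the little-endian layout are illustrative choices, not CarbonData's actual format.

```python
import struct

# struct format codes for unsigned 1-, 2-, 4- and 8-byte integers
_WIDTHS = [(1, "B"), (2, "H"), (4, "I"), (8, "Q")]


def delta_encode(values):
    """Return (base, width, payload): offsets from min(values) in the narrowest width."""
    base = min(values)
    max_delta = max(values) - base
    # Pick the first width whose range covers the largest offset.
    width, code = next((w, c) for w, c in _WIDTHS if max_delta < (1 << (8 * w)))
    payload = struct.pack(f"<{len(values)}{code}", *(v - base for v in values))
    return base, width, payload


def delta_decode(base, width, payload):
    """Rebuild the original values from the base and the packed offsets."""
    code = dict(_WIDTHS)[width]
    count = len(payload) // width
    return [base + d for d in struct.unpack(f"<{count}{code}", payload)]
```

For values that cluster closely, such as dates within one year, each value then takes 1 or 2 bytes instead of the fixed 4 bytes of an INT.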
>>>>>> 
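For point 3, the two layouts can be compared with a small sketch. It assumes UTF-8 strings and a big-endian 2-byte length for the LV form; both are assumptions for illustration. A separator layout only works if no value contains the 0 byte, so the encoder below rejects such values.

```python
import struct


def encode_lv(strings):
    """Length-value layout: 2-byte length prefix, then the UTF-8 bytes."""
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def encode_separated(strings):
    """Separator layout: UTF-8 bytes followed by a single 0 byte."""
    out = bytearray()
    for s in strings:
        data = s.encode("utf-8")
        if b"\x00" in data:
            raise ValueError("value contains the separator byte")
        out += data + b"\x00"
    return bytes(out)


def decode_separated(buf):
    """Walk the buffer once, cutting a value at every separator."""
    values, start = [], 0
    for pos, byte in enumerate(buf):
        if byte == 0:
            values.append(buf[start:pos].decode("utf-8"))
            start = pos + 1
    return values
```

The separator form saves one byte per value compared with the 2-byte length prefix, at the cost of scanning the data to find offsets.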
>>>>>> Regards,
>>>>>> Ravindra.
>>>>>> 
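For point 4, one possible shape of a greater-than filter that avoids checking every row is sketched below. It assumes the no-dictionary column is a sort column, so the values inside a block are already sorted; `rows_greater_than` is a hypothetical helper, not an existing CarbonData function.

```python
from bisect import bisect_right


def rows_greater_than(sorted_values, threshold):
    """Return the row-index range whose values are > threshold in a sorted block."""
    if not sorted_values or sorted_values[-1] <= threshold:
        return range(0)  # the block maximum rules out the whole block
    if sorted_values[0] > threshold:
        return range(len(sorted_values))  # the block minimum admits every row
    # Binary search for the first value strictly greater than the threshold.
    first = bisect_right(sorted_values, threshold)
    return range(first, len(sorted_values))
```

The min/max checks let a whole block be skipped or accepted without touching its rows, and the binary search handles the remaining case in logarithmic time.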
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks & Regards,
>>>> Ravi
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Thanks & Regards,
>> Ravi





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-performance-tp8146p8402.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.
