Re: [DISCUSS] Apache CarbonData podling graduation

2017-03-02 Thread Bertrand Delacretaz
On Thu, Mar 2, 2017 at 7:49 AM, Jean-Baptiste Onofré  wrote:
> ...To prepare this discussion, I prepared a self-assessment against the
> Maturity Model:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=68714623
...

Thanks! This helps build confidence about graduating CarbonData, I'm
+1 for that.

-Bertrand


[jira] [Created] (CARBONDATA-743) Remove the abundant class CarbonFilters.scala

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-743:
--

 Summary: Remove the abundant class CarbonFilters.scala
 Key: CARBONDATA-743
 URL: https://issues.apache.org/jira/browse/CARBONDATA-743
 Project: CarbonData
  Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial


Remove the redundant class CarbonFilters.scala from the spark2 package.

Right now there are two classes named CarbonFilters in CarbonData.
1. Delete the CarbonFilters Scala file from the spark-common package.
2. Move the CarbonFilters Scala file from the spark2 package to the spark-common package.
 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-739) Avoid creating multiple instances of DirectDictionary in DictionaryBasedResultCollector

2017-03-02 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created CARBONDATA-739:
--

 Summary: Avoid creating multiple instances of DirectDictionary in 
DictionaryBasedResultCollector
 Key: CARBONDATA-739
 URL: https://issues.apache.org/jira/browse/CARBONDATA-739
 Project: CarbonData
  Issue Type: Bug
  Components: core
Reporter: Ravindra Pesala
Priority: Minor


Avoid creating multiple instances of DirectDictionary in 
DictionaryBasedResultCollector.

For every row, a direct dictionary is created inside the
DictionaryBasedResultCollector.collectData method.

Please create a single instance per column and reuse it.
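
A minimal sketch of the requested pattern, written in Python for brevity (the real collector is Java). The class and method names below are hypothetical stand-ins and not CarbonData's actual API; the point is only that generators are built once per column and reused for every row:

```python
class DirectDictionaryGenerator:
    """Hypothetical stand-in for a per-datatype direct dictionary generator."""

    instances_created = 0  # counts constructions so reuse can be observed

    def __init__(self, data_type):
        DirectDictionaryGenerator.instances_created += 1
        self.data_type = data_type

    def value_from_surrogate(self, key):
        # Placeholder decode: a real generator would map the surrogate key
        # back to a date or timestamp value.
        return key


class ResultCollector:
    """Hypothetical collector that decodes direct-dictionary columns."""

    def __init__(self, column_types):
        # One generator per direct-dictionary column, created up front.
        self.generators = [DirectDictionaryGenerator(t) for t in column_types]

    def collect_data(self, rows):
        # Reuse the cached generators instead of constructing one per row.
        return [
            [gen.value_from_surrogate(key) for gen, key in zip(self.generators, row)]
            for row in rows
        ]


collector = ResultCollector(["TIMESTAMP", "DATE"])
decoded = collector.collect_data([[1, 2], [3, 4], [5, 6]])
```

With two columns and three rows, only two generator instances are created, regardless of the row count.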





[GitHub] incubator-carbondata-site issue #18: Removed Redundant and unused files and ...

2017-03-02 Thread chenliang613
Github user chenliang613 commented on the issue:

https://github.com/apache/incubator-carbondata-site/pull/18
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Improving Non-dictionary storage & performance.

2017-03-02 Thread Jacky Li
Hi Ravindra,

Another suggestion: to avoid trouble for users during loading, in single-pass
mode, if the number of dictionary keys generated for a column exceeds the
configured value, the loading process should stop and log an error that
explicitly reports the cardinality of all columns.
That way, the user knows what caused the data load to fail.
What do you think of this idea?
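
A rough sketch of the check described above, in Python. The function name, the exception type, and the `max_cardinality` parameter are illustrative assumptions, not existing CarbonData options; the report lists the cardinality observed so far for every column at the moment the limit is hit:

```python
class CardinalityExceededError(Exception):
    """Raised when a dictionary column exceeds the configured cardinality."""


def check_cardinality(rows, columns, max_cardinality):
    # Track the distinct values seen per dictionary column during the load.
    distinct = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            distinct[c].add(row[c])
            if len(distinct[c]) > max_cardinality:
                # Stop immediately and report the cardinality of all columns.
                report = {name: len(values) for name, values in distinct.items()}
                raise CardinalityExceededError(
                    f"column {c!r} exceeded {max_cardinality} distinct values; "
                    f"cardinality seen so far: {report}"
                )
    return {name: len(values) for name, values in distinct.items()}


rows = [
    {"country": "IN", "id": 1},
    {"country": "IN", "id": 2},
    {"country": "CN", "id": 3},
]
summary = check_cardinality(rows, ["country", "id"], max_cardinality=3)
```

Lowering `max_cardinality` to 2 for the same rows makes the check fail on the `id` column, and the error message names that column together with the counts for all columns.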

Regards,
Jacky

> On 3 March 2017, at 1:26 AM, Ravindra Pesala  wrote:
> 
> Hi Likun,
> 
> Yes, Likun, we had better keep dictionary as the default until we optimize
> no-dictionary columns.
> As you mentioned, we can suggest 2-pass for the first load, and subsequent
> loads will use single-pass to improve performance.
> 
> Regards,
> Ravindra.
> 
> On 2 March 2017 at 06:48, Jacky Li  wrote:
> 
>> Hi Ravindra & Vishal,
>> 
>> Yes, I think this work needs to be done before switching to no-dictionary as
>> the default. So as of now, we should keep dictionary as the default.
>> I think we can suggest that users load as follows:
>> 1. First load: use 2-pass mode. The first scan should discover the
>> cardinality and check it against the user-specified option. We should define
>> rules to pass or fail the validation, and finalize the load options for
>> subsequent loads.
>> 2. Subsequent loads: use single-pass mode with the options defined
>> by the first load.
>>
>> What do you think?
>> 
>> Regards,
>> Jacky
>> 
>>> On 1 March 2017, at 11:31 PM, Ravindra Pesala  wrote:
>>> 
>>> Hi Vishal,
>>> 
>>> You are right; that's why we can use no-dictionary only for the String
>>> datatype.
>>> Please look at my first point. We can always use direct dictionary for
>>> suitable data types like short, int, long, double & float in
>>> sort_columns.
>>> 
>>> Regards,
>>> Ravindra.
>>> 
>>> On 1 March 2017 at 18:18, Kumar Vishal  wrote:
>>>
>>>> Hi Ravi,
>>>> Sorting of data for no-dictionary columns should be based on data type, and
>>>> the same goes for filters. Please add this point.
>>>>
>>>> -Regards
>>>> Kumar Vishal
>>>>
>>>> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> In order to make non-dictionary column storage and performance more
>>>>> efficient, I am suggesting the following improvements.
>>>>>
>>>>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>>>>> Right now only date and timestamp are direct dictionary columns. We can
>>>>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
>>>>> columns are included in SORT_COLUMNS.
>>>>>
>>>>> 2. Consider delta/value compression while storing direct dictionary
>>>>> values.
>>>>> Right now it always uses the INT datatype to store direct dictionary
>>>>> values, so we can consider value/delta compression to compact the storage.
>>>>>
>>>>> 3. Use a separator instead of LV format to store String values in
>>>>> no-dictionary format.
>>>>> Currently String datatypes for non-dictionary columns are stored in
>>>>> LV (length-value) format, where we always use a Short (2 bytes) as the
>>>>> length. To keep storage compact we can use a separator (a 0 byte), which
>>>>> takes just a single byte. While reading, we can traverse the data
>>>>> and get the offsets like we are doing now.
>>>>>
>>>>> 4. Add range filters for no-dictionary columns.
>>>>> Currently range filters like greater-than/less-than filters are not
>>>>> implemented for no-dictionary columns. We should implement them to avoid
>>>>> row-level filtering and improve performance.
>>>>>
>>>>> Regards,
>>>>> Ravindra.
>>>>>
>>>>
>>> 
>>> 
>>> --
>>> Thanks & Regards,
>>> Ravi
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Thanks & Regards,
> Ravi





Re: Improving Non-dictionary storage & performance.

2017-03-02 Thread Ravindra Pesala
Hi Likun,

Yes, Likun, we had better keep dictionary as the default until we optimize
no-dictionary columns.
As you mentioned, we can suggest 2-pass for the first load, and subsequent
loads will use single-pass to improve performance.

Regards,
Ravindra.

On 2 March 2017 at 06:48, Jacky Li  wrote:

> Hi Ravindra & Vishal,
>
> Yes, I think this work needs to be done before switching to no-dictionary as
> the default. So as of now, we should keep dictionary as the default.
> I think we can suggest that users load as follows:
> 1. First load: use 2-pass mode. The first scan should discover the
> cardinality and check it against the user-specified option. We should define
> rules to pass or fail the validation, and finalize the load options for
> subsequent loads.
> 2. Subsequent loads: use single-pass mode with the options defined
> by the first load.
>
> What do you think?
>
> Regards,
> Jacky
>
> > On 1 March 2017, at 11:31 PM, Ravindra Pesala  wrote:
> >
> > Hi Vishal,
> >
> > You are right; that's why we can use no-dictionary only for the String
> > datatype.
> > Please look at my first point. We can always use direct dictionary for
> > suitable data types like short, int, long, double & float in
> > sort_columns.
> >
> > Regards,
> > Ravindra.
> >
> > On 1 March 2017 at 18:18, Kumar Vishal  wrote:
> >
> >> Hi Ravi,
> >> Sorting of data for no-dictionary columns should be based on data type, and
> >> the same goes for filters. Please add this point.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> In order to make non-dictionary column storage and performance more
> >>> efficient, I am suggesting the following improvements.
> >>>
> >>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
> >>> Right now only date and timestamp are direct dictionary columns. We can
> >>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
> >>> columns are included in SORT_COLUMNS.
> >>>
> >>> 2. Consider delta/value compression while storing direct dictionary
> >>> values.
> >>> Right now it always uses the INT datatype to store direct dictionary
> >>> values, so we can consider value/delta compression to compact the storage.
> >>>
> >>> 3. Use a separator instead of LV format to store String values in
> >>> no-dictionary format.
> >>> Currently String datatypes for non-dictionary columns are stored in
> >>> LV (length-value) format, where we always use a Short (2 bytes) as the
> >>> length. To keep storage compact we can use a separator (a 0 byte), which
> >>> takes just a single byte. While reading, we can traverse the data
> >>> and get the offsets like we are doing now.
> >>>
> >>> 4. Add range filters for no-dictionary columns.
> >>> Currently range filters like greater-than/less-than filters are not
> >>> implemented for no-dictionary columns. We should implement them to avoid
> >>> row-level filtering and improve performance.
> >>>
> >>> Regards,
> >>> Ravindra.
> >>>
> >>
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
>
>
>
>


-- 
Thanks & Regards,
Ravi
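
To illustrate point 3 of the quoted proposal (a separator instead of LV), the Python sketch below compares the two layouts. It is only an illustration under stated assumptions: the 2-byte length is written big-endian, values are assumed never to contain the 0 byte, and none of the function names come from CarbonData itself.

```python
import struct


def encode_lv(values):
    # LV layout: a 2-byte length (big-endian, an assumption) before each value.
    out = bytearray()
    for v in values:
        out += struct.pack(">H", len(v)) + v
    return bytes(out)


def encode_separated(values, sep=b"\x00"):
    # Separator layout: each value is followed by a single 0 byte.
    # This only works if no value contains the separator itself.
    for v in values:
        if sep in v:
            raise ValueError("value contains the separator byte")
    return b"".join(v + sep for v in values)


def offsets_separated(data, sep=0):
    # Traverse the data once and recover (offset, length) for each value.
    offsets = []
    start = 0
    for i, b in enumerate(data):
        if b == sep:
            offsets.append((start, i - start))
            start = i + 1
    return offsets


values = [b"ab", b"", b"cde"]
lv = encode_lv(values)
separated = encode_separated(values)
offsets = offsets_separated(separated)
```

For these three values the LV layout takes 11 bytes and the separator layout takes 8, one byte saved per value. The cost is that values containing the separator byte would need escaping or a different encoding, which the sketch does not attempt.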


Re: [DISCUSS] For the dimension default should be no dictionary

2017-03-02 Thread bill.zhou
Hi all,
Here is my summary of this discussion.
1. To keep CarbonData compatible with older versions, keep
DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the default.
I do not suggest changing these properties to table_dictionary.
2. I suggest keeping the sort_column properties in the same style as the
dictionary ones, so the new properties would be SORT_INCLUDE and
SORT_EXCLUDE, with no sort as the default.

Regards
Bill 


ravipesala wrote
> Hi All,
> 
> In order to make no-dictionary columns the default, we should improve the
> storage and performance for these columns. I have sent another mail to
> discuss the improvement points. Please comment on it.
> 
> Regards,
> Ravindra
> 
> On 1 March 2017 at 10:12, Ravindra Pesala 

> ravi.pesala@

>  wrote:
> 
>> Hi Likun,
>>
>> It would be the same case if we use all non-dictionary columns by default: it
>> will increase the store size and decrease performance, so it also
>> does not encourage more users if performance is poor.
>>
>> If we need to make no-dictionary columns the default, then we should first
>> focus on reducing the store size and improving filter queries on
>> non-dictionary columns. Even memory usage is higher while querying
>> non-dictionary columns.
>>
>> Regards,
>> Ravindra.
>>
>> On 1 March 2017 at 06:00, Jacky Li 

> jacky.likun@

>  wrote:
>>
>>> Yes, I agree with your point. The only concern I have is about loading. I
>>> have seen many users accidentally put a high cardinality column into the
>>> dictionary columns, and then the load failed because it ran out of memory
>>> or was very slow. I guess they just do not know to use DICTIONARY_EXCLUDE
>>> for these columns, or they do not have an easy way to identify the high
>>> cardinality columns. I feel preventing such misuse is important in order
>>> to encourage more users to use CarbonData.
>>>
>>> Any suggestion on solving this issue?
>>>
>>>
>>> Regards,
>>> Likun
>>>
>>>
>>> > On 28 February 2017, at 10:20 PM, Ravindra Pesala 

> ravi.pesala@

>  wrote:
>>> >
>>> > Hi Likun,
>>> >
>>> > You mentioned that if the user does not specify dictionary columns, then by
>>> > default those are chosen as no-dictionary columns.
>>> > But there are many disadvantages, as I mentioned in the above mail, if you
>>> > keep no dictionary as the default. We initially introduced no-dictionary
>>> > columns to handle high cardinality dimensions, but making everything
>>> > no-dictionary by default loses our unique feature compared to Parquet.
>>> > Dictionary columns were introduced not only for aggregation queries but
>>> > also for better compression and better filter queries. Without
>>> > dictionary, the store size will increase a lot.
>>> >
>>> > Regards,
>>> > Ravindra.
>>> >
>>> > On 28 February 2017 at 18:05, Liang Chen 

> chenliang6136@

> 
>>> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> A couple of questions:
>>> >>
>>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>>> >> index" built only for the columns specified in the option (SORT_KEY)?
>>> >>
>>> >> 2) If users don't specify TABLE_DICTIONARY, then no columns use
>>> >> dictionary encoding, and all shuffle operations are based on the fact
>>> >> value. Is my understanding right?
>>> >> 
>>> >> ---
>>> >> If this option is not specified by the user, all columns are encoded
>>> >> without global dictionary support. A normal shuffle on the decoded value
>>> >> will be applied when doing a group by operation.
>>> >>
>>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>> >> suppose "C2" is specified in SORT_KEY but not in
>>> >> TABLE_DICTIONARY. How does the system handle this case?
>>> >> 
>>> >> ---
>>> >> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>>> >> are encoded with an Inverted Index and a Minmax Index.
>>> >>
>>> >> Regards
>>> >> Liang
>>> >>
>>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li 

> jacky.likun@

> :
>>> >>
>>> >>> Yes, first we should simplify the DDL options. I propose the following
>>> >>> options; please check whether they miss any scenario.
>>> >>>
>>> >>> 1. SORT_COLUMNS, or SORT_KEY
>>> >>> This indicates three things:
>>> >>> 1) All columns specified in the option will be used to construct the
>>> >>> Multi-Dimensional Key, and data will be sorted along this key
>>> >>> 2) They will be encoded with an Inverted Index and thus sorted again
>>> >>> within the column chunk in one blocklet
>>> >>> 3) A Minmax index will also be created for these columns
>>> >>>
>>> >>> When to use: This option is designed for accelerating filter queries,
>>> >>> so put all filter columns into this option. The order can be:
>>> >>> 1) From low cardinality to high cardinality, this will make most
>>> >>> compression
>>> >>> and fit for