Re: [DISCUSS] Apache CarbonData podling graduation
On Thu, Mar 2, 2017 at 7:49 AM, Jean-Baptiste Onofré wrote:
> ...To prepare this discussion, I prepared a self-assessment against the
> Maturity Model:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=68714623 ...

Thanks! This helps build confidence about graduating CarbonData, I'm +1 for that.

-Bertrand
[jira] [Created] (CARBONDATA-743) Remove the abundant class CarbonFilters.scala
Ravindra Pesala created CARBONDATA-743:
--

Summary: Remove the abundant class CarbonFilters.scala
Key: CARBONDATA-743
URL: https://issues.apache.org/jira/browse/CARBONDATA-743
Project: CarbonData
Issue Type: Bug
Reporter: Ravindra Pesala
Priority: Trivial

Remove the redundant class CarbonFilters.scala from the spark2 package. Right now there are two classes named CarbonFilters in carbondata.

1. Delete the CarbonFilters scala file from the spark-common package.
2. Move the CarbonFilters scala file from the spark2 package to the spark-common package.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (CARBONDATA-739) Avoid creating multiple instances of DirectDictionary in DictionaryBasedResultCollector
Ravindra Pesala created CARBONDATA-739:
--

Summary: Avoid creating multiple instances of DirectDictionary in DictionaryBasedResultCollector
Key: CARBONDATA-739
URL: https://issues.apache.org/jira/browse/CARBONDATA-739
Project: CarbonData
Issue Type: Bug
Components: core
Reporter: Ravindra Pesala
Priority: Minor

Avoid creating multiple instances of DirectDictionary in DictionaryBasedResultCollector. For every row, a direct dictionary is created inside the DictionaryBasedResultCollector.collectData method. Please create a single instance per column and reuse it.
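The reuse pattern requested in CARBONDATA-739 can be sketched as follows. This is only an illustration in Python, not CarbonData's Java code: `DirectDictionaryGenerator`, `ResultCollector` and `collect_data` are hypothetical stand-ins, and the instance counter exists only to make the effect visible.

```python
class DirectDictionaryGenerator:
    """Hypothetical stand-in for a per-column direct dictionary generator."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self, column_name):
        DirectDictionaryGenerator.instances_created += 1
        self.column_name = column_name

    def decode(self, surrogate):
        # Placeholder: a real generator would map the surrogate key back
        # to a date or timestamp value.
        return surrogate


class ResultCollector:
    """Builds one generator per column up front and reuses it for every row."""

    def __init__(self, column_names):
        self._generators = {
            name: DirectDictionaryGenerator(name) for name in column_names
        }

    def collect_data(self, rows):
        collected = []
        for row in rows:
            # Look up the cached generator instead of constructing a new one per row.
            collected.append(
                {name: self._generators[name].decode(value) for name, value in row.items()}
            )
        return collected


rows = [{"order_date": day, "ship_date": day + 2} for day in range(1000)]
collector = ResultCollector(["order_date", "ship_date"])
result = collector.collect_data(rows)
```

With 1000 rows and two columns, only two generator objects are constructed, instead of one per row and column.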
[GitHub] incubator-carbondata-site issue #18: Removed Redundant and unused files and ...
Github user chenliang613 commented on the issue:

https://github.com/apache/incubator-carbondata-site/pull/18

LGTM
Re: Improving Non-dictionary storage & performance.
Hi Ravindra,

Another suggestion: to avoid creating trouble for users during loading in single-pass mode, if the number of dictionary keys generated for a column exceeds the configured value, the loading process should stop and log an error that explicitly reports the cardinality of all columns. That way the user knows what caused the data load to fail.

How about this idea?

Regards,
Jacky

> On March 3, 2017, at 1:26 AM, Ravindra Pesala wrote:
>
> Hi Likun,
>
> Yes, Likun, we had better keep dictionary as the default until we optimize
> no-dictionary columns.
> As you mentioned, we can suggest 2-pass for the first load, and subsequent
> loads will use single-pass to improve performance.
>
> Regards,
> Ravindra.
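The fail-fast check suggested at the top of this message could look roughly like the sketch below. It is only an illustration under assumptions: `single_pass_load`, `CardinalityExceededError` and `max_cardinality` are hypothetical names, not CarbonData APIs or configuration keys, and the dictionaries are plain in-memory maps.

```python
import logging

logger = logging.getLogger("single_pass_load")


class CardinalityExceededError(Exception):
    """Raised when a dictionary column grows past the configured limit."""

    def __init__(self, column, cardinalities):
        super().__init__(f"column {column!r} exceeded the dictionary limit: {cardinalities}")
        self.column = column
        self.cardinalities = cardinalities


def single_pass_load(rows, dictionary_columns, max_cardinality):
    """Assign dictionary keys while loading; stop once any column is over the limit."""
    dictionaries = {column: {} for column in dictionary_columns}
    for row in rows:
        for column in dictionary_columns:
            dictionary = dictionaries[column]
            value = row[column]
            if value in dictionary:
                continue
            dictionary[value] = len(dictionary) + 1  # next surrogate key
            if len(dictionary) > max_cardinality:
                # Report the cardinality seen so far for every column,
                # not only for the one that failed.
                cardinalities = {name: len(d) for name, d in dictionaries.items()}
                logger.error(
                    "Load stopped: column %s exceeded %d distinct values; "
                    "cardinalities so far: %s",
                    column, max_cardinality, cardinalities,
                )
                raise CardinalityExceededError(column, cardinalities)
    return dictionaries


rows = [{"country": "IN" if i % 2 == 0 else "US", "id": i} for i in range(10)]
try:
    single_pass_load(rows, ["country", "id"], max_cardinality=5)
    failure = None
except CardinalityExceededError as error:
    failure = error
```

Here the `id` column trips the limit on its sixth distinct value, and the error carries the cardinality of both columns so the user can see which ones belong in DICTIONARY_EXCLUDE.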
Re: Improving Non-dictionary storage & performance.
Hi Likun,

Yes, Likun, we had better keep dictionary as the default until we optimize no-dictionary columns. As you mentioned, we can suggest 2-pass for the first load, and subsequent loads will use single-pass to improve performance.

Regards,
Ravindra.

On 2 March 2017 at 06:48, Jacky Li wrote:

> Hi Ravindra & Vishal,
>
> Yes, I think this work needs to be done before switching to no-dictionary as
> the default. So as of now, we should use dictionary as the default.
> I think we can suggest that users load as follows:
> 1. First load: use 2-pass mode. The first scan should discover the
>    cardinality and check it against the user-specified option. We should
>    define rules to pass or fail the validation, and finalize the load
>    options for subsequent loads.
> 2. Subsequent loads: use single-pass mode with the options defined by the
>    first load.
>
> What is your idea?
>
> Regards,
> Jacky
>
> > On March 1, 2017, at 11:31 PM, Ravindra Pesala wrote:
> >
> > Hi Vishal,
> >
> > You are right, that's why we can do no-dictionary only for the String
> > datatype. Please look at my first point. We can always use direct
> > dictionary for possible data types like short, int, long, double & float
> > for sort_columns.
> >
> > Regards,
> > Ravindra.
> >
> > On 1 March 2017 at 18:18, Kumar Vishal wrote:
> >
> >> Hi Ravi,
> >> Sorting of data for no dictionary should be based on data type, and the
> >> same for filters. Please add this point.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala wrote:
> >>
> >>> Hi,
> >>>
> >>> In order to make non-dictionary column storage and performance more
> >>> efficient, I am suggesting the following improvements.
> >>>
> >>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
> >>> Right now only date and timestamp are direct dictionary columns. We can
> >>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these
> >>> columns are included in SORT_COLUMNS.
> >>>
> >>> 2. Consider delta/value compression while storing direct dictionary
> >>> values.
> >>> Right now it always uses the INT datatype to store direct dictionary
> >>> values. So we can consider value/delta compression to compact the
> >>> storage.
> >>>
> >>> 3. Use a separator instead of LV format to store String values in
> >>> no-dictionary format.
> >>> Currently String datatypes for non-dictionary columns are stored in LV
> >>> (length-value) format, where we always use a Short (2 bytes) as the
> >>> length. To keep storage compact we can use a separator (0 byte as
> >>> separator), which takes just a single byte. While reading, we can
> >>> traverse the data and get the offsets like we are doing now.
> >>>
> >>> 4. Add range filters for no-dictionary columns.
> >>> Currently range filters like greater-than/less-than filters are not
> >>> implemented for no-dictionary columns. So we should implement them to
> >>> avoid row-level filtering and improve performance.
> >>>
> >>> Regards,
> >>> Ravindra.
> >>>
> >>
> >
> > --
> > Thanks & Regards,
> > Ravi
>

--
Thanks & Regards,
Ravi
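Points 2 and 3 of the quoted proposal can be illustrated with a small sketch. It is written in Python for brevity and is not CarbonData's storage code: the function names are hypothetical, the separator layout assumes no value contains a 0 byte, and the delta encoder assumes the deltas fit in 8 bytes.

```python
import struct


def encode_lv(values):
    """Length-value layout: a 2-byte big-endian length before each value."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def encode_separated(values):
    """Separator layout: each value is followed by a single 0 byte."""
    out = bytearray()
    for value in values:
        data = value.encode("utf-8")
        if b"\x00" in data:
            # Assumption of this sketch: values never contain the separator.
            raise ValueError("value contains the separator byte")
        out += data + b"\x00"
    return bytes(out)


def scan_offsets(buffer):
    """Traverse the buffer once and return (start, end) for every value."""
    offsets = []
    start = 0
    for position, byte in enumerate(buffer):
        if byte == 0:
            offsets.append((start, position))
            start = position + 1
    return offsets


def delta_encode(values):
    """Store values as offsets from the minimum, in the narrowest byte width."""
    base = min(values)
    deltas = [value - base for value in values]
    largest = max(deltas)
    for width in (1, 2, 4, 8):
        if largest < 1 << (8 * width):
            return base, width, deltas
    raise ValueError("delta does not fit in 8 bytes")


cities = ["beijing", "pune", "", "shenzhen"]
lv = encode_lv(cities)
separated = encode_separated(cities)
decoded = [separated[start:end].decode("utf-8") for start, end in scan_offsets(separated)]
base, width, deltas = delta_encode([20170301, 20170302, 20170315])
```

For these four strings the LV layout takes 27 bytes and the separator layout 23, a saving of one byte per value. The three direct dictionary values fit in one byte each after subtracting the base, compared with four bytes each as INT.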
Re: [DISCUSS] For the dimension default should be no dictionary
Hi All,

Let me summarize this discussion.

1. To keep CarbonData compatible with older versions, keep DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE; the default is no dictionary. I do not suggest changing these properties to table_dictionary.
2. I suggest keeping the sort column properties in the same style as the dictionary ones, so the new properties would be SORT_INCLUDE and SORT_EXCLUDE; the default is no sort.

Regards
Bill

ravipesala wrote
> Hi All,
>
> In order to make no-dictionary columns the default, we should improve the
> storage and performance for these columns. I have sent another mail to
> discuss the improvement points. Please comment on it.
>
> Regards,
> Ravindra
>
> On 1 March 2017 at 10:12, Ravindra Pesala (ravi.pesala@) wrote:
>
>> Hi Likun,
>>
>> It would be the same case if we use all non-dictionary columns by default:
>> it will increase the store size and decrease performance, so it also does
>> not encourage more users if performance is poor.
>>
>> If we need to make no-dictionary columns the default, then we should first
>> focus on reducing the store size and improving filter queries on
>> non-dictionary columns. Even memory usage is higher while querying
>> non-dictionary columns.
>>
>> Regards,
>> Ravindra.
>>
>> On 1 March 2017 at 06:00, Jacky Li (jacky.likun@) wrote:
>>
>>> Yes, I agree with your point. The only concern I have is for loading. I
>>> have seen many users accidentally put a high cardinality column into
>>> the dictionary columns, and then the loading failed because it ran out
>>> of memory or became very slow. I guess they just do not know to use
>>> DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to
>>> identify the high cardinality columns. I feel preventing such misuse is
>>> important in order to encourage more users to use carbondata.
>>>
>>> Any suggestion on solving this issue?
>>>
>>> Regards,
>>> Likun
>>>
>>> > On February 28, 2017, at 10:20 PM, Ravindra Pesala (ravi.pesala@) wrote:
>>> >
>>> > Hi Likun,
>>> >
>>> > You mentioned that if the user does not specify dictionary columns,
>>> > then by default those are chosen as no-dictionary columns.
>>> > But we have many disadvantages, as I mentioned in the above mail, if
>>> > you keep no dictionary as the default. We initially introduced
>>> > no-dictionary columns to handle high cardinality dimensions, but now
>>> > making everything no-dictionary by default loses our unique feature
>>> > compared to Parquet.
>>> > Dictionary columns were introduced not only for aggregation queries
>>> > but for better compression and better filter queries as well. Without
>>> > dictionary, the store size will increase a lot.
>>> >
>>> > Regards,
>>> > Ravindra.
>>> >
>>> > On 28 February 2017 at 18:05, Liang Chen (chenliang6136@) wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> A couple of questions:
>>> >>
>>> >> 1) For the SORT_KEY option: are "MDK index, inverted index, minmax
>>> >> index" built only for the columns specified in the option (SORT_KEY)?
>>> >>
>>> >> 2) If users don't specify TABLE_DICTIONARY, then no columns get
>>> >> dictionary encoding, and all shuffle operations are based on the fact
>>> >> value. Is my understanding right?
>>> >>
>>> >> ---
>>> >> If this option is not specified by the user, all columns are encoded
>>> >> without global dictionary support. Normal shuffle on decoded values
>>> >> will be applied when doing a group by operation.
>>> >>
>>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>> >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY.
>>> >> How does the system handle this case?
>>> >>
>>> >> ---
>>> >> For example, SORT_COLUMNS="C1,C2,C3" means C1, C2, C3 form the MDK
>>> >> and are encoded as Inverted Index and with Minmax Index
>>> >>
>>> >> Regards
>>> >> Liang
>>> >>
>>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li (jacky.likun@):
>>> >>
>>> >>> Yes, first we should simplify the DDL options. I propose the
>>> >>> following options; please check whether they miss any scenario.
>>> >>>
>>> >>> 1. SORT_COLUMNS, or SORT_KEY
>>> >>> This indicates three things:
>>> >>> 1) All columns specified in the option will be used to construct
>>> >>> the Multi-Dimensional Key, and the data will be sorted along this key
>>> >>> 2) They will be encoded as Inverted Index and thus again sorted
>>> >>> within the column chunk in one blocklet
>>> >>> 3) A Minmax index will also be created for these columns
>>> >>>
>>> >>> When to use: this option is designed for accelerating filter
>>> >>> queries, so put all filter columns into this option. The order can be:
>>> >>> 1) From low cardinality to high cardinality, this will make most
>>> >>> compression
>>> >>> and fit for