Thank you, Manish. Is dictionary exclude supported for data types other than String?
https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L706
- Swapnil

On Wed, Jul 19, 2017 at 10:44 PM, manishgupta88 <tomanishgupt...@gmail.com> wrote:
> Hi Swapnil
>
> Please find my answers inline.
>
> 1. What is the use of the *carbon.number.of.cores* property and how is it
> different from Spark's executor cores?
>
> - carbon.number.of.cores is used for reading the footers and headers of
> CarbonData files during query execution. Spark executor cores is a Spark
> property, controlled by Spark for parallelizing tasks. After task
> distribution, each task opens the number of parallel threads specified by
> carbon.number.of.cores to read CarbonData file footers and headers; this
> part is managed by Carbon code.
>
> 2. The documentation says that, by default, all non-numeric columns
> (except complex types) become dimensions and numeric columns become
> measures. How are dimension and measure columns handled differently? What
> are the pros and cons of keeping a column as a dimension vs. a measure?
>
> - Dimensions by default take part in sorting the complete data from left
> to right, and, because the storage is columnar, each dimension is also
> individually sorted. Measures, on the other hand, neither take part in
> sorting the data nor are they individually sorted.
> - Because dimensions are sorted, filter queries on them return results
> faster by performing a binary search.
>
> 3. What is the best way to handle an ID INT column that will be used
> heavily for filtering/aggregations/joins but is not a dimension by
> default? The documentation says to list such numeric columns under
> "dictionary_include" or "dictionary_exclude" in the table definition so
> that the column is treated as a dimension.
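As an aside on the sorted-dimension point in answer 2 above: the speed-up for filters on dimensions comes from sorted storage allowing binary search. A minimal, CarbonData-independent Python sketch (the column values here are made up for illustration) of sorted vs. unsorted filtering:

```python
import bisect

# A dimension-style column: values are kept sorted, so an equality filter
# can locate the contiguous run of matches with two binary searches, O(log n).
sorted_city_ids = [3, 3, 7, 7, 7, 12, 19, 19, 42]

def filter_sorted(column, value):
    lo = bisect.bisect_left(column, value)
    hi = bisect.bisect_right(column, value)
    return list(range(lo, hi))  # positions of matching rows

# A measure-style column: unsorted, so filtering must scan every row, O(n).
measures = [19, 3, 42, 7, 3, 12, 7, 19, 7]

def filter_unsorted(column, value):
    return [i for i, v in enumerate(column) if v == value]

print(filter_sorted(sorted_city_ids, 7))   # one contiguous run of positions
print(filter_unsorted(measures, 7))        # scattered positions, full scan
```

This is only the asymptotic intuition; the actual CarbonData reader works on sorted, encoded column chunks rather than Python lists.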
It is not supported to keep
> non-string data types as "dictionary_exclude" (link:
> https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690).
> Then do we have to enable dictionary encoding for ID INT columns even
> when it is not beneficial to encode them?
>
> -- In the current system, the best way is to add the ID column to
> dictionary include if the cardinality of the column is low, or to
> dictionary exclude if the cardinality of the column is high. Measure
> filter optimization has already been implemented in branch 1.1
> (https://github.com/apache/carbondata/commits/branch-1.1) and will be
> available in the coming releases (1.2 or 1.3).
> For your reference you can go through PR-1124
> (https://github.com/apache/carbondata/pull/1124).
>
> 4. How does the MDK get generated and how can we alter it? Is there any
> API to find out the MDK for a given table?
>
> -- Only dictionary include columns take part in the generation of the
> MDKey. The MDKey is generated based on the cardinality of the columns. It
> is one of the data compression techniques used to reduce storage space in
> CarbonData storage.
> Computation example:
> Number of bytes for each integer value: 4
> Total number of rows: 100000
> Total number of bytes: 100000 * 4
> Cardinality of the column (total number of unique values in the column): 5
> As the cardinality is only 5, and we store only the unique values for a
> dictionary column, 5 unique values require a total of 3 bits of storage.
> But since the minimum storage unit is a byte, we can use 1 byte here to
> store the 5 unique values. So we have reduced the space from 4 bytes to 1
> byte for each primitive integer value. This is the concept of the MDKey.
>
> - You cannot alter the MDKey after table creation. The MDKey is created
> in the order in which you specified the dictionary columns during table
> creation.
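The back-of-the-envelope arithmetic in the MDKey computation example above can be sketched in plain Python. This is only the storage calculation from the mail, not CarbonData's actual key generator:

```python
import math

def dictionary_storage_bytes(cardinality: int) -> int:
    """Bytes needed per dictionary key for a column with `cardinality`
    distinct values, rounded up to whole bytes (the minimum storage unit)."""
    bits = max(1, math.ceil(math.log2(cardinality)))  # bits per key
    return math.ceil(bits / 8)

rows = 100_000
raw_bytes_per_value = 4   # a plain 4-byte integer
cardinality = 5           # distinct values in the column

# 5 distinct values -> 3 bits per key -> rounds up to 1 byte per row.
encoded = dictionary_storage_bytes(cardinality)
print("raw column size:    ", rows * raw_bytes_per_value, "bytes")
print("encoded column size:", rows * encoded, "bytes")
```

So the example's 100000 * 4 bytes shrink to 100000 * 1 bytes once each value is replaced by a 1-byte dictionary key; higher-cardinality columns need proportionally wider keys.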
>
> - For the MDKey generation logic you can check the class
> MultiDimKeyVarLengthGenerator.
>
> Regards
> Manish Gupta
>
> --
> View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/carbon-data-performance-doubts-tp18438p18523.html
> Sent from the Apache CarbonData Dev Mailing List archive at Nabble.com.