Hi Swapnil,

Please find my answers inline.

1. What is the use of the *carbon.number.of.cores* property, and how is it
different from Spark's executor cores?

- carbon.number.of.cores controls the number of threads used for reading
the footers and headers of CarbonData files during query execution. Spark
executor cores is a Spark property, managed by Spark to parallelize tasks.
After task distribution, each task opens as many parallel threads as
specified by carbon.number.of.cores to read the CarbonData file footers and
headers; these threads are managed by the Carbon code.
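As a sketch, the property is set in carbon.properties; the value below is illustrative, not a recommendation:

```
# carbon.properties (illustrative value)
# Number of threads each task opens to read CarbonData file footers/headers
carbon.number.of.cores=4
```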

2. The documentation says that, by default, all non-numeric columns (except
complex types) become dimensions and numeric columns become measures. How
are dimension and measure columns handled differently? What are the pros
and cons of keeping a column as a dimension vs a measure?

- Dimensions by default take part in sorting the complete data from left to
right; in addition, because the storage is columnar, each dimension is
itself sorted. Measures, on the other hand, neither take part in sorting
the data nor are individually sorted.
- Because dimensions are sorted, filter queries on them can return results
faster by performing a binary search.
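To illustrate why sorting helps filter queries, here is a minimal sketch of a binary-search filter over a sorted dimension column (illustrative only; the column data and function are hypothetical, not CarbonData's actual implementation):

```python
import bisect

# A sorted dimension column, as it might be laid out in columnar storage
country = ["CN", "CN", "FR", "FR", "FR", "IN", "IN", "US"]

def filter_rows(sorted_col, value):
    """Return the row ids matching `value` using binary search: O(log n)
    to locate the range, instead of scanning every row."""
    lo = bisect.bisect_left(sorted_col, value)
    hi = bisect.bisect_right(sorted_col, value)
    return list(range(lo, hi))

print(filter_rows(country, "FR"))  # -> [2, 3, 4]
```

An unsorted measure column offers no such shortcut: every value must be scanned to evaluate the same filter.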

3. What is the best way to handle an ID INT column that will be used
heavily for filtering/aggregation/joins but is not a dimension by default?
The documentation says to include such numeric columns in
"dictionary_include" or "dictionary_exclude" in the table definition so
that the column will be considered a dimension. It is not supported to keep
non-string data types as "dictionary_exclude" (link
<https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690>)
Do we then have to enable dictionary encoding for ID INT columns, and is it
beneficial to encode them?

-- In the current system, the best way is to declare the ID column as
dictionary_include if the cardinality of the column is low, or
dictionary_exclude if the cardinality is high. Measure filter optimization
has already been implemented in branch 1.1
(https://github.com/apache/carbondata/commits/branch-1.1) and will be
available in the coming releases (1.2 or 1.3).
For reference, you can go through PR-1124
(https://github.com/apache/carbondata/pull/1124)
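For illustration, declaring an INT column as a dictionary dimension would look something like the sketch below (table and column names are hypothetical; the TBLPROPERTIES syntax follows the CarbonData DDL documentation of this release line):

```sql
CREATE TABLE sales (
  id INT,
  name STRING,
  amount DOUBLE
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_INCLUDE'='id')
```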

4. How does the MDK get generated, and how can we alter it? Is there any
API to find out the MDK for a given table?

-- Only dictionary_include columns take part in the generation of the
MDKey. The MDKey is generated based on the cardinality of the columns. It
is one of the data compression techniques used to reduce storage space in
the CarbonData store.
Computation example:
Number of bytes for each integer value: 4
Total number of rows: 100000
Total number of bytes: 100000 * 4 = 400000
Cardinality of the column (total number of unique values): 5
As the cardinality is only 5 and we store only the unique values for a
dictionary column, each of the 5 unique values can be addressed with a
3-bit key. But since the minimum storage unit is a byte, we use 1 byte per
value here. So we have reduced the space from 4 bytes to 1 byte for each
primitive integer value. This is the concept of the MDKey.
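The arithmetic above can be sketched as follows (an illustrative helper, not Carbon's actual key generator):

```python
import math

def dictionary_key_bytes(cardinality):
    """Bytes needed to store a dictionary key for a column with the given
    cardinality: ceil(log2(cardinality)) bits, rounded up to whole bytes
    (the minimum storage unit)."""
    bits = max(1, math.ceil(math.log2(cardinality)))
    return (bits + 7) // 8

rows = 100_000
raw_bytes = rows * 4                            # 4 bytes per primitive int
encoded_bytes = rows * dictionary_key_bytes(5)  # cardinality 5 -> 3 bits -> 1 byte

print(raw_bytes, encoded_bytes)  # -> 400000 100000
```

Note that the saving shrinks as cardinality grows: a column with more than 65536 unique values already needs 3 bytes per key, which is why high-cardinality columns are better kept as dictionary_exclude.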

- You cannot alter the MDKey after table creation. The MDKey is created in
the order in which you specified the dictionary columns during table
creation.

- For the MDKey generation logic, you can check the class
MultiDimKeyVarLengthGenerator.

Regards
Manish Gupta



--
View this message in context: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/carbon-data-performance-doubts-tp18438p18523.html
Sent from the Apache CarbonData Dev Mailing List archive at Nabble.com.
