GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/1077
[CARBONDATA-1213] Removed rowCountPercentage check and fixed IUD data load
issue
Problems:
1. The row count percentage check is not required alongside the high
cardinality threshold check.
2. IUD returns incorrect results when an update is run on a high cardinality
column.
Analysis:
1. A column identified as a high cardinality column is still not converted to
a no-dictionary column because of a second check on a parameter called
rowCountPercentage, whose default value is 80%. As a result, even when a
column is identified as high cardinality, it is treated as a dictionary column
if its cardinality is less than 80% of the total number of rows. This can
still lead to executor-lost failures due to memory constraints.
2. RLE is not being set correctly on a column. Because of a design flaw,
whether RLE is applicable to a column is decided in a different part of the
code from the one that actually applies RLE to the column. As a result, the
footer is filled with incorrect RLE information and the query fails.
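To make point 1 concrete, here is a minimal sketch of the two checks as
described above. It is not the CarbonData code: the class name, method names
and the threshold value are hypothetical, and only the 80% default is taken
from this description.

```java
public class HighCardinalityCheck {
    // Hypothetical threshold, chosen only for illustration.
    static final long HIGH_CARDINALITY_THRESHOLD = 1_000_000L;
    // Default stated in the analysis above.
    static final double ROW_COUNT_PERCENTAGE = 80.0;

    // Behaviour as described before the fix: the column becomes
    // no-dictionary only if BOTH conditions hold.
    static boolean isNoDictionaryBefore(long cardinality, long rowCount) {
        return cardinality > HIGH_CARDINALITY_THRESHOLD
            && cardinality >= rowCount * ROW_COUNT_PERCENTAGE / 100.0;
    }

    // Behaviour proposed by the fix: the threshold alone decides.
    static boolean isNoDictionaryAfter(long cardinality) {
        return cardinality > HIGH_CARDINALITY_THRESHOLD;
    }

    public static void main(String[] args) {
        long rowCount = 10_000_000L;
        long cardinality = 5_000_000L; // above the threshold, but only 50% of rows
        System.out.println(isNoDictionaryBefore(cardinality, rowCount)); // false
        System.out.println(isNoDictionaryAfter(cardinality));            // true
    }
}
```

With these illustrative numbers the column has 5 million distinct values yet
stays a dictionary column under the old check, which is the memory risk
described in point 1.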
Fix:
1. Remove the unnecessary rowCountPercentage check.
2. Decide RLE applicability for a column in one common place in the code.
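The idea behind fix 2 can be sketched as follows. This is an illustration
under stated assumptions, not the code in this pull request: all names are
hypothetical, and the rule inside decideRle is a placeholder, since the actual
applicability rule is not given here. The point is only that the encoding step
and the footer step both read one shared decision instead of each computing
their own.

```java
import java.util.Arrays;

public class RleDecisionSketch {
    // Single place where RLE applicability is decided.
    // Placeholder rule (NOT CarbonData's actual logic): RLE for dictionary columns only.
    static boolean[] decideRle(boolean[] isDictionaryColumn) {
        return Arrays.copyOf(isDictionaryColumn, isDictionaryColumn.length);
    }

    // Encoding step: reports which columns were RLE-encoded, driven by the shared decision.
    static boolean[] encodeColumns(boolean[] rleDecision) {
        boolean[] applied = new boolean[rleDecision.length];
        for (int i = 0; i < rleDecision.length; i++) {
            applied[i] = rleDecision[i]; // a real writer would run the encoder here
        }
        return applied;
    }

    // Footer step: records the same decision instead of recomputing it.
    static boolean[] buildFooterRleFlags(boolean[] rleDecision) {
        return rleDecision.clone();
    }

    public static void main(String[] args) {
        boolean[] decision = decideRle(new boolean[] {true, false, true});
        boolean[] applied = encodeColumns(decision);
        boolean[] footer = buildFooterRleFlags(decision);
        // The footer now always matches what was actually applied.
        System.out.println(Arrays.equals(applied, footer));
    }
}
```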
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishgupta88/incubator-carbondata high_cardinlaity_identification_fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1077.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1077
----
commit 9c1291dc84b8fc4247a9d6e32d4482685d40325a
Author: manishgupta88 <[email protected]>
Date: 2017-06-22T09:07:13Z
----