GitHub user xuchuanyin opened a pull request:
https://github.com/apache/carbondata/pull/1559
[CARBONDATA-1805][Dictionary] Optimize pruning for dictionary loading
Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:
- [X] Any interfaces changed?
`NO`
- [X] Any backward compatibility impacted?
`NO`
- [X] Document update required?
`NO`
- [X] Testing done
Please provide details on
- Whether new unit test cases have been added or why no new tests
are required?
`NO TESTS ADDED, PERFORMANCE ENHANCEMENT DIDN'T AFFECT THE
FUNCTIONALITY`
- How it is tested? Please attach test report.
`TESTED IN CLUSTER WITH REAL DATA`
- Is it a performance related change? Please attach the performance
test report.
`PERFORMANCE ENHANCED, DICTIONARY TIME REDUCED FROM 2.9MIN TO 29SEC`
- Any additional information to help reviewers in testing this
change.
`NO`
- [X] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
`NOT RELATED`
COPY FROM JIRA
===
# SCENARIO
Recently I have tried dictionary feature in Carbondata and found its
dictionary generating phase in data loading is quite slow. My scenario is as
below:
+ Input Data: 35.8GB CSV file with 199 columns and 126 Million lines
+ Dictionary columns: 3 columns each containing 19213,4,9 distinct values
The whole data loading consumes about 2.9min for dictionary generating and
4.6min for fact data loading -- about 39% of the time are spent on dictionary.
Having observed the nmon result, Ifound the CPU usage were quite high
during the dictionary generating phase and the Disk, Network were quite normal.
# ANALYZE
After I went through the dictionary generating related code, I found
Carbondata aleady prune non-dictionary columns before generating dictionary.
But the problem is that `the pruning comes after data file reading`, this will
cause some overhead, we can optimize it by `prune while reading data file`.
# RESOLVE
Refactor the `loadDataFrame` method in `GlobalDictionaryUtil`, only pruning
the non-dictionary columns while reading the data file.
After implementing the above optimization, the dictionary generating costs
only `29s` -- **`about 6 times better than before`**(2.9min), and the fact data
loading costs the same as before(4.6min), about 10% of the time are spent on
dictionary.
# NOTE
+ Currently only `load data file` will benefit from this optimization,
while `load data frame` will not.
+ Before implementing this solution, I tried another solution -- cache
dataframe of the data file, the performance was even worse -- the dictionary
generating time was 5.6min.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/xuchuanyin/carbondata opt_dict_load
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/1559.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1559
----
commit e8e49ed54085700eadde81842af0b0daecaed12a
Author: xuchuanyin <[email protected]>
Date: 2017-11-24T03:27:02Z
optimize pruning for dictionary loading
----
---