[jira] [Resolved] (CARBONDATA-1805) Optimize pruning for dictionary loading

Jacky Li (JIRA) Mon, 18 Dec 2017 00:26:37 -0800

     [ 
https://issues.apache.org/jira/browse/CARBONDATA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jacky Li resolved CARBONDATA-1805.
----------------------------------
    Resolution: Fixed

> Optimize pruning for dictionary loading
> ---------------------------------------
>
>                 Key: CARBONDATA-1805
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1805
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: data-load, spark-integration
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>             Fix For: 1.3.0
>
>          Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> # SCENARIO
> Recently I tried the dictionary feature in CarbonData and found that its 
> dictionary generation phase during data loading is quite slow. My scenario is 
> as follows:
> + Input data: a 35.8 GB CSV file with 199 columns and 126 million lines
> + Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values, 
> respectively
> The whole data load takes about 2.9 min for dictionary generation and 
> 4.6 min for fact data loading, so about 39% of the time is spent on the 
> dictionary.
> According to the nmon results, CPU usage was quite high during the 
> dictionary generation phase, while disk and network usage were normal.
> # ANALYZE
> After going through the dictionary generation code, I found that CarbonData 
> already prunes non-dictionary columns before generating the dictionary. 
> The problem is that `the pruning comes after data file reading`, which 
> causes some overhead. We can optimize it with `prune while reading data 
> file`.
> # RESOLVE
> Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that the 
> non-dictionary columns are pruned while the data file is being read.
> After implementing this optimization, dictionary generation takes only 
> `29s`, `about 6 times faster than before` (2.9 min), and fact data loading 
> takes the same time as before (4.6 min), so about 10% of the time is spent on 
> the dictionary.
> # NOTE
> + Currently only `load data file` benefits from this optimization; 
> `load data frame` does not.
> + Before implementing this solution, I tried another one: caching the 
> dataframe of the data file. The performance was even worse, with a dictionary 
> generation time of 5.6 min.
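
The difference between the two pruning strategies described in the ANALYZE and RESOLVE sections above can be illustrated with a minimal Python sketch. This is not CarbonData code and does not reproduce `loadDataFrame`; the function names, the column names, and the sample data are all hypothetical and chosen only for illustration. The first function materializes every column of every row before pruning, while the second keeps only the dictionary columns as each line is read.

```python
import csv
import io


def distinct_values_prune_after_read(lines, header, dict_columns):
    """Baseline (hypothetical): build full rows first, prune afterwards."""
    # Every field of every row is materialized, even columns never used.
    rows = [dict(zip(header, fields)) for fields in csv.reader(lines)]
    return {col: {row[col] for row in rows} for col in dict_columns}


def distinct_values_prune_while_reading(lines, header, dict_columns):
    """Optimized (hypothetical): keep only dictionary columns per line."""
    # Resolve column positions once, outside the per-line loop.
    indexes = [(col, header.index(col)) for col in dict_columns]
    distinct = {col: set() for col in dict_columns}
    for fields in csv.reader(lines):
        # Non-dictionary fields are dropped immediately and never stored.
        for col, idx in indexes:
            distinct[col].add(fields[idx])
    return distinct


# Tiny made-up input standing in for the large CSV file.
header = ["id", "city", "gender", "amount"]
data = "1,Shenzhen,M,10\n2,Beijing,F,20\n3,Shenzhen,F,30\n"
dict_columns = ["city", "gender"]

after = distinct_values_prune_after_read(io.StringIO(data), header, dict_columns)
while_reading = distinct_values_prune_while_reading(
    io.StringIO(data), header, dict_columns
)
```

Both functions return the same distinct values per dictionary column; the second simply avoids building and holding rows for columns that are discarded afterwards, which is the kind of overhead the issue attributes to pruning after reading.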



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
