xuchuanyin created CARBONDATA-2091:
--------------------------------------

             Summary: Enhance data loading performance by specifying range 
bounds for sort columns
                 Key: CARBONDATA-2091
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
             Project: CarbonData
          Issue Type: Improvement
            Reporter: xuchuanyin
            Assignee: xuchuanyin


Currently in CarbonData, data loading with node_sort (also known as 
local_sort) goes through the following steps:
 # Convert the input data in batches. (*Convert*)
 # Sort each batch and write it to a sort temp file. (*TempSort*)
 # Merge-sort groups of sort temp files into larger ordered sort temp 
files. (*MergeSort*)
 # Combine all the sort temp files in a final sort, whose results feed the 
next step. (*FinalSort*)
 # Read the rows in order and convert them to CarbonData columnar format 
pages. (*produce*)
 # Write bundles of pages to files and write the corresponding index file. 
(*consume*)

Steps 1 to 3 run concurrently on multiple threads. Step 4 runs on a single 
thread, and Step 5 is multi-threaded again. Step 4 is therefore the 
bottleneck of the whole procedure: when observing data loading performance, 
we can see that CPU usage is low once Step 3 has finished.
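
To illustrate why a final sort of this kind tends to be serial, here is a 
minimal sketch of a k-way merge over already sorted runs. It assumes the 
final sort behaves like a heap-based merge; the class and method names 
(FinalMergeSketch, mergeRuns) are hypothetical and are not taken from the 
CarbonData code base. Every output row depends on the current heads of all 
runs, so a single loop drives the whole merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not CarbonData's actual FinalSort code.
class FinalMergeSketch {

    // Current smallest unread value of one sorted run, plus the rest of that run.
    static final class Head {
        final int value;
        final Iterator<Integer> rest;

        Head(int value, Iterator<Integer> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    // Merges sorted runs into one sorted list using a min-heap of run heads.
    public static List<Integer> mergeRuns(List<List<Integer>> sortedRuns) {
        PriorityQueue<Head> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Integer> out = new ArrayList<>();
        // Single consumer loop: each poll must see the heads of all runs.
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            out.add(h.value);
            if (h.rest.hasNext()) {
                heap.add(new Head(h.rest.next(), h.rest));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 8),
            Arrays.asList(5, 6, 7));
        System.out.println(mergeRuns(runs)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```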

 

We can enhance the data loading performance by parallelizing Step4.

 

Users can specify range bounds for the sort columns. CarbonData then 
internally distributes the records into the corresponding ranges and 
processes the ranges concurrently.
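
The idea can be sketched as follows, assuming a single integer sort column 
and user-supplied bounds in ascending order. All names here 
(RangeSortSketch, rangeIndex, sortByRanges) are hypothetical and only 
illustrate the technique; they are not the proposed CarbonData API. Each 
record is routed to a range by binary search on the bounds, and the ranges 
are then sorted independently on separate threads. Because the ranges do not 
overlap, no global merge is needed afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not the proposed CarbonData implementation.
class RangeSortSketch {

    // Returns the range a key belongs to. With bounds {b0, b1, ...} sorted
    // ascending: range 0 holds key < b0, range i holds b(i-1) <= key < b(i),
    // and the last range holds key >= the last bound.
    public static int rangeIndex(int[] bounds, int key) {
        int lo = 0;
        int hi = bounds.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (key < bounds[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    // Distributes keys into ranges, sorts every range on its own thread,
    // and concatenates the sorted ranges in range order.
    public static int[] sortByRanges(int[] keys, int[] bounds) {
        List<List<Integer>> ranges = new ArrayList<>();
        for (int i = 0; i <= bounds.length; i++) {
            ranges.add(new ArrayList<>());
        }
        for (int key : keys) {
            ranges.get(rangeIndex(bounds, key)).add(key);
        }

        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<CompletableFuture<int[]>> tasks = new ArrayList<>();
            for (List<Integer> range : ranges) {
                tasks.add(CompletableFuture.supplyAsync(() -> {
                    int[] sorted = range.stream().mapToInt(Integer::intValue).toArray();
                    Arrays.sort(sorted);
                    return sorted;
                }, pool));
            }
            int[] out = new int[keys.length];
            int pos = 0;
            for (CompletableFuture<int[]> task : tasks) {
                int[] sorted = task.join();
                System.arraycopy(sorted, 0, out, pos, sorted.length);
                pos += sorted.length;
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] bounds = {10, 20};
        int[] keys = {25, 3, 17, 10, 42, 8, 20};
        System.out.println(Arrays.toString(sortByRanges(keys, bounds)));
    }
}
```

The concatenation at the end is only there to make the sketch easy to check. 
In a loading pipeline, each range could presumably feed its own downstream 
step instead. How evenly the work is spread depends on how well the chosen 
bounds match the data distribution.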



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
