Hi Jacky

Thanks for starting a good discussion.

Let me see if I understand your points:
Scenario 1 is like the current data load solution (0.2.0). 1.0.0 will provide
a new "single-pass data loading" option to meet this kind of scenario: for
subsequent data loads, if most of the dictionary codes have already been
built, the user can add the "single-pass data loading" option to the data
load command to reduce scans of the input (which can improve performance).

+1 to adding the "single-pass data loading" solution, if my understanding is
correct.
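
To check my reading of the scenarios in your post (quoted below), here is a
minimal sketch of the job selection. The enum and function names are
hypothetical and are not CarbonData APIs. I am also assuming that "10K" means
10,000 and that whether the dictionary is "mostly built" is something the
user judges and passes in.

```python
from enum import Enum


class LoadPlan(Enum):
    # Two jobs: first DictionaryOutputFormat, then TableOutputFormat.
    TWO_JOBS = "two jobs: DictionaryOutputFormat, then TableOutputFormat"
    # One job: TableOutputFormat with single-pass support.
    SINGLE_PASS = "one job: TableOutputFormat with single-pass support"


# Threshold from the quoted post ("> 10K"); assumed here to mean 10,000.
CARDINALITY_THRESHOLD = 10_000


def choose_load_plan(load_covers_most_dictionary: bool,
                     dictionary_mostly_built: bool,
                     biggest_column_cardinality: int) -> LoadPlan:
    """Pick a load plan following the two scenarios in the quoted post."""
    # Subsequent loads (Scenario 2), or Scenario 1 "after some loads":
    # the dictionary already covers most values, so one job is enough.
    if dictionary_mostly_built:
        return LoadPlan.SINGLE_PASS
    # Scenario 1, first few loads: the load is too small to cover most of
    # the dictionary, so run the two jobs.
    if not load_covers_most_dictionary:
        return LoadPlan.TWO_JOBS
    # Scenario 2, first load: decide by the biggest column cardinality.
    if biggest_column_cardinality > CARDINALITY_THRESHOLD:
        return LoadPlan.TWO_JOBS
    return LoadPlan.SINGLE_PASS
```

For example, choose_load_plan(True, False, 20_000) picks the two-job plan,
while any load with dictionary_mostly_built=True picks single-pass.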

Regards
Liang


Jacky Li wrote
> Hi community, 
> 
> Sorry for the incorrect formatting of previous post. I corrected it in
> this post.
> 
> Since CarbonData has a global dictionary feature, loading data into
> CarbonData currently requires two scans of the input data. The first
> scan generates the dictionary; the second scan does the actual data
> encoding and writes the carbon files. This approach is simple, but it
> has at least two problems: 
> 1. It involves unnecessary IO reads. 
> 2. It needs two jobs for a MapReduce application to write carbon files. 
> 
> To solve this, we need a single-pass data loading solution, as discussed
> earlier, and the community is now developing it (CARBONDATA-401, PR310). 
> 
> In this post, I want to discuss the OutputFormat part. I think there will
> be two OutputFormats for CarbonData: 
> 1. DictionaryOutputFormat, which is used for the global dictionary
> generation. (This should be extracted from CarbonColumnDictGeneratRDD) 
> 2. TableOutputFormat, which is used for writing CarbonData files. 
> 
> Once carbon has these output formats, it is easier to integrate with
> compute frameworks like spark, hive, mapreduce. 
> And in order to make data loading faster, users can choose a different
> solution based on their scenario, as follows:
> 
> Scenario 1: First load is small (cannot cover most of the dictionary) 
> 1) for the first few loads
> run two jobs that use DictionaryOutputFormat and TableOutputFormat
> respectively
>  
> 2) after some loads
> It becomes like Scenario 2, so the user can just run one job that uses
> TableOutputFormat with single-pass support
> 
> Scenario 2: First load is big (can cover most of the dictionary) 
> 1) for the first load 
> If the biggest column cardinality > 10K, run two jobs using the two output
> formats. Otherwise, run one job that uses TableOutputFormat with
> single-pass support
> 
> 2) for subsequent loads
> Run one job that uses TableOutputFormat with single-pass support
> 
> What do you think of this idea? 
> 
> Regards, 
> Jacky
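
For readers following the thread, the difference between the two-scan load
and a single-pass load described in the quoted post can be illustrated with a
toy in-memory dictionary encoder. This is only a sketch and not CarbonData's
implementation: both function names are hypothetical, codes are simply
assigned in order of first appearance, and the way the single-pass version
assigns a code to a value missing from the dictionary is my assumption, since
the post does not describe it.

```python
def two_scan_encode(rows):
    """Toy two-scan load: scan 1 builds the dictionary, scan 2 encodes."""
    # Scan 1: assign a code to every distinct value in the input.
    dictionary = {}
    for value in rows:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    # Scan 2: read the input again and replace each value by its code.
    encoded = [dictionary[value] for value in rows]
    return encoded, dictionary


def single_pass_encode(rows, dictionary):
    """Toy single-pass load: encode while reading the input once.

    `dictionary` is the mapping built by earlier loads. It is updated in
    place when a value is missing (assumed behaviour, see the note above).
    """
    encoded = []
    for value in rows:
        code = dictionary.get(value)
        if code is None:
            # Value not seen in earlier loads: give it the next free code.
            code = len(dictionary) + 1
            dictionary[value] = code
        encoded.append(code)
    return encoded


rows = ["cn", "us", "cn", "in"]
codes, built = two_scan_encode(rows)
# A later load reuses the dictionary and reads its input only once.
later_codes = single_pass_encode(["us", "in", "jp"], built)
```

When most values are already in the dictionary, the single-pass version does
almost no dictionary work, which matches the case where the second scan of
the input is pure overhead.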





