Hi community, 

Sorry for the incorrect formatting of my previous post. I have corrected it
in this post.

Because CarbonData has a global dictionary feature, loading data into
CarbonData currently requires two scans of the input data. The first scan
generates the dictionary; the second scan does the actual data encoding and
writes the carbon files. This approach is simple, but it has at least two
problems:
1. It involves unnecessary I/O reads.
2. A MapReduce application needs two jobs to write carbon files.
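To make the difference concrete, here is a toy sketch in Java of two-scan versus one-scan dictionary encoding for a single column. It is not CarbonData code: the class name, method names, and the choice of surrogate keys starting at 1 are all assumptions made for illustration, and the real single-pass work in CARBONDATA-401 may be designed quite differently.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in CarbonData.
final class DictionaryEncodingSketch {

    // Two scans: the first builds the dictionary, the second encodes the data.
    static int[] encodeTwoPass(List<String> column, Map<String, Integer> dictionary) {
        for (String value : column) {                      // scan 1: dictionary generation
            dictionary.putIfAbsent(value, dictionary.size() + 1);
        }
        int[] encoded = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {          // scan 2: encoding
            encoded[i] = dictionary.get(column.get(i));
        }
        return encoded;
    }

    // One scan: unseen values get a new key at the moment they are encoded.
    static int[] encodeSinglePass(List<String> column, Map<String, Integer> dictionary) {
        int[] encoded = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            String value = column.get(i);
            Integer key = dictionary.get(value);
            if (key == null) {                             // new value: extend the dictionary
                key = dictionary.size() + 1;
                dictionary.put(value, key);
            }
            encoded[i] = key;
        }
        return encoded;
    }
}
```

Both methods produce the same encoding here; the single-pass variant simply reads the input once, which is the I/O saving described above.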

To solve this, we need a single-pass data loading solution, as discussed
earlier, and the community is now developing it (CARBONDATA-401, PR310).

In this post, I want to discuss the OutputFormat part. I think there will be
two OutputFormats for CarbonData:
1. DictionaryOutputFormat, which is used for global dictionary generation.
(This should be extracted from CarbonColumnDictGeneratRDD.)
2. TableOutputFormat, which is used for writing CarbonData files.

Once carbon has these output formats, it will be easier to integrate with
compute frameworks like Spark, Hive, and MapReduce.
To make data loading faster, users can choose a different solution based on
their scenario, as follows:

Scenario 1: First load is small (cannot cover most of the dictionary)
1) For the first few loads
Run two jobs that use DictionaryOutputFormat and TableOutputFormat
respectively.

2) After some loads
It becomes like Scenario 2, so the user can just run one job that uses
TableOutputFormat with single-pass support.

Scenario 2: First load is big (can cover most of the dictionary)
1) For the first load
If the biggest column cardinality > 10K, run two jobs using the two output
formats. Otherwise, run one job that uses TableOutputFormat with single-pass
support.

2) For subsequent loads
Run one job that uses TableOutputFormat with single-pass support.
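
The two scenarios above can be summarized as a small decision function. This is only a sketch of the proposal, not existing CarbonData code: the class, enum, and parameter names are hypothetical, "10K" is read as 10,000, and whether the dictionary is "mostly covered" is assumed to be something the user judges for their own data.

```java
// Hypothetical sketch of the proposed choice between loading strategies.
final class LoadPlanner {

    enum Plan {
        TWO_JOBS,      // DictionaryOutputFormat job, then TableOutputFormat job
        SINGLE_PASS    // one TableOutputFormat job with single-pass support
    }

    // Assumed reading of the "10K" cardinality threshold from the proposal.
    static final long CARDINALITY_THRESHOLD = 10_000L;

    static Plan choosePlan(boolean firstLoadIsBig,
                           boolean isFirstLoad,
                           boolean dictionaryMostlyCovered,
                           long maxColumnCardinality) {
        if (firstLoadIsBig) {
            // Scenario 2: only a first load with a very high-cardinality column needs two jobs.
            if (isFirstLoad && maxColumnCardinality > CARDINALITY_THRESHOLD) {
                return Plan.TWO_JOBS;
            }
            return Plan.SINGLE_PASS;
        }
        // Scenario 1: keep running two jobs until the dictionary covers most values.
        return dictionaryMostlyCovered ? Plan.SINGLE_PASS : Plan.TWO_JOBS;
    }
}
```

For example, LoadPlanner.choosePlan(true, true, false, 20_000L) selects TWO_JOBS, while the same call for a subsequent load (isFirstLoad = false) selects SINGLE_PASS.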

What do you think of this idea?

Regards, 
Jacky


