Hi Jacky,

Thanks, you started a good discussion.

Let me see if I understand your points: Scenario 1 is like the current data
load solution (0.2.0). 1.0.0 will provide a new "single-pass data loading"
option to meet this kind of scenario: for subsequent data loads, if most of
the dictionary codes have already been built, the user can add the
"single-pass data loading" option to the data load command to reduce scans
of the input data (which can improve performance).

+1 to adding the "single-pass data loading" solution, if my understanding is
correct.

Regards,
Liang


Jacky Li wrote
> Hi community,
>
> Sorry for the incorrect formatting of the previous post. I corrected it in
> this post.
>
> Since CarbonData has a global dictionary feature, loading data into
> CarbonData currently requires two scans of the input data. The first scan
> generates the dictionary; the second scan does the actual data encoding and
> writes the carbon files. This approach is simple, but it has at least two
> problems:
> 1. It involves unnecessary IO reads.
> 2. It needs two jobs for a MapReduce application to write carbon files.
>
> To solve this, we need a single-pass data loading solution, as discussed
> earlier, and the community is now developing it (CARBONDATA-401, PR310).
>
> In this post, I want to discuss the OutputFormat part. I think there will
> be two OutputFormats for CarbonData:
> 1. DictionaryOutputFormat, which is used for global dictionary
> generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> 2. TableOutputFormat, which is used for writing CarbonData files.
>
> Once carbon has these output formats, it is easier to integrate with
> compute frameworks like Spark, Hive, and MapReduce.
> And in order to make data loading faster, the user can choose a different
> solution based on the scenario, as follows:
>
> Scenario 1: First load is small (cannot cover most of the dictionary)
> 1) For the first few loads
> Run two jobs that use DictionaryOutputFormat and TableOutputFormat
> respectively.
>
> 2) After some loads
> It becomes like Scenario 2, so the user can just run one job that uses
> TableOutputFormat with single-pass support.
>
> Scenario 2: First load is big (can cover most of the dictionary)
> 1) For the first load
> If the biggest column cardinality > 10K, run two jobs using the two output
> formats. Otherwise, run one job that uses TableOutputFormat with
> single-pass support.
>
> 2) For subsequent loads
> Run one job that uses TableOutputFormat with single-pass support.
>
> What do you think of this idea?
>
> Regards,
> Jacky
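
To make the two scenarios above concrete, here is a minimal sketch of the
decision rule as I read it from the proposal. It is not CarbonData code: the
function name, the parameter names, and the job labels are hypothetical, and
"covers most of the dictionary" is treated as a flag the caller supplies,
since the thread does not say how it would be measured. Only the 10K
cardinality threshold comes from the proposal itself.

```python
from typing import List

# Threshold from the proposal: "biggest column cardinality > 10K".
CARDINALITY_THRESHOLD = 10_000

# Hypothetical labels naming the jobs discussed in the thread.
TWO_JOBS = ["DictionaryOutputFormat", "TableOutputFormat"]
SINGLE_PASS = ["TableOutputFormat (single-pass)"]


def plan_load(
    is_first_load: bool,
    load_covers_most_dictionary: bool,
    dictionary_mostly_built: bool,
    max_column_cardinality: int,
) -> List[str]:
    """Return the list of jobs to run for one data load.

    load_covers_most_dictionary: assumption, the caller knows whether this
        load is big enough to cover most dictionary values (Scenario 2).
    dictionary_mostly_built: assumption, the caller knows whether earlier
        loads have already built most of the dictionary.
    """
    if is_first_load:
        # Scenario 2, first load: single pass only for a big load whose
        # largest column cardinality does not exceed the threshold.
        if (load_covers_most_dictionary
                and max_column_cardinality <= CARDINALITY_THRESHOLD):
            return list(SINGLE_PASS)
        # Scenario 1 first load, or Scenario 2 with a high-cardinality column.
        return list(TWO_JOBS)

    # Subsequent loads: single pass once the dictionary is mostly built,
    # otherwise keep running the dictionary job first (Scenario 1, early loads).
    if dictionary_mostly_built:
        return list(SINGLE_PASS)
    return list(TWO_JOBS)


if __name__ == "__main__":
    # Scenario 1, first small load: two jobs.
    print(plan_load(True, False, False, 500))
    # Scenario 2, subsequent load: one single-pass job.
    print(plan_load(False, True, True, 50_000))
```

Whether the threshold check should also apply to later loads is left open in
the proposal, so the sketch applies it only to the first load.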
