+1. Now the user will have the flexibility to choose the output format, and will get a performance benefit if the dictionary files are already generated.
Regards
Kumar Vishal

On Fri, Dec 16, 2016 at 10:19 AM, Ravindra Pesala <[email protected]> wrote:

> +1 to have separate output formats; now the user has the flexibility to
> choose as per scenario.
>
> On Fri, Dec 16, 2016, 2:47 AM Jihong Ma <[email protected]> wrote:
>
> >
> > It is a great idea to have a separate OutputFormat for regular Carbon
> > data files, index files as well as metadata files, for instance the
> > dictionary file, schema file, global index file etc., for writing
> > Carbon-generated files laid out on HDFS, and it is orthogonal to the
> > actual data load process.
> >
> > Regards,
> >
> > Jihong
> >
> > -----Original Message-----
> > From: Jacky Li [mailto:[email protected]]
> > Sent: Thursday, December 15, 2016 12:55 AM
> > To: [email protected]
> > Subject: [DISCUSSION] CarbonData loading solution discussion
> >
> >
> > Hi community,
> >
> > Since CarbonData has the global dictionary feature, loading data into
> > CarbonData currently requires two scans of the input data. The first
> > scan generates the dictionary; the second scan does the actual data
> > encoding and writes the carbon files. Obviously, this approach is
> > simple, but it has at least two problems:
> > 1. It involves unnecessary IO reads.
> > 2. A MapReduce application needs two jobs to write carbon files.
> >
> > To solve this, we need a single-pass data loading solution, as
> > discussed earlier, and the community is now developing it
> > (CARBONDATA-401, PR310).
> >
> > In this post, I want to discuss the OutputFormat part. I think there
> > should be two OutputFormats for CarbonData:
> > 1. DictionaryOutputFormat, which is used for global dictionary
> > generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> > 2. TableOutputFormat, which is used for writing CarbonData files.
> >
> > Once carbon has these output formats, it becomes easier to integrate
> > with compute frameworks like Spark, Hive, and MapReduce.
> > And to make data loading faster, the user can choose a different
> > solution based on the scenario, as follows:
> >
> > Scenario 1: First load is small (cannot cover most of the dictionary)
> >
> > - Run two jobs that use DictionaryOutputFormat and TableOutputFormat
> > respectively, for the first few loads.
> > - After some loads, it becomes like Scenario 2: run one job that uses
> > TableOutputFormat with single-pass.
> >
> > Scenario 2: First load is big (can cover most of the dictionary)
> >
> > - For the first load: if the biggest column cardinality > 10K, run two
> > jobs using the two output formats; otherwise, run one job that uses
> > TableOutputFormat with single-pass.
> > - For subsequent loads, run one job that uses TableOutputFormat with
> > single-pass.
> >
> > What do you think of this idea?
> >
> > Regards,
> > Jacky
> >
> >
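
As a concrete illustration of the two OutputFormats proposed in the thread, below is a minimal sketch of DictionaryOutputFormat on top of the Hadoop mapreduce API; TableOutputFormat would follow the same skeleton, writing encoded carbon data files instead of dictionary entries. Apart from the Hadoop classes, everything here (the record-writer body, the "carbon.dictionary.path" configuration key) is a hypothetical placeholder, not the actual CarbonData implementation.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

/** Writes (columnName, distinctValue) pairs as global dictionary files. */
public class DictionaryOutputFormat extends OutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
      throws IOException {
    return new RecordWriter<Text, Text>() {
      @Override
      public void write(Text column, Text value) {
        // A real writer would append 'value' to the dictionary file of
        // 'column' only if it is not already present (omitted here).
      }

      @Override
      public void close(TaskAttemptContext ctx) {
        // Flush and close the per-column dictionary files (omitted here).
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // e.g. verify that the table path and schema exist before the job runs.
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException {
    // "carbon.dictionary.path" is a made-up key for this sketch; it points
    // at the directory where the dictionary files should be committed.
    Path out =
        new Path(context.getConfiguration().get("carbon.dictionary.path"));
    return new FileOutputCommitter(out, context);
  }
}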

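The scenario discussion at the end of Jacky's mail condenses into a single decision rule: run two jobs only while the dictionary is still incomplete and the incoming data is high-cardinality, and fall back to a single-pass job otherwise. A sketch of just that rule, using the 10K threshold quoted in the mail; submitJob() is a hypothetical helper, and none of this is CarbonData's actual load path.

/** Condenses the Scenario 1 / Scenario 2 rule from the mail into one check. */
public final class LoadStrategySketch {

  // Threshold quoted in the mail: above this cardinality, pre-generating
  // the dictionary in a separate job is worth the extra scan.
  private static final long CARDINALITY_THRESHOLD = 10_000L;

  static void runLoad(boolean dictionaryMostlyCovered,
                      long biggestColumnCardinality) {
    if (!dictionaryMostlyCovered
        && biggestColumnCardinality > CARDINALITY_THRESHOLD) {
      // Dictionary still incomplete and cardinality is high: run two jobs,
      // first DictionaryOutputFormat, then TableOutputFormat.
      submitJob("DictionaryOutputFormat");
      submitJob("TableOutputFormat");
    } else {
      // Dictionary mostly covered, or cardinality low enough to build it
      // on the fly: one single-pass job with TableOutputFormat.
      submitJob("TableOutputFormat (single-pass)");
    }
  }

  // Hypothetical placeholder for configuring and submitting a
  // MapReduce/Spark job that writes through the given OutputFormat.
  private static void submitJob(String outputFormat) {
    System.out.println("submitting job using " + outputFormat);
  }
}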