+1 for having separate output formats; this gives users the flexibility to
choose the right one for their scenario.

On Fri, Dec 16, 2016, 2:47 AM Jihong Ma <jihong...@huawei.com> wrote:

> It is a great idea to have a separate OutputFormat for regular Carbon data
> files, index files, and metadata files (for instance: the dictionary
> file, schema file, global index file, etc.) for writing Carbon-generated
> files laid out on HDFS, and it is orthogonal to the actual data load process.
> Regards,
> Jihong
> -----Original Message-----
> From: Jacky Li [mailto:jacky.li...@qq.com]
> Sent: Thursday, December 15, 2016 12:55 AM
> To: dev@carbondata.incubator.apache.org
> Subject: [DISCUSSION] CarbonData loading solution discussion
> Hi community,
> Since CarbonData has a global dictionary feature, loading data into
> CarbonData currently requires two scans of the input data. The first
> scan generates the dictionary; the second scan does the actual data
> encoding and writes the carbon files. This approach is simple, but it
> has at least two problems:
> 1. It involves unnecessary IO reads.
> 2. A MapReduce application needs two jobs to write carbon files.
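The two-scan scheme described above can be sketched in a few lines. This is a minimal illustration, not CarbonData's actual code: pass 1 scans the input to build a global dictionary for a column, and pass 2 re-scans it to encode each value as its dictionary id.

```python
# Toy sketch of two-pass dictionary encoding (not CarbonData's real
# implementation): the input is read twice, once per pass.

def build_dictionary(rows, column):
    """Pass 1: scan all input rows, assigning an id to each distinct value."""
    dictionary = {}
    for row in rows:
        value = row[column]
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next surrogate key
    return dictionary

def encode(rows, column, dictionary):
    """Pass 2: scan the input again, replacing values with dictionary ids."""
    return [dictionary[row[column]] for row in rows]

rows = [{"city": "beijing"}, {"city": "shanghai"}, {"city": "beijing"}]
d = build_dictionary(rows, "city")   # first scan
encoded = encode(rows, "city", d)    # second scan
print(encoded)  # [0, 1, 0]
```

The second scan is pure overhead if the dictionary could have been grown while encoding, which is the motivation for the single-pass work below.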
> To solve this, we need a single-pass data loading solution, which was
> discussed earlier and is now being developed by the community
> (CARBONDATA-401, PR310).
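For contrast, a rough single-process illustration of the single-pass idea (this is not the actual CARBONDATA-401 design, which coordinates dictionary ids across distributed tasks): the dictionary grows on the fly during encoding, so the input is scanned only once.

```python
# Toy single-pass loading sketch: dictionary ids are assigned on first
# sight of a value, so one scan produces both dictionary and encoding.

def load_single_pass(rows, column):
    dictionary = {}
    encoded = []
    for row in rows:  # the one and only scan of the input
        value = row[column]
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # assign id on first sight
        encoded.append(dictionary[value])
    return dictionary, encoded

d, encoded = load_single_pass(
    [{"city": "beijing"}, {"city": "shanghai"}, {"city": "beijing"}],
    "city")
print(encoded)  # [0, 1, 0]
```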
> In this post, I want to discuss the OutputFormat part. I think there should
> be two OutputFormats for CarbonData:
> 1. DictionaryOutputFormat, which is used for global dictionary
> generation. (This should be extracted from CarbonColumnDictGeneratRDD.)
> 2. TableOutputFormat, which is used for writing CarbonData files.
> Once carbon has these output formats, it becomes easier to integrate with
> compute frameworks like Spark, Hive, and MapReduce.
> And in order to make data loading faster, users can choose a different
> solution based on their scenario, as follows:
> Scenario 1: the first load is small (cannot cover most of the dictionary)
> - For the first few loads, run two jobs that use DictionaryOutputFormat
> and TableOutputFormat respectively.
> - After some loads it becomes like Scenario 2: run one job that uses
> TableOutputFormat with single-pass.
> Scenario 2: the first load is big (can cover most of the dictionary)
> - For the first load: if the biggest column cardinality > 10K, run two
> jobs using the two output formats; otherwise, run one job that uses
> TableOutputFormat with single-pass.
> - For subsequent loads, run one job that uses TableOutputFormat with
> single-pass.
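The scenario-based choice above can be summed up in a small decision helper. The OutputFormat names and the 10K cardinality threshold come straight from the proposal; the function and its flags (the "first few loads" condition is simplified here to a single first-load flag) are illustrative only.

```python
# Illustrative decision helper for the loading strategies proposed
# above. "two-jobs" means DictionaryOutputFormat + TableOutputFormat;
# "single-pass" means one job using TableOutputFormat with on-the-fly
# dictionary generation.

CARDINALITY_THRESHOLD = 10_000  # the 10K limit from the proposal

def choose_load_plan(is_first_load, load_covers_dictionary,
                     biggest_column_cardinality):
    if not is_first_load:
        # Later loads in both scenarios use the single-pass job.
        return "single-pass"
    if not load_covers_dictionary:
        # Scenario 1: small first load, dictionary mostly unknown.
        return "two-jobs"
    # Scenario 2: big first load; fall back to two jobs only for
    # very high cardinality columns.
    if biggest_column_cardinality > CARDINALITY_THRESHOLD:
        return "two-jobs"
    return "single-pass"

print(choose_load_plan(True, False, 500))     # two-jobs
print(choose_load_plan(True, True, 50_000))   # two-jobs
print(choose_load_plan(False, True, 50_000))  # single-pass
```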
> What do you think of this idea?
> Regards,
> Jacky
