Re: Discussion(New feature) regarding single pass data loading solution.

Liang Chen Sun, 16 Oct 2016 18:45:43 -0700

+1 for this:
-------------------------
Maybe we can give it a try; after all, it will be just one interface
implementation for dictionary generation. We can have multiple
implementations and then decide based on which performs best.


Regards
Liang
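
To make the "one interface, multiple implementations" idea concrete, here is a
minimal sketch. The names DictionaryGenerator, InMemoryDictionaryGenerator and
encode_column are hypothetical and are not CarbonData APIs; the in-memory class
only stands in for a real backend such as a distributed map or ZooKeeper + HDFS.

```python
from abc import ABC, abstractmethod


class DictionaryGenerator(ABC):
    """Hypothetical interface: map a column value to an integer surrogate key."""

    @abstractmethod
    def get_or_generate_key(self, value):
        ...


class InMemoryDictionaryGenerator(DictionaryGenerator):
    """Single-process stand-in; a distributed backend would implement
    the same method (assumption for illustration)."""

    def __init__(self):
        self._keys = {}

    def get_or_generate_key(self, value):
        # Assign the next key the first time a value is seen.
        if value not in self._keys:
            self._keys[value] = len(self._keys) + 1
        return self._keys[value]


def encode_column(values, generator):
    """Encode a column in one pass, whichever backend is plugged in."""
    return [generator.get_or_generate_key(v) for v in values]


encoded = encode_column(["cn", "us", "cn", "in"], InMemoryDictionaryGenerator())
print(encoded)  # [1, 2, 1, 3]
```

Because the loader only depends on the interface, each candidate backend could
be benchmarked by swapping the object passed to encode_column.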


Ravindra Pesala wrote
> Hi Jacky/Jihong,
> 
> I agree that new dictionary values are few in the case of an incremental
> data load, but that depends entirely on the user's data. In some user
> scenarios there may be many new dictionary values; we cannot rule that out.
> Also, for the user's convenience we should provide a single-pass solution
> without insisting that they run an external tool first. We can offer the
> option of running the external tool first and supplying the dictionary to
> improve performance.
> 
> In my opinion it is better to use a professional distributed map such as
> Hazelcast than ZooKeeper + HDFS. It is lightweight and does not require a
> separate cluster; it can form the cluster within the executor JVMs.
> Maybe we can give it a try; after all, it will be just one interface
> implementation for dictionary generation. We can have multiple
> implementations and then decide based on which performs best.
> 
> Regards,
> Ravi
> 
> On 15 October 2016 at 10:50, Jacky Li <jacky.likun@> wrote:
> 
>> Hi,
>>
>> I can offer one more approach for this discussion. Since new dictionary
>> values are rare in an incremental load (provided the first load captures
>> as many dictionary values as possible), synchronization should also be
>> rare. So how about using ZooKeeper + an HDFS file to provide this service?
>> This is what Carbon does today; we can wrap ZooKeeper + HDFS to provide
>> the global dictionary interface.
>> It has these benefits:
>> 1. Automated: no need to bother the user.
>> 2. No new dependencies: we already use ZooKeeper and HDFS.
>> 3. Performance? New dictionary values, and hence synchronization, are rare.
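
The "synchronization is rare" argument can be sketched as follows. This is an
illustration only: threading.Lock stands in for a ZooKeeper lock, a plain dict
stands in for the dictionary file on HDFS, and the class names are hypothetical.

```python
import threading


class GlobalDictionary:
    """Shared dictionary; the lock and the dict are stand-ins for
    ZooKeeper and an HDFS file (assumptions for illustration)."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._store = dict(initial)
        self.sync_count = 0  # how often a loader had to take the lock to add

    def add_if_absent(self, value):
        with self._lock:
            self.sync_count += 1
            if value not in self._store:
                self._store[value] = len(self._store) + 1
            return self._store[value]

    def snapshot(self):
        with self._lock:
            return dict(self._store)


class Loader:
    """One load task: answers from its local copy, syncs only on a miss."""

    def __init__(self, shared):
        self._shared = shared
        self._local = shared.snapshot()

    def encode(self, value):
        key = self._local.get(value)
        if key is None:  # rare path: a new dictionary value
            key = self._shared.add_if_absent(value)
            self._local[value] = key
        return key


shared = GlobalDictionary({"cn": 1, "us": 2})  # built by the first load
loader = Loader(shared)
codes = [loader.encode(v) for v in ["cn", "us", "cn", "in", "in"]]
print(codes, shared.sync_count)  # [1, 2, 1, 3, 3] 1
```

Five values are encoded but the lock is taken only once, for the single new
value; how this behaves when many values are new is the open question raised in
the replies.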
>>
>> What do you think?
>>
>> Regards,
>> Jacky
>>
>> > On 15 October 2016, at 2:38 AM, Jihong Ma <Jihong.Ma@> wrote:
>> >
>> > Hi Ravi,
>> >
>> > The major concern I have with generating the global dictionary from
>> > scratch in a single scan is performance. Handling an occasional update
>> > to the dictionary is far simpler and more cost-effective in terms of
>> > synchronization cost and refreshing the global/local cache copy.
>> >
>> > There is a lot to worry about with a distributed map, and leveraging a
>> > KV store is overkill if it is just for dictionary generation.
>> >
>> > Regards.
>> >
>> > Jihong
>> >
>> > -----Original Message-----
>> > From: Ravindra Pesala [mailto:ravi.pesala@]
>> > Sent: Friday, October 14, 2016 11:03 AM
>> > To: dev
>> > Subject: Re: Discussion(New feature) regarding single pass data loading solution.
>> >
>> > Hi Jihong,
>> >
>> > I agree, we can use an external tool for the first load, but for
>> > incremental loads we should have a solution for adding to the global
>> > dictionary. That solution should be sufficient to generate the global
>> > dictionary even if the user does not use the external tool the first
>> > time. It could be a distributed map or a KV store.
>> >
>> > Regards,
>> > Ravi.
>> >
>> > On 14 October 2016 at 23:12, Jihong Ma <Jihong.Ma@> wrote:
>> >
>> >> Hi Liang,
>> >>
>> >> This tool is more or less like the first load, run once after the table
>> >> is created. Any subsequent or incremental load then proceeds and can
>> >> update the global dictionary when it encounters a new value. This is
>> >> the easiest way of achieving a one-pass data loading process without
>> >> too much overhead.
>> >>
>> >> Since this tool is triggered only once per table, it should not be too
>> >> much of a burden on end users. Keeping global dictionary generation out
>> >> of the way of regular data loading is the key here.
>> >>
>> >> Jihong
>> >>
>> >> -----Original Message-----
>> >> From: Liang Chen [mailto:chenliang6136@]
>> >> Sent: Thursday, October 13, 2016 5:39 PM
>> >> To: [email protected]
>> >> Subject: RE: Discussion(New feature) regarding single pass data loading solution.
>> >>
>> >> Hi Jihong,
>> >>
>> >> I am not sure users will accept using an extra tool for this work.
>> >> From the user's perspective, running a tool and doing a first-time scan
>> >> per table to build most of the global dictionary cost the same, and
>> >> maintaining the dictionary file costs the same too. Users always expect
>> >> the system to generate the dictionary file automatically and internally
>> >> while loading data.
>> >>
>> >> Can we consider this:
>> >> First load: scan to generate most of the global dictionary file, then
>> >> copy this file to each load node for subsequent loads.
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >>
>> >> Jihong Ma wrote
>> >>>>>>> the question is what would be the default implementation? Load
>> >>>>>>> data without dictionary?
>> >>>
>> >>> My thought is that we can provide a tool to generate the global
>> >>> dictionary from a sample data set, so the initial global dictionaries
>> >>> are available before normal data loading. We can then encode based on
>> >>> them and only need to handle occasionally adding entries while
>> >>> loading. For columns specified with global dictionary encoding whose
>> >>> dictionary is not in place before data loading, we error out and
>> >>> direct the user to run the tool first.
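
A rough sketch of the behaviour described above, under stated assumptions: the
function name, the exception and the layout of `dictionaries` (column name ->
{value: key}) are all hypothetical and only illustrate the flow of "use the
prepared dictionary, add the occasional new entry, error out if it is missing".

```python
class MissingDictionaryError(Exception):
    """Raised when a global-dictionary column has no dictionary yet."""


def encode_with_prebuilt(column, values, dictionaries):
    """Encode `values` using a dictionary prepared before loading."""
    if column not in dictionaries:
        # No dictionary in place: stop and point the user at the tool.
        raise MissingDictionaryError(
            f"no global dictionary for column '{column}'; "
            "run the dictionary generation tool first"
        )
    mapping = dictionaries[column]
    encoded = []
    for value in values:
        if value not in mapping:
            # Occasional new entry discovered while loading.
            mapping[value] = max(mapping.values(), default=0) + 1
        encoded.append(mapping[value])
    return encoded


# Dictionary as it might look after the tool ran on a sample data set.
dictionaries = {"country": {"cn": 1, "us": 2}}
country_codes = encode_with_prebuilt("country", ["us", "in", "cn"], dictionaries)
print(country_codes)  # [2, 3, 1]
```

Calling encode_with_prebuilt for a column that has no entry in `dictionaries`
raises MissingDictionaryError instead of silently loading without a dictionary.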
>> >>>
>> >>> Make sense?
>> >>>
>> >>> Jihong
>> >>>
>> >>> -----Original Message-----
>> >>> From: Ravindra Pesala [mailto:ravi.pesala@]
>> >>> Sent: Thursday, October 13, 2016 1:12 AM
>> >>> To: dev
>> >>> Subject: Re: Discussion(New feature) regarding single pass data loading solution.
>> >>>
>> >>> Hi Jihong/Aniket,
>> >>>
>> >>> In the current implementation of CarbonData we already handle an
>> >>> external dictionary while loading the data.
>> >>> But the question here is: what would be the default implementation?
>> >>> Load data without a dictionary?
>> >>>
>> >>>
>> >>> Regards,
>> >>> Ravi
>> >>>
>> >>> On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
>> >>>
>> >>>> Hi Ravi,
>> >>>>
>> >>>> 1. I agree with Jihong that creation of the global dictionary should
>> >>>> be optional, so that it can be disabled to improve load performance.
>> >>>> Users should be made aware that using a global dictionary may boost
>> >>>> query performance.
>> >>>> 2. We should have a generic interface to manage the global dictionary
>> >>>> when it comes from external sources. In general, it is not a good idea
>> >>>> to depend on too many external tools.
>> >>>> 3. Maybe we should allow the user to generate the global dictionary
>> >>>> separately through a SQL command or similar, something like a
>> >>>> materialized view. This means Carbon should avoid using the local
>> >>>> dictionary and do late materialization when a global dictionary is
>> >>>> present.
>> >>>> 4. Maybe we should think of ways to create the global dictionary
>> >>>> lazily as we serve SELECT queries. The implementation may not be that
>> >>>> straightforward. Not sure if it's worth the effort.
>> >>>>
>> >>>> Best Regards,
>> >>>> Aniket
>> >>>>
>> >>>>
>> >>>> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
>> >>>>
>> >>>>>
>> >>>>> A rather straightforward option is to allow the user to supply a
>> >>>>> global dictionary generated somewhere else, or we build a separate
>> >>>>> tool just for generating as well as updating the dictionary. The
>> >>>>> normal data loading process will then encode columns with a local
>> >>>>> dictionary if a global one is not supplied. This should cover the
>> >>>>> majority of cases for low-to-medium cardinality columns. For the
>> >>>>> cases where we have to incorporate online dictionary updates, a lock
>> >>>>> mechanism for synchronization should serve the purpose.
>> >>>>>
>> >>>>> In other words, generating the global dictionary is an optional
>> >>>>> step, triggered only when needed, not a default step as it is
>> >>>>> currently.
>> >>>>>
>> >>>>> Jihong
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Ravindra Pesala [mailto:ravi.pesala@]
>> >>>>> Sent: Tuesday, October 11, 2016 2:33 AM
>> >>>>> To: dev
>> >>>>> Subject: Discussion(New feature) regarding single pass data loading solution.
>> >>>>>
>> >>>>> Hi All,
>> >>>>>
>> >>>>> This discussion is regarding a single-pass data load solution.
>> >>>>>
>> >>>>> Currently data is loaded into Carbon in 2 passes/jobs:
>> >>>>> 1. Generate the global dictionary using a Spark job.
>> >>>>> 2. Encode the data with dictionary values and create CarbonData
>> >>>>> files.
>> >>>>> This 2-pass solution has many disadvantages: it needs to read the
>> >>>>> data twice when the input is CSV files, or it needs to execute the
>> >>>>> DataFrame twice when data is loaded from a DataFrame.
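
The cost difference between the two approaches can be illustrated with a toy
sketch. It is not how the Spark jobs are implemented; read_rows is a
hypothetical stand-in for scanning a CSV file or executing a DataFrame, and the
scan counter only makes the number of reads visible.

```python
def read_rows(source, stats):
    """Stand-in for scanning a CSV file or executing a DataFrame."""
    stats["scans"] += 1
    yield from source


def two_pass_load(source):
    stats = {"scans": 0}
    # Pass 1: scan the input to build the global dictionary.
    dictionary = {}
    for value in read_rows(source, stats):
        dictionary.setdefault(value, len(dictionary) + 1)
    # Pass 2: scan the same input again to encode it.
    encoded = [dictionary[value] for value in read_rows(source, stats)]
    return encoded, stats["scans"]


def single_pass_load(source):
    stats = {"scans": 0}
    dictionary = {}
    # Keys are assigned while encoding, so the input is scanned once.
    encoded = [
        dictionary.setdefault(value, len(dictionary) + 1)
        for value in read_rows(source, stats)
    ]
    return encoded, stats["scans"]


rows = ["cn", "us", "cn", "in"]
two_pass_result = two_pass_load(rows)
single_pass_result = single_pass_load(rows)
print(two_pass_result)     # ([1, 2, 1, 3], 2)
print(single_pass_result)  # ([1, 2, 1, 3], 1)
```

In a single process the single-pass version is trivial; the difficulty discussed
in this thread is keeping the key assignment consistent when many executors
encode at the same time.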
>> >>>>>
>> >>>>> To overcome these issues with 2-pass data loading, we can have
>> >>>>> single-pass data loading. The following are the alternative
>> >>>>> solutions.
>> >>>>>
>> >>>>> Use local dictionary
>> >>>>> Use a local dictionary for each CarbonData file while loading data,
>> >>>>> but it may lead to query performance degradation and a larger memory
>> >>>>> footprint.
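
One plausible reason for the trade-off mentioned above, sketched under
assumptions: with a dictionary per file, the same value can receive a different
code in each file, so codes cannot be compared across files without decoding,
and every file carries its own copy of the mapping. The function names are
hypothetical and this is not CarbonData's file format.

```python
def encode_file_locally(values):
    """Encode one file with its own dictionary; returns (codes, dictionary)."""
    local = {}
    codes = [local.setdefault(v, len(local) + 1) for v in values]
    return codes, local


def decode(codes, dictionary):
    """Map codes back to the original values using that file's dictionary."""
    reverse = {key: value for value, key in dictionary.items()}
    return [reverse[c] for c in codes]


codes_a, dict_a = encode_file_locally(["us", "cn"])
codes_b, dict_b = encode_file_locally(["cn", "us"])

# The same value gets a different code in each file ...
same_code = dict_a["cn"] == dict_b["cn"]
print(same_code)  # False

# ... so combining the files means decoding each one with its own dictionary.
merged = decode(codes_a, dict_a) + decode(codes_b, dict_b)
print(merged)  # ['us', 'cn', 'cn', 'us']
```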
>> >>>>>
>> >>>>> Use KV store/distributed map.
>> >>>>> *HBase/Cassandra cluster : *
>> >>>>>  Dictionary data would be stored in the KV store, which generates the
>> >>>>> dictionary value if it is not already present. We all know the
>> >>>>> pros/cons of HBase, but here are a few.
>> >>>>>  Pros : These are Apache licensed.
>> >>>>>         Easy to implement storing/retrieving dictionary values.
>> >>>>>         Performance needs to be evaluated.
>> >>>>>
>> >>>>>  Cons : Need to maintain a separate cluster for the global
>> >>>>> dictionary.
>> >>>>>
>> >>>>> *Hazelcast distributed map : *
>> >>>>>  Dictionary data could be saved in Hazelcast's distributed concurrent
>> >>>>> hash map. It is an in-memory map, partitioned according to the number
>> >>>>> of nodes. We can even maintain backups using its sync/async
>> >>>>> functionality to avoid data loss when an instance goes down. We do
>> >>>>> not need to maintain a separate cluster for it, as it can run in the
>> >>>>> executor JVM itself.
>> >>>>>  Pros: It is Apache licensed.
>> >>>>>        No need to maintain a separate cluster, as instances can run in
>> >>>>> executor JVMs.
>> >>>>>        Easy to implement storing/retrieving dictionary values.
>> >>>>>        It is a pure Java implementation.
>> >>>>>        There is no master/slave concept and no single point of
>> >>>>> failure.
>> >>>>>
>> >>>>>  Cons: Performance needs to be evaluated.
>> >>>>>
>> >>>>> *Redis distributed map : *
>> >>>>>    It is also an in-memory map, but it is written in C, so we need
>> >>>>> Java client libraries to interact with Redis. A separate cluster must
>> >>>>> be maintained for it. It can also partition the data.
>> >>>>>  Pros : More feature-rich than Hazelcast.
>> >>>>>         Easy to implement storing/retrieving dictionary values.
>> >>>>>  Cons : Need to maintain a separate cluster for the global
>> >>>>> dictionary.
>> >>>>>         May not be suitable for a big data stack.
>> >>>>>         It is BSD licensed (not sure whether we can use it or not).
>> >>>>>  Performance figures found online say it is a little slower than
>> >>>>> Hazelcast.
>> >>>>>
>> >>>>> Please let me know which would be the best fit for our loading
>> >>>>> solution, and please add any other suitable solution I may have
>> >>>>> missed.
>> >>>>> --
>> >>>>> Thanks & Regards,
>> >>>>> Ravi
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Thanks & Regards,
>> >>> Ravi
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1887.html
>> >> Sent from the Apache CarbonData Mailing List archive mailing list archive
>> >> at Nabble.com.
>> >>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>>
>>
>>
>>
> 
> 
> -- 
> Thanks & Regards,
> Ravi





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1984.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.
