+1 for this:
-------------------------
Maybe we can give it a try; after all, it will be just one interface implementation for dictionary generation. We can have multiple implementations and then decide based on which performs best.
Regards
Liang

Ravindra Pesala wrote
> Hi Jacky/Jihong,
>
> I agree that new dictionary values are fewer in the case of incremental
> data load, but that depends entirely on the user's data scenario. In some
> user scenarios there may be many new dictionary values; we cannot rule
> that out. Also, for the user's convenience we should provide a single-pass
> solution without insisting that they run an external tool first. We can
> provide the option to run an external tool first and supply a dictionary
> to improve performance.
>
> In my opinion it is better to use a professional distributed map such as
> Hazelcast than ZooKeeper + HDFS. It is lightweight and does not require
> a separate cluster; it can form the cluster within the executor JVMs.
> Maybe we can give it a try; after all, it will be just one interface
> implementation for dictionary generation. We can have multiple
> implementations and then decide based on which performs best.
>
> Regards,
> Ravi
>
> On 15 October 2016 at 10:50, Jacky Li <jacky.likun@> wrote:
>
>> Hi,
>>
>> I can offer one more approach for this discussion. Since new dictionary
>> values are rare in the case of incremental load (provided the first load
>> captures as many dictionary values as possible), synchronization should
>> be rare. So how about using a ZooKeeper + HDFS file to provide this
>> service? This is what Carbon is doing today; we can wrap ZooKeeper + HDFS
>> to provide the global dictionary interface.
>> It has the following benefits:
>> 1. Automated: no need to bother the user.
>> 2. No new dependencies: we already use ZooKeeper and HDFS.
>> 3. Performance? New dictionary values and synchronization are rare.
>>
>> What do you think?
>>
>> Regards,
>> Jacky
>>
>> > On October 15, 2016, at 2:38 AM, Jihong Ma <Jihong.Ma@> wrote:
>> >
>> > Hi Ravi,
>> >
>> > The major concern I have with generating the global dictionary from
>> > scratch in a single scan is performance. Handling an occasional update
>> > to the dictionary is far simpler and more cost-effective in terms of
>> > synchronization cost and refreshing the global/local cache copy.
>> >
>> > There is a lot to worry about with a distributed map, and leveraging
>> > a KV store is overkill if it is only for dictionary generation.
>> >
>> > Regards.
>> >
>> > Jihong
>> >
>> > -----Original Message-----
>> > From: Ravindra Pesala [mailto:ravi.pesala@]
>> > Sent: Friday, October 14, 2016 11:03 AM
>> > To: dev
>> > Subject: Re: Discussion(New feature) regarding single pass data loading solution.
>> >
>> > Hi Jihong,
>> >
>> > I agree, we can use an external tool for the first load, but for
>> > incremental loads we should have a solution to add to the global
>> > dictionary. So this solution should be enough to generate the global
>> > dictionary even if the user does not use the external tool the first
>> > time. That solution could be a distributed map or a KV store.
>> >
>> > Regards,
>> > Ravi.
>> >
>> > On 14 October 2016 at 23:12, Jihong Ma <Jihong.Ma@> wrote:
>> >
>> >> Hi Liang,
>> >>
>> >> This tool is more or less like the first load: it runs the first time
>> >> after the table is created, and any subsequent loads/incremental loads
>> >> will proceed and are capable of updating the global dictionary when
>> >> they encounter a new value. This is the easiest way of achieving a
>> >> 1-pass data loading process without too much overhead.
>> >>
>> >> Since this tool is only triggered once per table, it should not be
>> >> considered too much of a burden on end users. Keeping global dictionary
>> >> generation out of the way of regular data loading is the key here.
>> >>
>> >> Jihong
>> >>
>> >> -----Original Message-----
>> >> From: Liang Chen [mailto:chenliang6136@]
>> >> Sent: Thursday, October 13, 2016 5:39 PM
>> >> To: [email protected]
>> >> Subject: RE: Discussion(New feature) regarding single pass data loading solution.
>> >>
>> >> Hi Jihong,
>> >>
>> >> I am not sure that users will accept using an extra tool to do this
>> >> work. From the user's perspective, providing a tool and doing a scan
>> >> the first time per table to build most of the global dict cost the
>> >> same, and maintaining the dict file costs the same too. Users always
>> >> expect the system to generate the dict file automatically and
>> >> internally during data loading.
>> >>
>> >> Can we consider this:
>> >> first load: do a scan to generate most of the global dict file, then
>> >> copy this file to each load node for subsequent loading
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >>
>> >> Jihong Ma wrote
>> >>>>>>> the question is what would be the default implementation? Load
>> >>>>>>> data without dictionary?
>> >>>
>> >>> My thought is that we can provide a tool to generate the global
>> >>> dictionary using a sample data set, so the initial global dictionaries
>> >>> are available before normal data loading. We can then perform encoding
>> >>> based on that, and only need to handle occasionally adding entries
>> >>> while loading. For columns specified with global dictionary encoding
>> >>> where the dictionary is not in place before data loading, we error out
>> >>> and direct the user to use the tool first.
>> >>>
>> >>> Make sense?
>> >>>
>> >>> Jihong
>> >>>
>> >>> -----Original Message-----
>> >>> From: Ravindra Pesala [mailto:ravi.pesala@]
>> >>> Sent: Thursday, October 13, 2016 1:12 AM
>> >>> To: dev
>> >>> Subject: Re: Discussion(New feature) regarding single pass data loading solution.
>> >>>
>> >>> Hi Jihong/Aniket,
>> >>>
>> >>> In the current implementation of CarbonData we already handle an
>> >>> external dictionary while loading the data.
>> >>> But the question here is: what would be the default implementation?
>> >>> Load data without a dictionary?
>> >>>
>> >>>
>> >>> Regards,
>> >>> Ravi
>> >>>
>> >>> On 13 October 2016 at 03:50, Aniket Adnaik <aniket.adnaik@> wrote:
>> >>>
>> >>>> Hi Ravi,
>> >>>>
>> >>>> 1. I agree with Jihong that creation of the global dictionary should
>> >>>> be optional, so that it can be disabled to improve load performance.
>> >>>> The user should be made aware that using a global dictionary may
>> >>>> boost query performance.
>> >>>> 2. We should have a generic interface to manage the global dictionary
>> >>>> when it comes from external sources. In general, it is not a good
>> >>>> idea to depend on too many external tools.
>> >>>> 3. Maybe we should allow the user to generate the global dictionary
>> >>>> separately through a SQL command or similar, something like a
>> >>>> materialized view. This means Carbon should avoid using the local
>> >>>> dictionary and do late materialization when a global dictionary is
>> >>>> present.
>> >>>> 4. Maybe we should think of some ways to create the global dictionary
>> >>>> lazily as we serve SELECT queries. The implementation may not be that
>> >>>> straightforward. Not sure if it's worth the effort.
>> >>>>
>> >>>> Best Regards,
>> >>>> Aniket
>> >>>>
>> >>>>
>> >>>> On Tue, Oct 11, 2016 at 7:59 PM, Jihong Ma <Jihong.Ma@> wrote:
>> >>>>
>> >>>>>
>> >>>>> A rather straightforward option is to allow the user to supply a
>> >>>>> global dictionary generated somewhere else, or we build a separate
>> >>>>> tool just for generating as well as updating the dictionary. Then
>> >>>>> the normal data loading process will encode columns with a local
>> >>>>> dictionary if none is supplied.
>> >>>>> This should cover the majority of cases for low-to-medium
>> >>>>> cardinality columns. For the cases where we have to incorporate
>> >>>>> online dictionary updates, using a lock mechanism to sync up should
>> >>>>> serve the purpose.
>> >>>>>
>> >>>>> In other words, generating the global dictionary is an optional
>> >>>>> step, triggered only when needed, not a default step as it is
>> >>>>> currently.
>> >>>>>
>> >>>>> Jihong
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Ravindra Pesala [mailto:ravi.pesala@]
>> >>>>> Sent: Tuesday, October 11, 2016 2:33 AM
>> >>>>> To: dev
>> >>>>> Subject: Discussion(New feature) regarding single pass data loading solution.
>> >>>>>
>> >>>>> Hi All,
>> >>>>>
>> >>>>> This discussion is regarding a single-pass data load solution.
>> >>>>>
>> >>>>> Currently data is loaded into Carbon in 2 passes/jobs:
>> >>>>> 1. Generate the global dictionary using a Spark job.
>> >>>>> 2. Encode the data with dictionary values and create the carbondata
>> >>>>> files.
>> >>>>> This 2-pass solution has many disadvantages: it needs to read the
>> >>>>> data twice in the case of CSV file input, or it needs to execute the
>> >>>>> dataframe twice if the data is loaded from a dataframe.
>> >>>>>
>> >>>>> To overcome the above issues of 2-pass data loading, we can have
>> >>>>> single-pass data loading; the following are the alternative
>> >>>>> solutions.
>> >>>>>
>> >>>>> Use local dictionary
>> >>>>> Use a local dictionary for each carbondata file while loading data,
>> >>>>> but it may lead to query performance degradation and a larger memory
>> >>>>> footprint.
>> >>>>>
>> >>>>> Use KV store/distributed map.
>> >>>>> *HBase/Cassandra cluster : *
>> >>>>> Dictionary data would be stored in the KV store, which generates the
>> >>>>> dictionary value if it is not already present. We all know the
>> >>>>> pros/cons of HBase, but the following are a few.
>> >>>>> Pros : These are Apache licensed.
>> >>>>>        Easy to implement storing/retrieving dictionary values.
>> >>>>>        Performance needs to be evaluated.
>> >>>>>
>> >>>>> Cons : Need to maintain a separate cluster for maintaining the
>> >>>>>        global dictionary.
>> >>>>>
>> >>>>> *Hazelcast distributed map : *
>> >>>>> Dictionary data could be saved in Hazelcast's distributed concurrent
>> >>>>> hash map. It is an in-memory map, partitioned according to the
>> >>>>> number of nodes. We can even maintain backups using its sync/async
>> >>>>> functionality to avoid data loss when an instance is down. We do not
>> >>>>> need to maintain a separate cluster for it, as it can run in the
>> >>>>> executor JVM itself.
>> >>>>> Pros: It is Apache licensed.
>> >>>>>       No need to maintain a separate cluster, as instances can run
>> >>>>>       in executor JVMs.
>> >>>>>       Easy to implement and store/retrieve dictionary values.
>> >>>>>       It is a pure Java implementation.
>> >>>>>       There is no master/slave concept and no single point of
>> >>>>>       failure.
>> >>>>>
>> >>>>> Cons: Performance needs to be evaluated.
>> >>>>>
>> >>>>> *Redis distributed map : *
>> >>>>> It is also an in-memory map, but it is written in C, so we need
>> >>>>> Java client libraries to interact with Redis. Need to maintain a
>> >>>>> separate cluster for it. It can also partition the data.
>> >>>>> Pros : More feature-rich than Hazelcast.
>> >>>>>        Easy to implement and store/retrieve dictionary values.
>> >>>>> Cons : Need to maintain a separate cluster for maintaining the
>> >>>>>        global dictionary.
>> >>>>>        May not be suitable for a big data stack.
>> >>>>>        It is BSD licensed (not sure whether we can use it or not).
>> >>>>>        Online performance figures say it is a little slower than
>> >>>>>        Hazelcast.
>> >>>>>
>> >>>>> Please let me know which would be the best fit for our loading
>> >>>>> solution. And please add any other suitable solution if I missed
>> >>>>> one.
>> >>>>> --
>> >>>>> Thanks & Regards,
>> >>>>> Ravi
>> >>>>>
>> >>>
>> >>> --
>> >>> Thanks & Regards,
>> >>> Ravi
>> >>
>> >> --
>> >> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1887.html
>> >> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>
> --
> Thanks & Regards,
> Ravi

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-New-feature-regarding-single-pass-data-loading-solution-tp1761p1984.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
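
To make the "one interface, multiple implementations" idea from this thread concrete, here is a rough sketch in Python. It is only an illustration under assumptions, not CarbonData code: the names DictionaryGenerator, InMemoryDictionaryGenerator, get_or_generate and encode_rows are all hypothetical, and the in-process dict guarded by a lock merely stands in for whichever backend is chosen (a Hazelcast map, ZooKeeper + HDFS, or a KV store).

```python
import threading
from abc import ABC, abstractmethod


class DictionaryGenerator(ABC):
    """Hypothetical interface: maps a column value to an integer surrogate key."""

    @abstractmethod
    def get_or_generate(self, column, value):
        """Return the existing key for value, or assign and return a new one."""


class InMemoryDictionaryGenerator(DictionaryGenerator):
    """Single-process stand-in for a distributed backend (assumption for the sketch)."""

    def __init__(self):
        self._lock = threading.Lock()  # the "lock mechanism to sync up"
        self._dicts = {}               # column name -> {value: surrogate key}

    def get_or_generate(self, column, value):
        with self._lock:
            col = self._dicts.setdefault(column, {})
            if value not in col:
                # New value seen during loading: assign the next key.
                col[value] = len(col) + 1
            return col[value]


def encode_rows(rows, columns, generator):
    """Encode rows in a single pass, generating keys for unseen values on the fly."""
    return [
        [generator.get_or_generate(c, v) for c, v in zip(columns, row)]
        for row in rows
    ]


generator = InMemoryDictionaryGenerator()
columns = ["country", "grade"]
# The first load populates most of the dictionary.
first_load = encode_rows([["cn", "a"], ["us", "b"], ["cn", "b"]], columns, generator)
# An incremental load reuses existing keys and only adds the rare new values.
incremental = encode_rows([["us", "a"], ["in", "c"]], columns, generator)
print(first_load, incremental)
```

Swapping the backend would then mean writing another subclass of the same interface, which is what would allow the candidates to be compared on performance.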
