Yes Jacky, the interfaces need to be revisited. For Goal 1 and Goal 2, an abstraction is required for both the Index and the Index store. Multi-column indexes (composite indexes) also need to be considered.
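To make the split between the Index and the Index store more concrete, here is a rough sketch in Java. Every name in it (CompositeKey, BlockEntry, IndexStore, InMemoryIndexStore, Index, MinMaxIndex) is hypothetical and is not an existing CarbonData class, and a simple min/max check stands in for the real B+ tree, so please read it only as a starting point for discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in CarbonData.

/** Index key over one or more columns; a single-column key is the one-element case. */
final class CompositeKey implements Comparable<CompositeKey> {
    private final String[] parts;

    CompositeKey(String... parts) {
        this.parts = parts.clone();
    }

    /** Compares column by column, left to right; a key that is a strict prefix sorts first. */
    @Override
    public int compareTo(CompositeKey other) {
        int n = Math.min(parts.length, other.parts.length);
        for (int i = 0; i < n; i++) {
            int c = parts[i].compareTo(other.parts[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(parts.length, other.parts.length);
    }
}

/** Min/max key range of one block (assumed summary; the real index is a B+ tree). */
final class BlockEntry {
    final String blockId;
    final CompositeKey min;
    final CompositeKey max;

    BlockEntry(String blockId, CompositeKey min, CompositeKey max) {
        this.blockId = blockId;
        this.min = min;
        this.max = max;
    }

    boolean mayContain(CompositeKey key) {
        return min.compareTo(key) <= 0 && key.compareTo(max) <= 0;
    }
}

/** Goal 1: where the index data lives is hidden behind this interface. */
interface IndexStore {
    void put(String segmentId, List<BlockEntry> entries);

    List<BlockEntry> get(String segmentId);
}

/** One possible store: the processing framework's own heap (e.g. driver memory). */
final class InMemoryIndexStore implements IndexStore {
    private final Map<String, List<BlockEntry>> bySegment = new HashMap<>();

    @Override
    public void put(String segmentId, List<BlockEntry> entries) {
        bySegment.put(segmentId, new ArrayList<>(entries));
    }

    @Override
    public List<BlockEntry> get(String segmentId) {
        return bySegment.getOrDefault(segmentId, new ArrayList<>());
    }
}

/** Goal 2: an index is any strategy that prunes the blocks of a segment for a key. */
interface Index {
    List<String> prune(String segmentId, CompositeKey key);
}

/** Example index that reads its data through whichever IndexStore it is given. */
final class MinMaxIndex implements Index {
    private final IndexStore store;

    MinMaxIndex(IndexStore store) {
        this.store = store;
    }

    @Override
    public List<String> prune(String segmentId, CompositeKey key) {
        List<String> hits = new ArrayList<>();
        for (BlockEntry entry : store.get(segmentId)) {
            if (entry.mayContain(key)) {
                hits.add(entry.blockId);
            }
        }
        return hits;
    }
}
```

With this shape, a store backed by an external service would only need another IndexStore implementation, and MinMaxIndex would not have to change.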
Regards,
Ramana

On Sat, Oct 1, 2016 at 11:01 AM, Jacky Li <[email protected]> wrote:

> Hi community,
>
> Currently CarbonData has built-in index support, which is one of its key
> strengths. Using the index, CarbonData can run filter queries very fast by
> filtering at the block and blocklet level. However, the index also
> consumes memory for the index tree and slows down the first query, because
> the index has to be loaded from the file footer into memory. In addition,
> in a multi-tenant environment, multiple applications may access data files
> simultaneously, which exacerbates this resource consumption issue.
> So, I want to propose and discuss a solution with you to solve this
> problem and to abstract the interface for CarbonData's future evolution.
> I think the final result of this work should achieve at least two goals:
>
> Goal 1: Users can choose where to store index data. It can be stored in
> the processing framework's memory space (such as Spark driver memory) or
> in another service outside of the processing framework (such as an
> independent database service).
>
> Goal 2: Developers can add more indices of their choice to CarbonData
> files. Besides the B+ tree on the multi-dimensional key, which CarbonData
> currently supports, developers are free to add other indexing technologies
> to make certain workloads faster. These new indices should be added in a
> pluggable way.
>
> To achieve these goals, an abstraction needs to be created for the
> CarbonData project, including:
>
> - Segment: each segment represents one load of data and is tied to the
>   indices created with this load.
>
> - Index: an index is created when its segment is created, and is used
>   when CarbonInputFormat's getSplit is called, to filter out the required
>   blocks or even blocklets.
>
> - CarbonInputFormat: there may be any number of indices created for a data
>   file. When querying these data files, the InputFormat should know how to
>   access these indices, and initialize or load them if required.
>
> Obviously, this work should be separated into different tasks and
> implemented gradually. But first of all, let's discuss the goal and the
> proposed approach. What do you think?
>
> Regards,
> Jacky
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Abstracting-CarbonData-s-Index-Interface-tp1587.html
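As a rough illustration of how the three pieces in the quoted proposal (Segment, Index, CarbonInputFormat) might fit together at split-planning time, here is a small Java sketch. The names BlockPruner, LazyPruner, Segment and SplitPlanner are made up for this example and are not CarbonData classes. It assumes each index returns the set of candidate block ids for a filter value, that the planner keeps only blocks every index agrees on, and that an index is loaded on first use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: these types are illustrative and are not CarbonData classes.

/** A pluggable index: given a filter value, returns the block ids that may match. */
interface BlockPruner {
    Set<String> candidateBlocks(String filterValue);
}

/** Wraps an index so that it is loaded only when first queried (not thread-safe). */
final class LazyPruner implements BlockPruner {
    private final Supplier<BlockPruner> loader;
    private BlockPruner loaded; // stays null until the first query
    int loadCount = 0;          // kept only to make the lazy loading observable

    LazyPruner(Supplier<BlockPruner> loader) {
        this.loader = loader;
    }

    @Override
    public Set<String> candidateBlocks(String filterValue) {
        if (loaded == null) {
            loaded = loader.get();
            loadCount++;
        }
        return loaded.candidateBlocks(filterValue);
    }
}

/** One data load, tied to its blocks and to the indices built with that load. */
final class Segment {
    final String id;
    final List<String> blockIds;
    final List<BlockPruner> indices;

    Segment(String id, List<String> blockIds, List<? extends BlockPruner> indices) {
        this.id = id;
        this.blockIds = new ArrayList<>(blockIds);
        this.indices = new ArrayList<>(indices);
    }
}

/** Stand-in for the pruning step of an input format's split planning. */
final class SplitPlanner {
    /** Keeps only the blocks that every index of the segment still considers a candidate. */
    static List<String> plan(Segment segment, String filterValue) {
        Set<String> surviving = new LinkedHashSet<>(segment.blockIds);
        for (BlockPruner index : segment.indices) {
            surviving.retainAll(index.candidateBlocks(filterValue));
        }
        return new ArrayList<>(surviving);
    }
}
```

Intersecting the candidate sets is only one possible policy; an implementation could also consult just the cheapest index, or skip indices that are not yet loaded.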
