Abstracting CarbonData's Index Interface

Jacky Li Fri, 30 Sep 2016 22:31:27 -0700

Hi community,

    Currently CarbonData has built-in index support, which is one of its key
strengths. Using the index, CarbonData can run filter queries very fast by
pruning at the block and blocklet level. However, the index tree also
consumes memory, and it slows down the first query because the index must be
loaded from the file footer into memory. In addition, in a multi-tenant
environment, multiple applications may access the data files simultaneously,
which exacerbates this resource consumption issue.
    So, I want to propose and discuss a solution to this problem with you,
and to define an interface abstraction for CarbonData's future evolution.
    I am thinking the final result of this work should achieve at least two
goals:
    
Goal 1: Users can choose where to store index data. It can be stored in the
processing framework's memory space (for example, in Spark driver memory) or
in another service outside the processing framework (for example, an
independent database service).


Goal 2: Developers can add more indices of their choice to CarbonData files.
Besides the B+ tree on the multi-dimensional key, which CarbonData currently
supports, developers are free to add other indexing technologies to make
certain workloads faster. These new indices should be added in a pluggable
way.
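
To make Goal 1 concrete, here is a minimal Java sketch of what a pluggable index store could look like. All names in it (IndexStore, InMemoryIndexStore, IndexStoreRegistry, the "memory" setting) are hypothetical and not part of CarbonData; it only illustrates the idea of picking the storage location by configuration. The same registration pattern might also serve Goal 2 for index types.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical contract: where the serialized index of a segment is kept.
interface IndexStore {
    void put(String segmentId, byte[] indexData);
    Optional<byte[]> get(String segmentId);
    void evict(String segmentId);
}

// Keeps index data inside the processing framework's own JVM memory.
final class InMemoryIndexStore implements IndexStore {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    @Override
    public void put(String segmentId, byte[] indexData) {
        cache.put(segmentId, indexData.clone()); // defensive copy
    }

    @Override
    public Optional<byte[]> get(String segmentId) {
        // Empty when the segment's index has not been loaded or was evicted.
        return Optional.ofNullable(cache.get(segmentId)).map(d -> d.clone());
    }

    @Override
    public void evict(String segmentId) {
        cache.remove(segmentId);
    }
}

// Maps a user-facing setting (assumed here to be a plain string) to a store.
final class IndexStoreRegistry {
    private final Map<String, Supplier<IndexStore>> factories = new ConcurrentHashMap<>();

    IndexStoreRegistry() {
        register("memory", InMemoryIndexStore::new);
    }

    // A store backed by an external service would be registered under its own name.
    void register(String name, Supplier<IndexStore> factory) {
        factories.put(name, factory);
    }

    IndexStore create(String name) {
        Supplier<IndexStore> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown index store: " + name);
        }
        return factory.get();
    }
}
```

A caller would obtain a store with new IndexStoreRegistry().create("memory"). A store backed by an independent database service would implement the same three methods and be registered under another name, so the query path would not need to change.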

    In order to achieve these goals, some abstractions need to be created in
the CarbonData project, including:

- Segment: each segment represents one load of data and is tied to the
indices created with that load.

- Index: an index is created when its segment is created, and is used when
CarbonInputFormat's getSplits is called, to prune the scan down to the
required blocks or even blocklets.

- CarbonInputFormat: there may be any number of indices created for a data
file. When querying these data files, the InputFormat should know how to
access these indices, and how to initialize or load them if required.
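
As a rough illustration of how these three abstractions could fit together, here is a small Java sketch. It is only a sketch under assumptions: Index, MinMaxIndex, Segment and SplitPlanner are hypothetical names, the filter is reduced to an equality test on a single long key, and blocks are identified by plain strings. SplitPlanner.getSplits stands in for the pruning step that CarbonInputFormat would perform; it is not the Hadoop method signature.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical contract for any pluggable index over the blocks of a segment.
interface Index {
    // Returns the ids of blocks that may contain rows whose key equals value.
    Set<String> prune(long value);
}

// One possible index type: the min/max of the key column for each block.
final class MinMaxIndex implements Index {
    private final Map<String, long[]> ranges = new LinkedHashMap<>();

    // Called while the segment is loaded, once per block.
    void addBlock(String blockId, long min, long max) {
        ranges.put(blockId, new long[] {min, max});
    }

    @Override
    public Set<String> prune(long value) {
        Set<String> hits = new LinkedHashSet<>();
        for (Map.Entry<String, long[]> entry : ranges.entrySet()) {
            long[] range = entry.getValue();
            if (range[0] <= value && value <= range[1]) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}

// One load of data: its blocks plus the indices built during that load.
final class Segment {
    final List<String> blockIds;
    final List<Index> indices;

    Segment(List<String> blockIds, List<Index> indices) {
        this.blockIds = blockIds;
        this.indices = indices;
    }
}

final class SplitPlanner {
    // Keeps only the blocks that every index of the segment accepts.
    // A segment without indices keeps all of its blocks.
    static List<String> getSplits(List<Segment> segments, long value) {
        List<String> splits = new ArrayList<>();
        for (Segment segment : segments) {
            Set<String> candidates = new LinkedHashSet<>(segment.blockIds);
            for (Index index : segment.indices) {
                candidates.retainAll(index.prune(value));
            }
            splits.addAll(candidates);
        }
        return splits;
    }
}
```

Because the planner only depends on the Index interface, a new index type could be added without touching the split logic, which is the pluggability that Goal 2 asks for.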

    Obviously, this work should be separated into different tasks and
implemented gradually. But first of all, let's discuss the goals and the
proposed approach. What do you think?
 
Regards,
Jacky




