Hi Jacky,

One question: what information would the proposed CarbonData Storage
Service store? And how should users pre-configure memory resources for
the service? Should they give it as much memory as possible, given that,
as you wrote, "CarbonData requires its own memory cache"?

Regards
Liang



2017-05-14 0:19 GMT-04:00 Jacky Li <jacky.li...@qq.com>:

> Hi community,
>
> The partition feature was proposed by Cao Lu in the thread
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Implement-Partition-Table-Feature-td10938.html#a11321
> and the implementation effort is ongoing.
>
> After partition is implemented, point queries using sort columns are
> expected to be faster than the current B-Tree index approach. To
> further boost performance and achieve higher concurrency, I want to
> discuss providing a service for CarbonData.
>
> Following is the proposal:
>
> CarbonData Storage Service
> At the moment, the CarbonData project mainly defines a columnar format
> with index support. CarbonData files are read and written inside a
> processing framework (such as a Spark executor). This is efficient for
> OLAP/data-warehouse workloads, but it adds overhead for simple queries
> such as point queries: in Spark, DAG breakdown, task scheduling, and
> task serialization/deserialization are unavoidable. Furthermore,
> executor memory is meant to be controlled by Spark core, while
> CarbonData requires its own memory cache.
>
> To improve on this, I suggest adding a Storage Service to the
> CarbonData project. The main goal of this service is to serve point
> queries and manage carbon data storage.
>
> 1. Deployment
> This service can be embedded in a processing framework (Spark
> executor) as today, or deployed as a new self-managed process on each
> HDFS data node. For the latter approach, we can implement a YARN
> application to manage these processes.
>
> 2. Communication
> A service client will communicate with the service. One simple
> approach is to reuse the netty RPC framework we already have for
> dictionary generation in single-pass loading. We need to add
> configuration for the RPC ports of this service.
>
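[Editor's note: as a sketch of the request/response exchange described above. The framing (4-byte length prefix plus a JSON payload) and all field names are illustrative assumptions; the actual service would reuse CarbonData's existing JVM-based netty RPC framework.]

```python
import json
import struct

# Hypothetical wire framing for the point-query RPC: a 4-byte big-endian
# length prefix followed by a JSON payload. Field names are illustrative
# only; the real service would reuse CarbonData's netty RPC framework.

def encode_frame(msg: dict) -> bytes:
    payload = json.dumps(msg).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload

def decode_frame(buf: bytes) -> dict:
    (length,) = struct.unpack(">I", buf[:4])
    return json.loads(buf[4:4 + length].decode("utf-8"))

# A point-query request carrying the PARTITION_COLUMN and SORT_COLUMN
# values; the response would carry the matched rows back the same way.
request = {
    "op": "point_query",
    "table": "sales",
    "partition_value": "2017-05",  # value of the PARTITION_COLUMN
    "sort_value": "order_42",      # value of the SORT_COLUMN
}
assert decode_frame(encode_frame(request)) == request
```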
> 3. Functionality
> I can think of a few functionalities that this service can provide;
> please suggest more.
>         1) Serving point query
>         The query filter consists of PARTITION_COLUMN and SORT_COLUMN.
> The client sends an RPC request to the service; the service opens the
> requested file, locates the offset by SORT_COLUMN, and starts
> scanning. Reading CarbonData files remains unchanged, using the
> current CarbonData RecordReader. Once the result data is collected, it
> is returned to the client in the RPC response.
>         By optimizing client- and service-side handling and the RPC
> payload, this should be more efficient than a Spark task.
>
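[Editor's note: a minimal sketch of the offset-location step described in 1). The rows are synthetic and the in-memory binary search stands in for the real index lookup; actual reads would go through the CarbonData RecordReader unchanged.]

```python
import bisect

# The data is sorted on SORT_COLUMN, so a binary search yields the start
# offset, and the scan stops at the first non-matching key.
rows = [("k01", "a"), ("k03", "b"), ("k03", "c"), ("k07", "d")]  # sorted by key
keys = [k for k, _ in rows]

def point_query(sort_key):
    start = bisect.bisect_left(keys, sort_key)  # locate offset by SORT_COLUMN
    result = []
    for k, v in rows[start:]:                   # sequential scan from the offset
        if k != sort_key:
            break
        result.append(v)
    return result

assert point_query("k03") == ["b", "c"]
assert point_query("k05") == []
```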
>         2) Cache management
>         Currently, CarbonData caches the file-level index in the Spark
> executor. This is undesirable, especially when dynamic allocation is
> enabled in Spark. With this Storage Service, CarbonData can manage
> this cache inside its own memory space. Besides the index cache, we
> can also consider caching hot blocks/blocklets to further reduce IO
> and latency.
>
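[Editor's note: a minimal LRU cache of the kind 2) describes, usable for index entries and hot blocks/blocklets. Capacity is counted in entries here for brevity; a real implementation inside the service's own memory space would bound it by bytes.]

```python
from collections import OrderedDict

class LruCache:
    """Least-recently-used cache; keys might be index or blocklet ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LruCache(2)
cache.put("index:part0", "btree-bytes-0")
cache.put("index:part1", "btree-bytes-1")
cache.get("index:part0")                      # touch part0; part1 is now LRU
cache.put("index:part2", "btree-bytes-2")
assert cache.get("index:part1") is None       # part1 was evicted
assert cache.get("index:part0") == "btree-bytes-0"
```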
>         3) Compaction management
>         The SORT_COLUMN keyword is planned for CarbonData 1.2, and
> users can use it to force NO SORT on a table to make loading faster;
> there is also a BATCH_SORT option. With this service, we can implement
> a policy in the service that triggers compaction to do larger-scope
> sorting than the initial load did.
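[Editor's note: one possible shape for the trigger policy in 3). The segment states and the count-based threshold are assumptions for illustration; the proposal does not specify the actual policy.]

```python
# Segments loaded with NO_SORT or BATCH_SORT accumulate; once their
# count crosses a threshold, the service schedules a compaction that
# sorts a larger scope than each initial load did.
UNSORTED_SCOPES = {"NO_SORT", "BATCH_SORT"}

def should_compact(segments, threshold=4):
    """Return True when enough unsorted segments have accumulated."""
    unsorted = [s for s in segments if s["sort_scope"] in UNSORTED_SCOPES]
    return len(unsorted) >= threshold

segments = [
    {"id": 0, "sort_scope": "NO_SORT"},
    {"id": 1, "sort_scope": "BATCH_SORT"},
    {"id": 2, "sort_scope": "NO_SORT"},
    {"id": 3, "sort_scope": "LOCAL_SORT"},
    {"id": 4, "sort_scope": "NO_SORT"},
]
assert should_compact(segments, threshold=4)
assert not should_compact(segments, threshold=5)
```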
>
> We may identify and add more functionality in this service in the future.
>
> What do you think about this idea?
>
> Regards,
> Jacky
>
>
>




-- 
Regards
Liang
