Sushant Sammanwar created CARBONDATA-4187:
---------------------------------------------

             Summary: Performance Issue with Materialized views - increased 
loading time due to full refresh
                 Key: CARBONDATA-4187
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4187
             Project: CarbonData
          Issue Type: Bug
          Components: core
    Affects Versions: 2.1.0
            Reporter: Sushant Sammanwar


Hi Team,

We have been running a POC with CarbonData 2.1.0. We wrote wrapper code around 
Carbon and deployed it as a Docker container.
Data is loaded concurrently into many tables.
Our objective is to get optimal performance for aggregate queries by using 
materialized views (MVs).
Our observation is that after creating the MVs, data loading becomes slow and 
cannot keep up with the pace of incoming data.
The process also consumes a lot of memory once the MVs are created.
Data arrives continuously and the MVs are refreshed, which results in 
increased load time.
Ideally, an MV should perform only an incremental refresh, since it does not 
need to recompute the aggregates for old data.
However, it appears that a full refresh is being performed, and this is causing 
the high memory usage and increased loading time.
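
To make the distinction between the two refresh modes concrete, the following 
is a minimal Python sketch. It is not CarbonData code and does not describe 
how CarbonData implements MV refresh; every name in it (hour_bucket, 
apply_rows, full_refresh, incremental_refresh, the sample rows) is 
hypothetical. It assumes the MV keeps sum, count, min and max per 
(metric, tags_id, hour) group, so that avg can be derived as sum / count.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour (stand-in for timeseries(ts,'hour'))."""
    return ts.replace(minute=0, second=0, microsecond=0)


def new_state():
    # count is stored so that avg can be derived as sum / count after any merge
    return {"sum": 0.0, "count": 0, "min": float("inf"), "max": float("-inf")}


def apply_rows(view, rows):
    """Fold (ts, metric, tags_id, value) rows into view; return the group keys that changed."""
    touched = set()
    for ts, metric, tags_id, value in rows:
        key = (metric, tags_id, hour_bucket(ts))
        state = view[key]
        state["sum"] += value
        state["count"] += 1
        state["min"] = min(state["min"], value)
        state["max"] = max(state["max"], value)
        touched.add(key)
    return touched


def full_refresh(all_rows):
    """Rebuild the whole view from every row ever loaded (cost grows with table size)."""
    view = defaultdict(new_state)
    apply_rows(view, all_rows)
    return view


def incremental_refresh(view, new_rows):
    """Update only the groups hit by the newly loaded rows (cost grows with batch size)."""
    return apply_rows(view, new_rows)


def average(state):
    return state["sum"] / state["count"]


# Hypothetical sample data: two rows already loaded, one new row in a later hour.
old_rows = [
    (datetime(2021, 5, 1, 10, 5), "cpu", "t1", 1.0),
    (datetime(2021, 5, 1, 10, 40), "cpu", "t1", 3.0),
]
new_rows = [(datetime(2021, 5, 1, 11, 2), "cpu", "t1", 5.0)]

view = full_refresh(old_rows)
touched = incremental_refresh(view, new_rows)
```

In this sketch the incremental path touches only the 11:00 group, while a 
full refresh over old_rows + new_rows rescans every row and arrives at the 
same view.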

Testing involved loading data without MVs for 6 hours, then creating the MVs 
and loading data again for 4 hours.
With the MVs in place, loading time increased, creating a backlog of data (only 
1/5 of the expected number of rows was loaded).

Below are the major bottlenecks observed:
1. High memory consumption after creating the MVs
2. MVs performing a full refresh

Please find attached the testing details, including the list of tables.
Below is the table definition:

create table if not exists fact_365_1_eutrancell_1
  (ts timestamp, metric STRING, tags_id STRING, value DOUBLE, epoch bigint)
  partitioned by (ts2 timestamp)
  STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='metric')

Below is the MV definition:

create materialized view if not exists fact_365_1_eutrancell_1_hour as
  select tags_id, metric, timeseries(ts,'hour') as ts,
         sum(value), avg(value), min(value), max(value)
  from fact_365_1_eutrancell_1
  group by metric, tags_id, timeseries(ts,'hour')

Can you suggest why creating the MV slows down ingestion so much, and what can 
be done to improve it?
Is there any way to refresh the MV incrementally, i.e. refresh only the hour 
for which we are loading data?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
