suyash yadav created CARBONDATA-4177:
----------------------------------------

             Summary: performence issue with Query
                 Key: CARBONDATA-4177
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-4177
             Project: CarbonData
          Issue Type: Bug
          Components: core
    Affects Versions: 2.0.1
            Reporter: suyash yadav
             Fix For: 2.0.1


Hi Team,Hi Team,
We are working on a POC using carbondata 2.0.1 and have come across parformance 
issue.Below are the details:
1.Table creation query:

======================
spark.sql("create table Flow_TS_1day_stats_16042021(start_time 
timestamp,end_time timestamp,source_ip_address string,destintion_ip_address 
string,appname string,protocol_name string,source_tos smallint,in_interface 
smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask 
smallint,destination_mask smallint, dst_tos smallint,input_pkt bigint,input_byt 
bigint,output_pkt bigint,output_byt bigint,source_port int,destination_port 
int) stored as carbondata TBLPROPERTIES 
('local_dictionary_enable'='false')").show()

TWO MVs are there on this table, Below are the queries for those MVs

:===========================================================
1. Network MV

================
spark.sql("create materialized view 
Network_Level_Agg_10min_MV_with_ip_15042021_again as select 
timeseries(end_time,'ten_minute') as end_time,source_ip_address, 
destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt)
 as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_15042021_again 
group by 
timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, 
appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as 
order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

2. Interfae MV:

==================Interface :==============
spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as 
select timeseries(end_time,'ten_minute') as end_time, 
source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt)
 as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 group 
by timeseries(end_time,'ten_minute'), 
source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface
 order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)


+*We are firing below query for fethcing data which is taking almost 10 
seconds:*+


*Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= 
'2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by 
appname,input_byt order by input_byt desc LIMIT 10*

 


The above query is only fetching 10 records but it is taking almost 10 seconds 
to complete.
Could you please review above schemas and help us to understand how can we get 
some improvement in the qury execution time. We are expectingt he response 
should be in subseconds.

Table Name : RAW Table (1 Day - 300K/Sec)#Records : 25920000000


RegardsSuyash Yadav                          



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to