[ https://issues.apache.org/jira/browse/DRILL-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061360#comment-16061360 ]

Rahul Challapalli commented on DRILL-5604:
------------------------------------------

The physical plan for both queries looks identical (this one is from the {{ss_list_price}} query):
{code}
00-00    Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078689999E9 rows, 4.6137407764079994E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163643
00-01      Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163642
00-02        StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163641
00-03          UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759077589998E9 rows, 4.6137407751979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163640
01-01            StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759076589998E9 rows, 4.6137407743979996E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163639
01-02              HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E7, cumulative cost = {9.791959196599998E9 rows, 4.57918091841E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163638
01-03                Project(ss_list_price=[$0]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163637
01-04                  HashToRandomExchange(dist0=[[$0]]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163636
02-01                    UnorderedMuxExchange : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.2159615968E9 rows, 3.88798379865E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163635
03-01                      Project(ss_list_price=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {8.9279627969E9 rows, 3.85918391866E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163634
03-02                        HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {8.639963997E9 rows, 3.7439843987E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163633
03-03                          Project(ss_list_price=[CAST($0):DOUBLE]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E9, cumulative cost = {5.759975998E9 rows, 1.4399939995E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163632
03-04                            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/tpcds/parquet/sf1000/store_sales]], selectionRoot=maprfs:/drill/testdata/tpcds/parquet/sf1000/store_sales, numFiles=1, usedMetadataFile=false, columns=[`ss_list_price`]]]) : rowType = RecordType(ANY ss_list_price): rowcount = 2.879987999E9, cumulative cost = {2.879987999E9 rows, 2.879987999E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163631
{code}
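
One way to isolate the hash aggregate's contribution would be to force the sort-based StreamAgg and re-time the slower query. A minimal A/B sketch using the {{planner.enable_hashagg}} session option (a suggested experiment, not something I have run on this cluster):
{code}
-- force StreamAgg (sort-based) instead of HashAgg for an A/B comparison
alter session set `planner.enable_hashagg` = false;
select count(distinct ss_net_profit) from store_sales;
-- restore the default planner behavior
alter session reset `planner.enable_hashagg`;
{code}
If the StreamAgg plan degrades far less as the distinct-key count grows, that points at the hash table (resizing/rehashing) rather than the scan or exchanges.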

> Possible performance degradation with hash aggregate when number of distinct 
> keys increase
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5604
>                 URL: https://issues.apache.org/jira/browse/DRILL-5604
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Rahul Challapalli
>
> git.commit.id.abbrev=90f43bf
> I tried to track the runtime as we gradually increase the number of distinct 
> keys without increasing the total number of records. Below is one such test 
> on top of the TPC-DS SF1000 dataset:
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) 
> from store_sales;
> +---------+
> | EXPR$0  |
> +---------+
> | 19736   |
> +---------+
> 1 row selected (163.345 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) 
> from store_sales;
> +----------+
> |  EXPR$0  |
> +----------+
> | 1525675  |
> +----------+
> 1 row selected (2094.962 seconds)
> {code}
> In both of the above queries, the hash aggregate code processed 2879987999 
> records, so the time difference must come from overheads such as hash table 
> resizing. The second query took ~30 minutes longer than the first, which 
> raises doubts about whether there is an issue somewhere.
> The dataset is too large to attach to a JIRA, and so are the logs.
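
If hash table resizing is indeed the dominant overhead, one hedged experiment (not a fix) is to start the hash table closer to its final size, so fewer doubling/rehash passes happen while the ~1.5M distinct {{ss_net_profit}} values are inserted. A minimal sketch using Drill's {{exec.min_hash_table_size}} option; the value 1048576 is an illustrative guess, not a tuned setting:
{code}
-- start each HashAgg hash table at 2^20 buckets instead of the default 65536,
-- reducing the number of resize/rehash passes for ~1.5M distinct keys
alter session set `exec.min_hash_table_size` = 1048576;
select count(distinct ss_net_profit) from store_sales;
alter session reset `exec.min_hash_table_size`;
{code}
If the runtime gap shrinks noticeably, that supports the resizing theory; if not, the overhead is likely elsewhere (e.g. hashing cost or memory pressure).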


