[
https://issues.apache.org/jira/browse/DRILL-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061360#comment-16061360
]
Rahul Challapalli commented on DRILL-5604:
------------------------------------------
The physical plan for both queries is identical:
{code}
00-00 Screen : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078689999E9 rows, 4.6137407764079994E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163643
00-01   Project(EXPR$0=[$0]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163642
00-02     StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759078589998E9 rows, 4.6137407763979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163641
00-03       UnionExchange : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759077589998E9 rows, 4.6137407751979996E10 cpu, 0.0 io, 2.3592861728768E12 network, 5.575656766064E10 memory}, id = 163640
01-01         StreamAgg(group=[{}], EXPR$0=[COUNT($0)]) : rowType = RecordType(BIGINT EXPR$0): rowcount = 1.0, cumulative cost = {9.820759076589998E9 rows, 4.6137407743979996E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163639
01-02           HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E7, cumulative cost = {9.791959196599998E9 rows, 4.57918091841E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.575656766064E10 memory}, id = 163638
01-03             Project(ss_list_price=[$0]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163637
01-04               HashToRandomExchange(dist0=[[$0]]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.503960396699999E9 rows, 4.34878187849E10 cpu, 0.0 io, 2.3592861687808E12 network, 5.06877887824E10 memory}, id = 163636
02-01                 UnorderedMuxExchange : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {9.2159615968E9 rows, 3.88798379865E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163635
03-01                   Project(ss_list_price=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) : rowType = RecordType(DOUBLE ss_list_price, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 2.879987999E8, cumulative cost = {8.9279627969E9 rows, 3.85918391866E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163634
03-02                     HashAgg(group=[{0}]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E8, cumulative cost = {8.639963997E9 rows, 3.7439843987E10 cpu, 0.0 io, 0.0 network, 5.06877887824E10 memory}, id = 163633
03-03                       Project(ss_list_price=[CAST($0):DOUBLE]) : rowType = RecordType(DOUBLE ss_list_price): rowcount = 2.879987999E9, cumulative cost = {5.759975998E9 rows, 1.4399939995E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163632
03-04                         Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/tpcds/parquet/sf1000/store_sales]], selectionRoot=maprfs:/drill/testdata/tpcds/parquet/sf1000/store_sales, numFiles=1, usedMetadataFile=false, columns=[`ss_list_price`]]]) : rowType = RecordType(ANY ss_list_price): rowcount = 2.879987999E9, cumulative cost = {2.879987999E9 rows, 2.879987999E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 163631
{code}
> Possible performance degradation with hash aggregate when number of distinct
> keys increase
> ------------------------------------------------------------------------------------------
>
> Key: DRILL-5604
> URL: https://issues.apache.org/jira/browse/DRILL-5604
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.11.0
> Reporter: Rahul Challapalli
>
> git.commit.id.abbrev=90f43bf
> I tried to track the runtime as we gradually increase the number of distinct
> keys without increasing the total number of records. Below is one such test
> on top of the tpcds sf1000 dataset:
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price)
> from store_sales;
> +---------+
> | EXPR$0 |
> +---------+
> | 19736 |
> +---------+
> 1 row selected (163.345 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit)
> from store_sales;
> +----------+
> | EXPR$0 |
> +----------+
> | 1525675 |
> +----------+
> 1 row selected (2094.962 seconds)
> {code}
> In both of the above queries, the hash aggregate code processed 2879987999
> records, so the time difference comes from overheads such as hash table
> resizing. The second query took ~30 minutes longer than the first, raising
> doubts about whether there is an issue somewhere.
> The dataset is too large to attach to a JIRA, and so are the logs.
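The resizing overhead suggested above can be sketched independently of Drill: a hash table that doubles its capacity whenever the load factor is exceeded must rehash every existing entry on each resize, so the extra work grows with distinct-key cardinality even when the total record count is fixed. A minimal sketch (this is not Drill's HashAgg implementation; the initial capacity, doubling policy, and 0.75 load factor are illustrative assumptions):

```python
def rehash_work(num_distinct, initial_capacity=1 << 16, load_factor=0.75):
    """Simulate inserting num_distinct unique keys into a doubling hash table.

    Returns (resize_count, entries_moved): how many resizes occur and how
    many existing entries get rehashed across all of them.
    """
    capacity = initial_capacity
    size = 0
    resizes = 0
    moved = 0
    for _ in range(num_distinct):
        size += 1
        if size > capacity * load_factor:
            moved += size      # every live entry is rehashed into the new table
            capacity *= 2
            resizes += 1
    return resizes, moved

# Cardinalities from the two queries in this report:
for n in (19_736, 1_525_675):
    print(n, rehash_work(n))
```

Under these assumptions the 19736-key query never outgrows the initial table, while the 1525675-key query pays several full rehash passes; that alone does not account for a ~30-minute gap over ~2.9 billion input records, which is why the report suspects an issue beyond ordinary resizing cost.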
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)