[jira] [Commented] (DRILL-5138) TopN operator on top of ~110 GB data set is very slow

Timothy Farkas (JIRA) Thu, 07 Sep 2017 10:49:45 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157326#comment-16157326
 ]


Timothy Farkas commented on DRILL-5138:
---------------------------------------

Hi [~rkins],

It looks like it is taking the TopN operator almost a second to get each batch 
of 1023 records from the upstream operator and there are almost 1.5 million 
batches in total. This suggests the issue is occurring before the TopN 
operator. Could you try removing the limit from the query and seeing how long 
that takes?

create table test as select * from catalog_sales order by cs_quantity, 
cs_wholesale_cost; 

> TopN operator on top of ~110 GB data set is very slow
> -----------------------------------------------------
>
>                 Key: DRILL-5138
>                 URL: https://issues.apache.org/jira/browse/DRILL-5138
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Rahul Challapalli
>            Assignee: Timothy Farkas
>
> git.commit.id.abbrev=cf2b7c7
> No of cores : 23
> No of disks : 5
> DRILL_MAX_DIRECT_MEMORY="24G"
> DRILL_MAX_HEAP="12G"
> The below query ran for more than 4 hours and did not complete. The table is 
> ~110 GB
> {code}
> select * from catalog_sales order by cs_quantity, cs_wholesale_cost limit 1;
> {code}
> Physical Plan :
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 1.0, cumulative 
> cost = {1.00798629141E10 rows, 4.17594320691E10 cpu, 0.0 io, 
> 4.1287118487552E13 network, 0.0 memory}, id = 352
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 1.0, 
> cumulative cost = {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 
> 4.1287118487552E13 network, 0.0 memory}, id = 351
> 00-02        Project(T0¦¦*=[$0]) : rowType = RecordType(ANY T0¦¦*): rowcount 
> = 1.0, cumulative cost = {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 
> 4.1287118487552E13 network, 0.0 memory}, id = 350
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T0¦¦*, ANY 
> cs_quantity, ANY cs_wholesale_cost): rowcount = 1.0, cumulative cost = 
> {1.0079862914E10 rows, 4.1759432069E10 cpu, 0.0 io, 4.1287118487552E13 
> network, 0.0 memory}, id = 349
> 00-04            Limit(fetch=[1]) : rowType = RecordType(ANY T0¦¦*, ANY 
> cs_quantity, ANY cs_wholesale_cost): rowcount = 1.0, cumulative cost = 
> {1.0079862913E10 rows, 4.1759432068E10 cpu, 0.0 io, 4.1287118487552E13 
> network, 0.0 memory}, id = 348
> 00-05              SingleMergeExchange(sort0=[1 ASC], sort1=[2 ASC]) : 
> rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost): 
> rowcount = 1.439980416E9, cumulative cost = {1.0079862912E10 rows, 
> 4.1759432064E10 cpu, 0.0 io, 4.1287118487552E13 network, 0.0 memory}, id = 347
> 01-01                SelectionVectorRemover : rowType = RecordType(ANY T0¦¦*, 
> ANY cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative 
> cost = {8.639882496E9 rows, 3.0239588736E10 cpu, 0.0 io, 2.3592639135744E13 
> network, 0.0 memory}, id = 346
> 01-02                  TopN(limit=[1]) : rowType = RecordType(ANY T0¦¦*, ANY 
> cs_quantity, ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative 
> cost = {7.19990208E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 
> network, 0.0 memory}, id = 345
> 01-03                    Project(T0¦¦*=[$0], cs_quantity=[$1], 
> cs_wholesale_cost=[$2]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, 
> ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = 
> {5.759921664E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 network, 
> 0.0 memory}, id = 344
> 01-04                      HashToRandomExchange(dist0=[[$1]], dist1=[[$2]]) : 
> rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, cumulative cost = 
> {5.759921664E9 rows, 2.879960832E10 cpu, 0.0 io, 2.3592639135744E13 network, 
> 0.0 memory}, id = 343
> 02-01                        UnorderedMuxExchange : rowType = RecordType(ANY 
> T0¦¦*, ANY cs_quantity, ANY cs_wholesale_cost, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, cumulative cost = 
> {4.319941248E9 rows, 1.1519843328E10 cpu, 0.0 io, 0.0 network, 0.0 memory}, 
> id = 342
> 03-01                          Project(T0¦¦*=[$0], cs_quantity=[$1], 
> cs_wholesale_cost=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2, 
> hash32AsDouble($1))]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, ANY 
> cs_wholesale_cost, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 1.439980416E9, 
> cumulative cost = {2.879960832E9 rows, 1.0079862912E10 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 341
> 03-02                            Project(T0¦¦*=[$0], cs_quantity=[$1], 
> cs_wholesale_cost=[$2]) : rowType = RecordType(ANY T0¦¦*, ANY cs_quantity, 
> ANY cs_wholesale_cost): rowcount = 1.439980416E9, cumulative cost = 
> {1.439980416E9 rows, 4.319941248E9 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
> = 340
> 03-03                              Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath 
> [path=maprfs:///drill/testdata/tpcds/parquet/sf1000/catalog_sales]], 
> selectionRoot=maprfs:/drill/testdata/tpcds/parquet/sf1000/catalog_sales, 
> numFiles=1, usedMetadataFile=false, columns=[`*`]]]) : rowType = 
> (DrillRecordRow[*, cs_quantity, cs_wholesale_cost]): rowcount = 
> 1.439980416E9, cumulative cost = {1.439980416E9 rows, 4.319941248E9 cpu, 0.0 
> io, 0.0 network, 0.0 memory}, id = 339
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-5138) TopN operator on top of ~110 GB data set is very slow

Reply via email to