[
https://issues.apache.org/jira/browse/DRILL-5774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157746#comment-16157746
]
Robert Hou commented on DRILL-5774:
-----------------------------------
Here is the plan:
{noformat}
00-00 Screen
00-01 Project(EXPR$0=[$0])
00-02 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)])
00-03 UnionExchange
01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()])
01-02 Project($f0=[0])
01-03 SelectionVectorRemover
01-04 Filter(condition=[=($0, 0)])
01-05 SingleMergeExchange(sort0=[1 ASC])
02-01 SelectionVectorRemover
02-02 Sort(sort0=[$1], dir0=[ASC])
02-03 Project(id=[$0], str=[$1])
02-04 HashToRandomExchange(dist0=[[$1]])
03-01 UnorderedMuxExchange
04-01 Project(id=[$0], str=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($1, 1301011)])
04-02 Flatten(flattenField=[$1])
04-03 Project(id=[$0], str=[$1])
04-04 Scan(groupscan=[EasyGroupScan [selectionRoot=maprfs:/drill/testdata/resource-manager/flatten-large-small.json, numFiles=1, columns=[`id`, `str_list`], files=[maprfs:///drill/testdata/resource-manager/flatten-large-small.json]]])
{noformat}
One of the operators between the Scan and the Sort must have allocated the
extra memory for the batch. Flatten is the most likely candidate to investigate.
> Excessive memory allocation
> ---------------------------
>
> Key: DRILL-5774
> URL: https://issues.apache.org/jira/browse/DRILL-5774
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.11.0
> Reporter: Robert Hou
> Assignee: Paul Rogers
> Fix For: 1.12.0
>
>
> This query exhibits excessive memory allocation:
> {noformat}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> alter session set `planner.width.max_per_node` = 1;
> alter session set `planner.disable_exchanges` = true;
> alter session set `planner.width.max_per_query` = 1;
> select count(*) from (select * from (select id, flatten(str_list) str from
> dfs.`/drill/testdata/resource-manager/flatten-large-small.json`) d order by
> d.str) d1 where d1.id=0;
> {noformat}
> This query flattens a large table, producing about 160M records.
> Roughly half of the records have a one-byte string and half have a 253-byte
> string. Another 40K records have 223-byte strings.
> {noformat}
> select length(str), count(*) from (select id, flatten(str_list) str from
> dfs.`/drill/testdata/resource-manager/flatten-large-small.json`) group by
> length(str);
> +---------+-----------+
> | EXPR$0  | EXPR$1    |
> +---------+-----------+
> | 223     | 40000     |
> | 1       | 80042001  |
> | 253     | 80000000  |
> +---------+-----------+
> {noformat}
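As a quick sanity check on the distribution above, the row counts can be summed to confirm the "160M records" figure and to estimate the total string payload. This is only a sketch: the variable names are made up for illustration, and the counts are copied from the query output as shown.

```python
# Row counts per string length, copied from the query output above.
# Keys are length(str) in bytes, values are the number of records.
length_counts = {223: 40_000, 1: 80_042_001, 253: 80_000_000}

# Total number of flattened records.
total_rows = sum(length_counts.values())

# Total bytes of string data, ignoring any per-value vector overhead.
total_bytes = sum(length * count for length, count in length_counts.items())

# Average string length across all records.
avg_len = total_bytes / total_rows

print(f"rows: {total_rows:,}")
print(f"string bytes: {total_bytes:,}")
print(f"average length: {avg_len:.2f} bytes")
```

The total comes to roughly 160 million rows and about 20 GB of raw string data, averaging close to 127 bytes per string.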
> From the drillbit.log:
> {noformat}
> 2017-09-02 11:43:44,598 [26550427-6adf-a52e-2ea8-dc52d8d8433f:frag:0:0] DEBUG
> o.a.d.e.p.i.x.m.ExternalSortBatch - Actual batch schema & sizes {
> str(type: REQUIRED VARCHAR, count: 4096, std size: 54, actual size: 134,
> data size: 548360)
> id(type: OPTIONAL BIGINT, count: 4096, std size: 8, actual size: 9, data
> size: 36864)
> Records: 4096, Total size: 1073819648, Data size: 585224, Gross row width:
> 262163, Net row width: 143, Density: 1}
> {noformat}
> The data size is about 585 KB, but the batch occupies about 1 GB. The log
> reports a density of 1%.
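The figures in the drillbit.log excerpt above can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed in that log entry. The variable names are hypothetical, and the reading of the logged "Density: 1" as a percentage rounded up to a whole number is an assumption, not something the log states.

```python
# Values copied from the ExternalSortBatch log entry above.
records = 4096
total_size = 1_073_819_648          # "Total size", bytes allocated for the batch
data_size = 585_224                 # "Data size", bytes of actual data
column_data = {"str": 548_360, "id": 36_864}  # per-column "data size"

# The per-column data sizes should add up to the batch data size.
columns_total = sum(column_data.values())

# Bytes allocated per row versus bytes of real data per row.
gross_row_width = total_size / records
net_row_width = data_size / records

# Fraction of the allocated memory that holds real data, as a percentage.
density_pct = 100 * data_size / total_size

print(f"columns total:   {columns_total}")
print(f"gross row width: {gross_row_width:.0f} bytes")
print(f"net row width:   {net_row_width:.2f} bytes")
print(f"density:         {density_pct:.4f} %")
```

The gross row width works out to exactly 262163 bytes and the net row width rounds to 143 bytes, both matching the log. The computed density is well below 1%, so the logged value of 1 looks like a lower bound rather than an exact figure.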