Laurent Goujon created DRILL-4468:
-------------------------------------
Summary: Aggregates over empty input might fail
Key: DRILL-4468
URL: https://issues.apache.org/jira/browse/DRILL-4468
Project: Apache Drill
Issue Type: Bug
Environment: Linux/OpenJDK 7
Reporter: Laurent Goujon
Some aggregation queries over empty input might fail, depending of the column
ordering.
This query for example would fail:
{noformat}
select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 1
= 0
org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION
ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type
Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010]
{noformat}
But this one would succeed:
{noformat}
select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where 1
= 0
null null
{noformat}
The reason for why only one query fails is because of DRILL-4467. The
consequence is that the plans are significantly different, and don't behave
quite the same way.
Here's the Physical plan for the first query:
{noformat}
00-00 Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0,
cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
= 339
00-01 Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY
col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0
network, 0.0 memory}, id = 338
00-02 StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType =
RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 rows,
950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337
00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY int_col,
ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0 cpu, 0.0
io, 0.0 network, 0.0 memory}, id = 336
00-04 Scan(groupscan=[EasyGroupScan
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`,
`bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY
int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows,
926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335
{noformat}
and the physical plan for the second query:
{noformat}
00-00 Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0,
cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
= 775
00-01 Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY
col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0
network, 0.0 memory}, id = 774
00-02 StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType =
RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 rows,
950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773
00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY
bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows, 926.0
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772
00-04 Project(bigint_col=[$1], int_col=[$0]) : rowType =
RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost =
{463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771
00-05 Scan(groupscan=[EasyGroupScan
[selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`,
`int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY
int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows,
926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770
{noformat}
The extra projection just before the scan seems to hide the VARCHAR type of the
columns, and allow for aggregation to succeed. On the other hand, the storage
plugin allows for column push down, so the projection is theoretically
unnecessary.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)