[
https://issues.apache.org/jira/browse/DRILL-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180277#comment-15180277
]
Laurent Goujon commented on DRILL-4468:
---------------------------------------
I looked into this further with the help of [~jnadeau]. The issue is that
ExpressionTreeMaterializer tries to find the best function based on the
argument (which is NullExpression.INSTANCE), and the best match (on a
cast-cost basis) is the one taking VARCHAR.
It seems suspicious to me that an invalid function can be considered at all.
Perhaps this is something that should be revisited?
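
To make the resolution problem concrete, here is a minimal sketch of cost-based
overload selection. It is not Drill's code: the Type enum, the Overload class,
the cost values in castCost and the validOnly switch are all hypothetical,
chosen only so that an untyped NULL argument is cheapest to cast to VARCHAR,
as observed above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class CastCostSketch {
    /** Simplified, hypothetical type set (not Drill's real type enum). */
    enum Type { NULL, INT, BIGINT, VARCHAR }

    /** Hypothetical overload of an aggregate: its parameter type and whether it can really run. */
    static final class Overload {
        final Type param;
        final boolean valid;

        Overload(Type param, boolean valid) {
            this.param = param;
            this.valid = valid;
        }
    }

    /** Made-up cost of casting an argument of type 'from' to type 'to'; lower is better. */
    static int castCost(Type from, Type to) {
        if (from == to) {
            return 0;
        }
        if (from == Type.NULL) {
            // Assumption: VARCHAR is the cheapest target for an untyped NULL,
            // mirroring the behavior described in the comment.
            switch (to) {
                case VARCHAR: return 1;
                case INT:     return 2;
                case BIGINT:  return 3;
                default:      return Integer.MAX_VALUE;
            }
        }
        return Integer.MAX_VALUE; // no other casts are modeled in this sketch
    }

    /** Picks the cheapest overload; when validOnly is set, invalid overloads are skipped first. */
    static Optional<Overload> resolve(List<Overload> overloads, Type arg, boolean validOnly) {
        return overloads.stream()
                .filter(o -> !validOnly || o.valid)
                .min(Comparator.comparingInt((Overload o) -> castCost(arg, o.param)));
    }

    public static void main(String[] args) {
        List<Overload> sum = Arrays.asList(
                new Overload(Type.INT, true),
                new Overload(Type.BIGINT, true),
                new Overload(Type.VARCHAR, false)); // the overload considered invalid above

        // Cost-only resolution, then resolution restricted to valid overloads.
        System.out.println(resolve(sum, Type.NULL, false).get().param);
        System.out.println(resolve(sum, Type.NULL, true).get().param);
    }
}
```

With these made-up costs, cost-only resolution selects the VARCHAR overload,
while filtering out invalid overloads first selects INT. Whether such a filter
is the right fix for Drill is an open question; the sketch only illustrates
the concern.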
> Aggregates over empty input might fail
> --------------------------------------
>
> Key: DRILL-4468
> URL: https://issues.apache.org/jira/browse/DRILL-4468
> Project: Apache Drill
> Issue Type: Bug
> Environment: Linux/OpenJDK 7
> Reporter: Laurent Goujon
>
> Some aggregation queries over empty input might fail, depending on the column
> ordering.
> This query, for example, fails:
> {noformat}
> select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where
> 1 = 0
> org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION
> ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type
> Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010]
> {noformat}
> But this one succeeds:
> {noformat}
> select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where
> 1 = 0
> null null
> {noformat}
> The reason only one query fails is DRILL-4467: as a consequence, the two
> plans are significantly different and don't behave quite the same way.
> Here's the physical plan for the first query:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0,
> cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 339
> 00-01 Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY
> col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 338
> 00-02 StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType
> = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0
> rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337
> 00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY
> int_col, ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows,
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 336
> 00-04 Scan(groupscan=[EasyGroupScan
> [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`,
> `bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY
> int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows,
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335
> {noformat}
> and the physical plan for the second query:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0,
> cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 775
> 00-01 Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY
> col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 774
> 00-02 StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType
> = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0
> rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773
> 00-03 Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY
> bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows,
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772
> 00-04 Project(bigint_col=[$1], int_col=[$0]) : rowType =
> RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost =
> {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771
> 00-05 Scan(groupscan=[EasyGroupScan
> [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`,
> `int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY
> int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows,
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770
> {noformat}
> The extra projection just before the scan seems to hide the VARCHAR type of
> the columns, which allows the aggregation to succeed. On the other hand, the
> storage plugin supports column pushdown, so the projection is
> theoretically unnecessary.