[ https://issues.apache.org/jira/browse/DRILL-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180277#comment-15180277 ]

Laurent Goujon commented on DRILL-4468:
---------------------------------------

I looked more into this with the help of [~jnadeau]. The issue is that 
ExpressionTreeMaterializer tries to find the best function based on the 
arguments (which are NullExpression.INSTANCE), and the best one (on a cast 
cost basis) is the one taking VARCHAR.

It sounds fishy to me that an invalid function can be considered at all; this 
might be something that should be revisited.
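To make the failure mode concrete, here is a minimal, self-contained sketch of cost-based overload resolution. It is not Drill's actual API (the enum, `Candidate` record, and cost table are all hypothetical); it only illustrates how a NULL argument that casts cheaply to every type can let an invalid VARCHAR overload win the tie, as described above.

```java
import java.util.Comparator;
import java.util.List;

public class BestFunctionSketch {
    // Hypothetical types; Drill's real type system is much richer.
    enum Type { NULL, INT, BIGINT, VARCHAR }

    // A candidate function implementation, keyed by its parameter type.
    record Candidate(String name, Type paramType) {}

    // Hypothetical cast-cost table: a NULL argument casts "cheaply" to
    // every type, so no candidate is ruled out up front.
    static int castCost(Type from, Type to) {
        if (from == Type.NULL) return 1;   // NULL is castable to anything
        if (from == to) return 0;          // exact match is free
        return 100;                        // arbitrary non-trivial cost
    }

    // Pick the candidate with the lowest cast cost for the argument type.
    static Candidate resolve(List<Candidate> candidates, Type argType) {
        return candidates.stream()
                .min(Comparator.comparingInt(c -> castCost(argType, c.paramType())))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Candidate> sums = List.of(
                new Candidate("sum_varchar", Type.VARCHAR), // invalid for SUM
                new Candidate("sum_int", Type.INT),
                new Candidate("sum_bigint", Type.BIGINT));
        // With a NULL argument every candidate ties at cost 1, so
        // enumeration order decides -- the VARCHAR overload is chosen
        // even though SUM over VARCHAR is unsupported.
        System.out.println(resolve(sums, Type.NULL).name()); // sum_varchar
    }
}
```

The sketch suggests why filtering out known-invalid candidates before the cost comparison (rather than relying on cost alone) would avoid the error.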

> Aggregates over empty input might fail
> --------------------------------------
>
>                 Key: DRILL-4468
>                 URL: https://issues.apache.org/jira/browse/DRILL-4468
>             Project: Apache Drill
>          Issue Type: Bug
>         Environment: Linux/OpenJDK 7
>            Reporter: Laurent Goujon
>
> Some aggregation queries over empty input might fail, depending on the column 
> ordering.
> This query for example would fail:
> {noformat}
> select sum(int_col) col1, sum(bigint_col) col2 from cp.`employee.json` where 
> 1 = 0
> org.apache.drill.common.exceptions.UserRemoteException: UNSUPPORTED_OPERATION 
> ERROR: Only COUNT, MIN and MAX aggregate functions supported for VarChar type 
> Fragment 0:0 [Error Id: dcef042c-1c53-40df-88b0-816d3cb109a7 on xxx:31010] 
> {noformat}
> But this one would succeed:
> {noformat}
> select sum(bigint_col) col2, sum(int_col) col1 from cp.`employee.json` where 
> 1 = 0
> null    null
> {noformat}
> The reason only one query fails is DRILL-4467. As a consequence, the two 
> plans are significantly different and don't behave quite the same way.
> Here's the Physical plan for the first query:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY col1, ANY col2): rowcount = 1.0, 
> cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, 
> id = 339
> 00-01      Project(col1=[$0], col2=[$1]) : rowType = RecordType(ANY col1, ANY 
> col2): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 338
> 00-02        StreamAgg(group=[{}], col1=[SUM($0)], col2=[SUM($1)]) : rowType 
> = RecordType(ANY col1, ANY col2): rowcount = 1.0, cumulative cost = {464.0 
> rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 337
> 00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY 
> int_col, ANY bigint_col): rowcount = 1.0, cumulative cost = {463.0 rows, 
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 336
> 00-04            Scan(groupscan=[EasyGroupScan 
> [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`int_col`, 
> `bigint_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY 
> int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 335
> {noformat} 
> and the physical plan for the second query:
> {noformat}
> 00-00    Screen : rowType = RecordType(ANY col2, ANY col1): rowcount = 1.0, 
> cumulative cost = {464.1 rows, 950.1 cpu, 0.0 io, 0.0 network, 0.0 memory}, 
> id = 775
> 00-01      Project(col2=[$0], col1=[$1]) : rowType = RecordType(ANY col2, ANY 
> col1): rowcount = 1.0, cumulative cost = {464.0 rows, 950.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 774
> 00-02        StreamAgg(group=[{}], col2=[SUM($0)], col1=[SUM($1)]) : rowType 
> = RecordType(ANY col2, ANY col1): rowcount = 1.0, cumulative cost = {464.0 
> rows, 950.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 773
> 00-03          Limit(offset=[0], fetch=[0]) : rowType = RecordType(ANY 
> bigint_col, ANY int_col): rowcount = 1.0, cumulative cost = {463.0 rows, 
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 772
> 00-04            Project(bigint_col=[$1], int_col=[$0]) : rowType = 
> RecordType(ANY bigint_col, ANY int_col): rowcount = 463.0, cumulative cost = 
> {463.0 rows, 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 771
> 00-05              Scan(groupscan=[EasyGroupScan 
> [selectionRoot=classpath:/employee.json, numFiles=1, columns=[`bigint_col`, 
> `int_col`], files=[classpath:/employee.json]]]) : rowType = RecordType(ANY 
> int_col, ANY bigint_col): rowcount = 463.0, cumulative cost = {463.0 rows, 
> 926.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 770
> {noformat}
> The extra projection just before the scan seems to hide the VARCHAR type of 
> the columns, allowing the aggregation to succeed. On the other hand, the 
> storage plugin supports column pushdown, so the projection is theoretically 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
