Hari Sankar Sivarama Subramaniyan created HIVE-7188: -------------------------------------------------------
Summary: sum(if()) returns wrong results with vectorization Key: HIVE-7188 URL: https://issues.apache.org/jira/browse/HIVE-7188 Project: Hive Issue Type: Bug Reporter: Hari Sankar Sivarama Subramaniyan Assignee: Hari Sankar Sivarama Subramaniyan Attachments: hike-vector-sum-bug.tgz 1. The tgz file containing the setup is attached. 2. Run the following query select sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning from hike_error.ttr_day0; returns 0 rows with vectorization turned on whereas it return 131 rows with vectorization turned off. hive> source insert.sql > ; OK Time taken: 0.359 seconds OK Time taken: 0.015 seconds OK Time taken: 0.069 seconds OK Time taken: 0.176 seconds Loading data to table hike_error.ttr_day0 Table hike_error.ttr_day0 stats: [numFiles=1, numRows=0, totalSize=3581, rawDataSize=0] OK Time taken: 0.33 seconds hive> select > sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning > from hike_error.ttr_day0; Query ID = hsubramaniyan_20140606134646_04790d3d-ca9a-427a-8cf9-3174536114ed Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Execution log at: /var/folders/r0/9x0wltgx2nv4m4b18m71z1y40000gr/T//hsubramaniyan/hsubramaniyan_20140606134646_04790d3d-ca9a-427a-8cf9-3174536114ed.log Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 0; number of reducers: 0 2014-06-06 13:47:02,043 null map = 0%, reduce = 100% Ended Job = job_local773704964_0001 Execution completed successfully MapredLocal task succeeded OK 131 Time taken: 5.325 seconds, Fetched: 1 row(s) hive> set hive.vectorized.execution.enabled=true; hive> select > sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning > from hike_error.ttr_day0; Query ID = hsubramaniyan_20140606134747_1182c765-90ac-4a33-a8b1-760adca6bf38 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> Execution log at: /var/folders/r0/9x0wltgx2nv4m4b18m71z1y40000gr/T//hsubramaniyan/hsubramaniyan_20140606134747_1182c765-90ac-4a33-a8b1-760adca6bf38.log Job running in-process (local Hadoop) Hadoop job information for null: number of mappers: 0; number of reducers: 0 2014-06-06 13:47:18,604 null map = 0%, reduce = 100% Ended Job = job_local701415676_0001 Execution completed successfully MapredLocal task succeeded OK 0 Time taken: 5.52 seconds, Fetched: 1 row(s) hive> explain select > sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning > from hike_error.ttr_day0; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: ttr_day0 Statistics: Num rows: 447 Data size: 3581 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: is_returning (type: boolean), is_free (type: boolean) outputColumnNames: is_returning, is_free Statistics: Num rows: 447 Data size: 3581 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: sum(if(((is_returning = true) and (is_free = false)), 1, 0)) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Execution mode: vectorized Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 0.079 seconds, Fetched: 49 row(s) -- This message was sent by Atlassian JIRA (v6.2#6252)