Hari Sankar Sivarama Subramaniyan created HIVE-7188:
-------------------------------------------------------

             Summary: sum(if()) returns wrong results with vectorization
                 Key: HIVE-7188
                 URL: https://issues.apache.org/jira/browse/HIVE-7188
             Project: Hive
          Issue Type: Bug
            Reporter: Hari Sankar Sivarama Subramaniyan
            Assignee: Hari Sankar Sivarama Subramaniyan
         Attachments: hike-vector-sum-bug.tgz

1. The tgz file containing the setup is attached.
2. Run the following query
select
    sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning
    from hike_error.ttr_day0;

returns 0 rows with vectorization turned on whereas it return 131 rows with 
vectorization turned off.



hive> source insert.sql
    > ;
OK
Time taken: 0.359 seconds
OK
Time taken: 0.015 seconds
OK
Time taken: 0.069 seconds
OK
Time taken: 0.176 seconds
Loading data to table hike_error.ttr_day0
Table hike_error.ttr_day0 stats: [numFiles=1, numRows=0, totalSize=3581, 
rawDataSize=0]
OK
Time taken: 0.33 seconds
hive> select
    >     sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning
    >     from hike_error.ttr_day0;
Query ID = hsubramaniyan_20140606134646_04790d3d-ca9a-427a-8cf9-3174536114ed
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: 
/var/folders/r0/9x0wltgx2nv4m4b18m71z1y40000gr/T//hsubramaniyan/hsubramaniyan_20140606134646_04790d3d-ca9a-427a-8cf9-3174536114ed.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-06-06 13:47:02,043 null map = 0%,  reduce = 100%
Ended Job = job_local773704964_0001
Execution completed successfully
MapredLocal task succeeded
OK
131
Time taken: 5.325 seconds, Fetched: 1 row(s)
hive> set hive.vectorized.execution.enabled=true;                               
                                 
hive> select
    >     sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning
    >     from hike_error.ttr_day0;
Query ID = hsubramaniyan_20140606134747_1182c765-90ac-4a33-a8b1-760adca6bf38
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Execution log at: 
/var/folders/r0/9x0wltgx2nv4m4b18m71z1y40000gr/T//hsubramaniyan/hsubramaniyan_20140606134747_1182c765-90ac-4a33-a8b1-760adca6bf38.log
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 0; number of reducers: 0
2014-06-06 13:47:18,604 null map = 0%,  reduce = 100%
Ended Job = job_local701415676_0001
Execution completed successfully
MapredLocal task succeeded
OK
0
Time taken: 5.52 seconds, Fetched: 1 row(s)
hive> explain select
    >     sum(if(is_returning=true and is_free=false,1,0)) as unpaid_returning
    >     from hike_error.ttr_day0;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ttr_day0
            Statistics: Num rows: 447 Data size: 3581 Basic stats: COMPLETE 
Column stats: NONE
            Select Operator
              expressions: is_returning (type: boolean), is_free (type: boolean)
              outputColumnNames: is_returning, is_free
              Statistics: Num rows: 447 Data size: 3581 Basic stats: COMPLETE 
Column stats: NONE
              Group By Operator
                aggregations: sum(if(((is_returning = true) and (is_free = 
false)), 1, 0))
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                Reduce Output Operator
                  sort order: 
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  value expressions: _col0 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
          Select Operator
            expressions: _col0 (type: bigint)
            outputColumnNames: _col0
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column 
stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.079 seconds, Fetched: 49 row(s)




--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to