Chris Chen created HIVE-7261:
--------------------------------

             Summary: Wrong results when hive.groupby.skewindata is true and 
count(*), count(distinct), and group by are used simultaneously 
                 Key: HIVE-7261
                 URL: https://issues.apache.org/jira/browse/HIVE-7261
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.12.0
         Environment: Hive 0.12, Hadoop 1.0.4
            Reporter: Chris Chen


【Phenomenon】
The query returns different results depending on whether hive.groupby.skewindata 
is set to true or false.

【my question】
I want to compute count(*) and count(distinct) in a single query; otherwise it 
costs two MR jobs. When I set hive.groupby.skewindata to true, the count(*) 
result should not equal the count(distinct) result, but in fact they come out 
the same, which is wrong. Comparing the two query plans, the only difference is 
"Reduce Operator Tree -> Group By Operator -> mode": it is mergepartial when 
skew is set to false and complete when skew is set to true. So I am confused 
about the root cause of the error.
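For reference, the expected semantics can be sketched outside Hive. This is a 
minimal simulation in Python with sqlite3 (table and column names mirror the 
query; the sample rows are made up), showing that count(*) counts all rows per 
group while count(distinct guid) counts unique guids, so the two must differ 
whenever a group contains repeated guids:

```python
import sqlite3

# In-memory table mimicking eventinfo_tmp (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table eventinfo_tmp "
    "(ds text, appid int, eventname text, active int, guid text)"
)
rows = [
    ("20140612", 8, "alibaba", 1, "guid-a"),
    ("20140612", 8, "alibaba", 1, "guid-a"),  # duplicate guid
    ("20140612", 8, "alibaba", 1, "guid-b"),
]
conn.executemany("insert into eventinfo_tmp values (?,?,?,?,?)", rows)

# Correct semantics: 2 distinct guids but 3 total rows for the group,
# matching the shape of the skewindata=false result in the report.
result = conn.execute(
    "select ds, appid, eventname, active, count(distinct guid), count(*) "
    "from eventinfo_tmp group by ds, appid, eventname, active"
).fetchone()
print(result)  # → ('20140612', 8, 'alibaba', 1, 2, 3)
```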

【sql】
select ds,appid,eventname,active,{color:red}count(distinct(guid)), count(*) 
{color}from eventinfo_tmp where ds='20140612' and length(eventname)<1000 and 
eventname like '%alibaba%' group by ds,appid,eventname,active;
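If the count(DISTINCT) path is the trigger, one possible rewrite is to 
pre-aggregate by guid in a subquery so the outer query needs no count(DISTINCT) 
at all. A hedged sketch (semantics verified here against sqlite3 rather than 
Hive; whether this sidesteps the skewindata bug on 0.12 is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table eventinfo_tmp "
    "(ds text, appid int, eventname text, active int, guid text)"
)
conn.executemany(
    "insert into eventinfo_tmp values (?,?,?,?,?)",
    [("20140612", 8, "alibaba", 1, g) for g in ("guid-a", "guid-a", "guid-b")],
)

# Rewrite: the inner query groups by (keys, guid), so the outer level can use
# plain count(guid) for the distinct count and sum(cnt) for the row count.
rewritten = conn.execute("""
    select ds, appid, eventname, active,
           count(guid) as distinct_guids,  -- guid already unique per outer group
           sum(cnt)    as total_rows
    from (select ds, appid, eventname, active, guid, count(*) as cnt
          from eventinfo_tmp
          group by ds, appid, eventname, active, guid)
    group by ds, appid, eventname, active
""").fetchone()
print(rewritten)  # → ('20140612', 8, 'alibaba', 1, 2, 3)
```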

【other Hive configuration, excluding hive.groupby.skewindata】
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
io.seqfile.compression.type=BLOCK
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
hive.map.aggr=true
hive.stats.autogather=false
hive.exec.scratchdir=/user/complat/tmp
mapred.job.queue.name=complat
hive.exec.mode.local.auto=false
hive.exec.mode.local.auto.inputbytes.max=500
hive.exec.mode.local.auto.tasks.max=10
hive.exec.mode.local.auto.input.files.max=1000
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
mapred.max.split.size=100000000
mapred.min.split.size.per.node=100000000
mapred.min.split.size.per.rack=100000000

【result】
when hive.groupby.skewindata=true the result is:
20140612        8       alibaba 1       {color:red}87   147{color}

when hive.groupby.skewindata=false the result is:
20140612        8       alibaba 1       {color:red}87   87{color}

【query plan】
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT 
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR 
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) 
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR 
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) 
'20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like 
(TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY 
(TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) 
(TOK_TABLE_OR_COL active))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        eventinfo_tmp 
          TableScan
            alias: eventinfo_tmp
            Filter Operator
              predicate:
                  expr: ((length(eventname) < 1000) and (eventname like 
'%tvvideo_setting%'))
                  type: boolean
              Select Operator
                expressions:
                      expr: ds
                      type: string
                      expr: appid
                      type: string
                      expr: eventname
                      type: string
                      expr: active
                      type: int
                      expr: guid
                      type: string
                outputColumnNames: ds, appid, eventname, active, guid
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT guid)
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: ds
                        type: string
                        expr: appid
                        type: string
                        expr: eventname
                        type: string
                        expr: active
                        type: int
                        expr: guid
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
                          expr: _col3
                          type: int
                          expr: _col4
                          type: string
                    sort order: +++++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
                          expr: _col3
                          type: int
                    tag: -1
                    value expressions:
                          expr: _col5
                          type: bigint
                          expr: _col6
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col4:0._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
                expr: KEY._col2
                type: string
                expr: KEY._col3
                type: int
          mode: {color:red}mergepartial{color}
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: string
                  expr: _col2
                  type: string
                  expr: _col3
                  type: int
                  expr: _col4
                  type: bigint
                  expr: _col5
                  type: bigint
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT 
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
(TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL appid)) (TOK_SELEXPR 
(TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active)) 
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR 
(TOK_FUNCTIONSTAR count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) 
'20140612') (< (TOK_FUNCTION length (TOK_TABLE_OR_COL eventname)) 1000)) (like 
(TOK_TABLE_OR_COL eventname) '%tvvideo_setting%'))) (TOK_GROUPBY 
(TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) 
(TOK_TABLE_OR_COL active))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        eventinfo_tmp 
          TableScan
            alias: eventinfo_tmp
            Filter Operator
              predicate:
                  expr: ((length(eventname) < 1000) and (eventname like 
'%tvvideo_setting%'))
                  type: boolean
              Select Operator
                expressions:
                      expr: ds
                      type: string
                      expr: appid
                      type: string
                      expr: eventname
                      type: string
                      expr: active
                      type: int
                      expr: guid
                      type: string
                outputColumnNames: ds, appid, eventname, active, guid
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT guid)
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: ds
                        type: string
                        expr: appid
                        type: string
                        expr: eventname
                        type: string
                        expr: active
                        type: int
                        expr: guid
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
                          expr: _col3
                          type: int
                          expr: _col4
                          type: string
                    sort order: +++++
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: string
                          expr: _col2
                          type: string
                          expr: _col3
                          type: int
                    tag: -1
                    value expressions:
                          expr: _col5
                          type: bigint
                          expr: _col6
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col4:0._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
                expr: KEY._col2
                type: string
                expr: KEY._col3
                type: int
          mode: {color:red}complete{color}
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: string
                  expr: _col2
                  type: string
                  expr: _col3
                  type: int
                  expr: _col4
                  type: bigint
                  expr: _col5
                  type: bigint
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

--
This message was sent by Atlassian JIRA
(v6.2#6252)