[jira] [Work logged] (HIVE-24579) Incorrect Result For Groupby With Limit

ASF GitHub Bot (Jira) Tue, 28 Sep 2021 12:51:36 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24579?focusedWorklogId=656446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-656446
 ]


ASF GitHub Bot logged work on HIVE-24579:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Sep/21 19:50
            Start Date: 28/Sep/21 19:50
    Worklog Time Spent: 10m 
      Work Description: kasakrisz commented on a change in pull request #2656:
URL: https://github.com/apache/hive/pull/2656#discussion_r717404007



##########
File path: ql/src/test/results/clientpositive/llap/groupby1_limit.q.out
##########
@@ -71,33 +71,34 @@ STAGE PLANS:
                 mode: mergepartial
                 outputColumnNames: _col0, _col1
                 Statistics: Num rows: 316 Data size: 30020 Basic stats: COMPLETE Column stats: COMPLETE
-                Limit
-                  Number of rows: 5
-                  Statistics: Num rows: 5 Data size: 475 Basic stats: COMPLETE Column stats: COMPLETE
-                  Reduce Output Operator
-                    null sort order: 
-                    sort order: 
-                    Statistics: Num rows: 5 Data size: 475 Basic stats: COMPLETE Column stats: COMPLETE
-                    TopN Hash Memory Usage: 0.1
-                    value expressions: _col0 (type: string), _col1 (type: double)
+                Reduce Output Operator

Review comment:
       But we still have a TopNKey operator in the Mapper (in both the old and the new plan), and it filters out the majority of the rows.
   
   This query has the same issue as the example in the Jira: it has a GROUP BY with 
LIMIT plus an aggregate function in the projection:
   ```
   SELECT src.key, sum(substr(src.value,5)) FROM src GROUP BY src.key LIMIT 5
   ``` 
   If no ordering is specified, we may end up with incorrect aggregations.
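
The concern above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hive's implementation: `map_side` and `reduce_side` are hypothetical helpers, each mapper is assumed to keep only its `top_n` smallest group keys after partial aggregation, and the reducer simply merges whatever partial counts arrive.

```python
from collections import Counter

def map_side(rows, top_n=None):
    """Partially aggregate one mapper's keys (count per key).

    If top_n is given, keep only the top_n smallest keys, which is an
    assumed, simplified stand-in for map-side TopN filtering.
    """
    partial = Counter(rows)
    if top_n is not None:
        kept = sorted(partial)[:top_n]
        partial = Counter({k: partial[k] for k in kept})
    return partial

def reduce_side(partials):
    """Merge the partial counts coming from all mappers."""
    merged = Counter()
    for p in partials:
        merged.update(p)
    return merged

# Hypothetical input splits: key 3 appears in both mappers.
mapper_inputs = [
    [1, 1, 2, 3, 3, 3],  # mapper A sees keys 1, 2, 3
    [3, 4, 4, 5],        # mapper B sees keys 3, 4, 5
]
LIMIT = 2

exact = reduce_side(map_side(rows) for rows in mapper_inputs)
pruned = reduce_side(map_side(rows, top_n=LIMIT) for rows in mapper_inputs)

# Keys that survive the pruning but carry an incomplete aggregate.
wrong_keys = sorted(k for k in pruned if pruned[k] != exact[k])
print(wrong_keys)            # -> [3]
print(pruned[3], exact[3])   # -> 1 4
```

In this model the two globally smallest keys (1 and 2) still have correct counts, so a query that orders by the key before applying the limit would return correct rows. Without an ordering, the limit is free to return key 3, whose count lost mapper A's contribution.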






Issue Time Tracking
-------------------

    Worklog Id:     (was: 656446)
    Time Spent: 1h 20m  (was: 1h 10m)

> Incorrect Result For Groupby With Limit
> ---------------------------------------
>
>                 Key: HIVE-24579
>                 URL: https://issues.apache.org/jira/browse/HIVE-24579
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>    Affects Versions: 2.3.7, 3.1.2, 4.0.0
>            Reporter: Nemon Lou
>            Assignee: Krisztian Kasa
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>       DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   GatherStats: false
>                   Select Operator
>                     expressions: id (type: int)
>                     outputColumnNames: id
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: count()
>                       keys: id (type: int)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                         tag: -1
>                         TopN: 10
>                         TopN Hash Memory Usage: 0.1
>                         value expressions: _col1 (type: bigint)
>                         auto parallelism: true
>             Execution mode: vectorized
>             Path -> Alias:
>               file:/user/hive/warehouse/test [test]
>             Path -> Partition:
>               file:/user/hive/warehouse/test 
>                 Partition
>                   base file name: test
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   properties:
>                     COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                     bucket_count -1
>                     bucketing_version 2
>                     column.name.delimiter ,
>                     columns id
>                     columns.comments 
>                     columns.types int
>                     file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                     file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     location file:/user/hive/warehouse/test
>                     name default.test
>                     numFiles 0
>                     numRows 0
>                     rawDataSize 0
>                     serialization.ddl struct test { i32 id}
>                     serialization.format 1
>                     serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     totalSize 0
>                     transient_lastDdlTime 1609730190
>                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     properties:
>                       COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>                       bucket_count -1
>                       bucketing_version 2
>                       column.name.delimiter ,
>                       columns id
>                       columns.comments 
>                       columns.types int
>                       file.inputformat org.apache.hadoop.mapred.TextInputFormat
>                       file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                       location file:/user/hive/warehouse/test
>                       name default.test
>                       numFiles 0
>                       numRows 0
>                       rawDataSize 0
>                       serialization.ddl struct test { i32 id}
>                       serialization.format 1
>                       serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                       totalSize 0
>                       transient_lastDdlTime 1609730190
>                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     name: default.test
>                   name: default.test
>             Truncated Path -> Alias:
>               /test [test]
>         Reducer 2 
>             Execution mode: vectorized
>             Needs Tagging: false
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     GlobalTableId: 0
>                     directory: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002
>                     NumFilesPerFileSink: 1
>                     Statistics: Num rows: 1 Data size: 13500 Basic stats: COMPLETE Column stats: NONE
>                     Stats Publishing Key Prefix: file:/tmp/root/7160ea24-52b9-47c3-aafc-c9200263a1c6/hive_2021-01-04_14-15-27_601_190083924675700904-1/-mr-10001/.hive-staging_hive_2021-01-04_14-15-27_601_190083924675700904-1/-ext-10002/
>                     table:
>                         input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         properties:
>                           columns _col0,_col1
>                           columns.types int:bigint
>                           escape.delim \
>                           hive.serialization.extend.additional.nesting.levels true
>                           serialization.escape.crlf true
>                           serialization.format 1
>                           serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                     TotalFiles: 1
>                     GatherStats: false
>                     MultiFileSpray: false
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> Time taken: 0.102 seconds, Fetched: 143 row(s)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)