[jira] [Commented] (HIVE-6348) Order by/Sort by in subquery

Rui Li (JIRA) Mon, 12 Jun 2017 02:53:18 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046396#comment-16046396
 ]


Rui Li commented on HIVE-6348:
------------------------------

Hi [~vgarg], it turns out we do need the sub query being sorted in some cases. 
An example is input20.q:
{code}
EXPLAIN
FROM (
  FROM src
  MAP src.key, src.key 
  USING 'cat'
  DISTRIBUTE BY key
  SORT BY key, value
) tmap
INSERT OVERWRITE TABLE dest1
REDUCE tmap.key, tmap.value
USING 'python input20_script.py'
AS key, value;
{code}
The query plan is:
{noformat}
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: src
            Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
            Select Operator
              expressions: key (type: string), key (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
              Transform Operator
                command: cat
                output info:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: 
COMPLETE Column stats: NONE
                  value expressions: _col0 (type: string), _col1 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), VALUE._col1 (type: string)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
          Transform Operator
            command: python input20_script.py
            output info:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
            Select Operator
              expressions: UDFToInteger(_col0) (type: int), _col1 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE 
Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.dest1
{noformat}
The script {{input20_script.py}} counts the number of times each <Key, Value> 
appears and expects the input to be sorted. I'll consider how to handle such a 
case.

> Order by/Sort by in subquery
> ----------------------------
>
>                 Key: HIVE-6348
>                 URL: https://issues.apache.org/jira/browse/HIVE-6348
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Rui Li
>            Priority: Minor
>              Labels: sub-query
>         Attachments: HIVE-6348.1.patch, HIVE-6348.2.patch, HIVE-6348.3.patch
>
>
> select * from (select * from foo order by c asc) bar order by c desc;
> in hive sorts the data set twice. The optimizer should probably remove any 
> order by/sort by in the sub query unless you use 'limit '. Could even go so 
> far as barring it at the semantic level.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (HIVE-6348) Order by/Sort by in subquery

Reply via email to