[
https://issues.apache.org/jira/browse/HIVE-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046396#comment-16046396
]
Rui Li commented on HIVE-6348:
------------------------------
Hi [~vgarg], it turns out we do need the sub query being sorted in some cases.
An example is input20.q:
{code}
EXPLAIN
FROM (
FROM src
MAP src.key, src.key
USING 'cat'
DISTRIBUTE BY key
SORT BY key, value
) tmap
INSERT OVERWRITE TABLE dest1
REDUCE tmap.key, tmap.value
USING 'python input20_script.py'
AS key, value;
{code}
The query plan is:
{noformat}
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: src
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: key (type: string), key (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Transform Operator
command: cat
output info:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 500 Data size: 5312 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), VALUE._col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Transform Operator
command: python input20_script.py
output info:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
Select Operator
expressions: UDFToInteger(_col0) (type: int), _col1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE
Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.dest1
{noformat}
The script {{input20_script.py}} counts the number of times each <Key, Value>
appears and expects the input to be sorted. I'll consider how to handle such a
case.
> Order by/Sort by in subquery
> ----------------------------
>
> Key: HIVE-6348
> URL: https://issues.apache.org/jira/browse/HIVE-6348
> Project: Hive
> Issue Type: Bug
> Reporter: Gunther Hagleitner
> Assignee: Rui Li
> Priority: Minor
> Labels: sub-query
> Attachments: HIVE-6348.1.patch, HIVE-6348.2.patch, HIVE-6348.3.patch
>
>
> select * from (select * from foo order by c asc) bar order by c desc;
> in hive sorts the data set twice. The optimizer should probably remove any
> order by/sort by in the sub query unless you use 'limit '. Could even go so
> far as barring it at the semantic level.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)