Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Hive makes sure that the output file is properly sorted by the column
specified in the `SORT BY` clause by producing only one reduce task output
(one file) for each partition.
```
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: __________________
            Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: __ (type: bigint), ________ (type: string), ___ (type: string), _________ (type: string), _______________ (type: string), _______ (type: string), _____________ (type: string), ________ (type: string), __ (type: string), _________ (type: string), ________ (type: string), _______________ (type: string), _____________ (type: string), _____________ (type: string), ____________ (type: string), __________ (type: string), _____ (type: string), __________________ (type: string), ___ (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18
              Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col18 (type: string)
                Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
                value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string)
      Execution mode: vectorized
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: string), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: string), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string), VALUE._col10 (type: string), VALUE._col11 (type: string), VALUE._col12 (type: string), VALUE._col13 (type: string), VALUE._col14 (type: string), VALUE._col15 (type: string), VALUE._col16 (type: string), VALUE._col17 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18
          Statistics: Num rows: 183663543 Data size: 33794091912 Basic stats: COMPLETE Column stats: PARTIAL
          File Output Operator
            compressed: false
            Statistics: Num rows: 183663543 Data size: 33794091912 Basic stats: COMPLETE Column stats: PARTIAL
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: _______________.________________
```
The later stage simply moves the resulting files into the corresponding partition directories.
Since the patch no longer merges cleanly, and I think I have made my point, I'm
closing this.