GitHub user junegunn commented on the issue: https://github.com/apache/spark/pull/16347

Hive makes sure that the output file is properly sorted by the column specified in the `SORT BY` clause by having only one reduce task (output) for each partition.

```
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: __________________
            Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
            Select Operator
              expressions: __ (type: bigint), ________ (type: string), ___ (type: string), _________ (type: string), _______________ (type: string), _______ (type: string), _____________ (type: string), ________ (type: string), __ (type: string), _________ (type: string), ________ (type: string), _______________ (type: string), _____________ (type: string), _____________ (type: string), ____________ (type: string), __________ (type: string), _____ (type: string), __________________ (type: string), ___ (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18
              Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col18 (type: string)
                Statistics: Num rows: 183663543 Data size: 313697356092 Basic stats: COMPLETE Column stats: PARTIAL
                value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string)
      Execution mode: vectorized
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: bigint), VALUE._col0 (type: string), VALUE._col1 (type: string), VALUE._col2 (type: string), VALUE._col3 (type: string), VALUE._col4 (type: string), VALUE._col5 (type: string), VALUE._col6 (type: string), VALUE._col7 (type: string), VALUE._col8 (type: string), VALUE._col9 (type: string), VALUE._col10 (type: string), VALUE._col11 (type: string), VALUE._col12 (type: string), VALUE._col13 (type: string), VALUE._col14 (type: string), VALUE._col15 (type: string), VALUE._col16 (type: string), VALUE._col17 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18
          Statistics: Num rows: 183663543 Data size: 33794091912 Basic stats: COMPLETE Column stats: PARTIAL
          File Output Operator
            compressed: false
            Statistics: Num rows: 183663543 Data size: 33794091912 Basic stats: COMPLETE Column stats: PARTIAL
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: _______________.________________
```

The later stage simply moves the resulting files to the corresponding partition directories. Since the patch no longer merges and I think I have made my point, I'm closing this.
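For reference, a dynamic-partition insert of roughly the following shape would produce a plan like the one above. This is only a hedged sketch: the table and column names (`target_table`, `source_table`, `id`, `c1`, `part_col`) are hypothetical placeholders, since the real identifiers are masked in the plan.

```sql
-- Hypothetical sketch of a query whose EXPLAIN output resembles the plan above.
-- DISTRIBUTE BY routes each partition value to a single reducer
-- (the "Map-reduce partition columns: _col18" line), so each partition's
-- output file is fully sorted by the SORT BY key
-- (the "key expressions: _col0 ... sort order: +" lines).
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION (part_col)
SELECT id, c1, /* ... remaining string columns ... */ part_col
FROM source_table
DISTRIBUTE BY part_col
SORT BY id;
```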