[ https://issues.apache.org/jira/browse/HIVE-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Li updated HIVE-10458:
--------------------------
    Attachment: HIVE-10458.3-spark.patch

I found that Hive already supports parallel order by on MR (HIVE-1402), which makes 
use of Hadoop's {{TotalOrderPartitioner}}. For Spark, we should honor the flag 
{{hive.optimize.sampling.orderby}}, which controls whether parallel order by is 
enabled.
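
For reference, the flag can be set at the session level before a sorting query. The auxiliary sampling knobs below are my recollection of the MR-side settings and should be double-checked; the table name is just an example:

```sql
-- Enable sampling-based parallel order by (the flag discussed above).
set hive.optimize.sampling.orderby=true;
-- Sampling knobs on the MR side (names as I recall them; please verify):
set hive.optimize.sampling.orderby.number=1000;
set hive.optimize.sampling.orderby.percent=0.1;

-- With the flag on, this should no longer be forced to a single reducer.
SELECT * FROM src ORDER BY key;
```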
With multiple reducers, we'll have multiple output files of the sorted data. To 
achieve a global order, these files need to be read in the proper order. My 
understanding is that the file order is maintained by the file name, i.e. each 
file name contains the partition ID of the reducer, which is determined by the 
partitioner. If so, we don't have to do anything special for Spark. But I'm not 
sure about this. [~xuefuz], do you have any ideas?
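
To make the idea concrete, here is a minimal standalone simulation of what sampling-based total-order partitioning does: sample the keys, derive cut points, route each key to a range partition, sort each partition independently, and then read the partitions back in partition-ID order. This is not Hive/Hadoop code; all names are illustrative:

```python
import random


def sample_cut_points(keys, num_partitions, sample_size=100):
    """Sample the keys and derive (num_partitions - 1) split points."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_id(key, cut_points):
    """Assign a key to a range partition via the sorted cut points."""
    for pid, cut in enumerate(cut_points):
        if key < cut:
            return pid
    return len(cut_points)


def parallel_order_by(keys, num_partitions):
    cuts = sample_cut_points(keys, num_partitions)
    parts = [[] for _ in range(num_partitions)]
    for k in keys:
        parts[partition_id(k, cuts)].append(k)
    # Each "reducer" sorts its own partition independently.
    sorted_parts = [sorted(p) for p in parts]
    # Reading the outputs in partition-ID order (part-00000,
    # part-00001, ...) yields a globally sorted sequence, because
    # every key in partition i is < every key in partition i + 1.
    return [k for p in sorted_parts for k in p]


keys = [random.randrange(10_000) for _ in range(1_000)]
assert parallel_order_by(keys, 4) == sorted(keys)
```

The key property is that the partitioner makes the partitions range-disjoint, so concatenating the per-reducer outputs in partition-ID order is enough for a total order; no merge step is needed.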

> Enable parallel order by for spark [Spark Branch]
> -------------------------------------------------
>
>                 Key: HIVE-10458
>                 URL: https://issues.apache.org/jira/browse/HIVE-10458
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Rui Li
>            Assignee: Rui Li
>         Attachments: HIVE-10458.1-spark.patch, HIVE-10458.2-spark.patch, 
> HIVE-10458.3-spark.patch
>
>
> We don't have to force reducer# to 1 as spark supports parallel sorting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
