[ 
https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-13293:
--------------------------
    Attachment: HIVE-13293.1.patch

I have tried both splitting the task and caching the RDD, and chose the latter 
here because it's simpler and works with queries that have only one 
ShuffleMapStage. Regarding performance, the two solutions perform roughly 
the same in my local tests. I used DISK_ONLY as the storage level, which I 
think is good enough for performance and avoids extra memory overhead.
Lifeng, could you help test the patch with your data set? Thanks.
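
For readers unfamiliar with why caching helps here, the following is a minimal sketch of the idea. It is not the code in HIVE-13293.1.patch and uses no Hive or Spark APIs. It assumes that a parallel order by makes two passes over its input (one to sample keys for range boundaries, one to sort), and that each pass re-evaluates the upstream stage unless its output is materialized first. All class and method names are hypothetical. The cache is a plain in-memory list and does not model the DISK_ONLY storage level.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a two-pass "sample, then sort" job.
// This is an illustration only, not Hive or Spark code.
class CacheBeforeSampleSketch {

    // Counts how many times the expensive upstream stage is evaluated.
    static final AtomicInteger upstreamRuns = new AtomicInteger();

    // Stand-in for the upstream stage that feeds the order by.
    static List<Integer> expensiveUpstream() {
        upstreamRuns.incrementAndGet();
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add((i * 7919) % 1000); // deterministic, unsorted keys
        }
        return rows;
    }

    // Pass 1: sample every 10th row and derive range-partition boundaries.
    static List<Integer> sampleBoundaries(List<Integer> rows, int partitions) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += 10) {
            sample.add(rows.get(i));
        }
        Collections.sort(sample);
        List<Integer> bounds = new ArrayList<>();
        for (int p = 1; p < partitions; p++) {
            bounds.add(sample.get(p * sample.size() / partitions));
        }
        return bounds;
    }

    // Pass 2: the actual sort over the full input.
    static List<Integer> sortAll(List<Integer> rows) {
        List<Integer> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        return sorted;
    }

    // Runs both passes; the source is asked for the input once per pass.
    static List<Integer> run(Supplier<List<Integer>> source) {
        sampleBoundaries(source.get(), 4);
        return sortAll(source.get());
    }

    // No cache: each pass re-evaluates the upstream stage.
    static int runsWithoutCache() {
        upstreamRuns.set(0);
        run(CacheBeforeSampleSketch::expensiveUpstream);
        return upstreamRuns.get();
    }

    // Cache: the upstream stage is materialized once and reused by both passes.
    static int runsWithCache() {
        upstreamRuns.set(0);
        List<Integer> cached = expensiveUpstream();
        run(() -> cached);
        return upstreamRuns.get();
    }

    public static void main(String[] args) {
        System.out.println("upstream runs without cache: " + runsWithoutCache());
        System.out.println("upstream runs with cache: " + runsWithCache());
    }
}
```

In this sketch the uncached run evaluates the upstream stage twice and the cached run evaluates it once. The price is storing the intermediate rows, which is the overhead the choice of storage level trades off.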

> Query occurs performance degradation after enabling parallel order by for 
> Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>         Attachments: HIVE-13293.1.patch
>
>
> I used TPCx-BB to run some performance tests on the Hive on Spark engine and 
> found that query 10 degrades in performance when parallel order by is enabled.
> It seems that sampling takes a long time before the real query runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
