[jira] [Updated] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

Xuefu Zhang (JIRA) Thu, 19 Jan 2017 11:10:01 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xuefu Zhang updated HIVE-15580:
-------------------------------
    Description: 
Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory. 
For orderBy, Hive accumulates key groups using ArrayList (described in 
HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, 
which cannot spill to disk within a key group. Thus, for a large key group, 
memory usage is also unbounded.

Eliminating the unbounded memory usage is likely to impact performance. We 
will profile and optimize afterwards. We could also make this change 
configurable.
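
To illustrate the memory concern, here is a minimal sketch in plain Java (no 
Spark or Hive dependencies) of aggregating over rows that arrive already 
sorted by key, so that only the current key and a running value are held in 
memory rather than a whole key group. The class name StreamingGroupSum, the 
method sumByKey, and the sum aggregation are hypothetical and chosen only for 
illustration; this is not the code in the attached patches.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StreamingGroupSum {

    /**
     * Sums values per key over an iterator whose rows are sorted by key.
     * Only the current key and its running sum are kept while scanning a
     * group; no per-group list is built. The returned map stands in for
     * "emit downstream" and holds one entry per distinct key.
     */
    public static Map<String, Long> sumByKey(Iterator<Map.Entry<String, Long>> sortedRows) {
        Map<String, Long> out = new LinkedHashMap<>();
        String currentKey = null;
        long runningSum = 0L;
        boolean haveGroup = false;

        while (sortedRows.hasNext()) {
            Map.Entry<String, Long> row = sortedRows.next();
            // A key change marks the end of the previous group: emit it.
            if (haveGroup && !currentKey.equals(row.getKey())) {
                out.put(currentKey, runningSum);
                runningSum = 0L;
            }
            currentKey = row.getKey();
            runningSum += row.getValue();
            haveGroup = true;
        }
        // Emit the final group, if any rows were seen.
        if (haveGroup) {
            out.put(currentKey, runningSum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input, already sorted by key.
        List<Map.Entry<String, Long>> rows = List.of(
            Map.entry("a", 1L),
            Map.entry("a", 2L),
            Map.entry("b", 10L));
        System.out.println(sumByKey(rows.iterator())); // prints {a=3, b=10}
    }
}
```

The sketch assumes the input is sorted by key (for example, by a sort-based 
shuffle); with unsorted input a key could reappear and overwrite its earlier 
result.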

  was:Currently, orderBy (sortBy) and groupBy in Hive on Spark uses unbounded 
memory. For orderBy, Hive accumulates key groups using ArrayList (described in 
HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, 
which has a shortcoming of not being able to spill to disk within a key group. 
Thus, for large key group, memory usage is also unbounded.


> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -------------------------------------------------------------------------
>
>                 Key: HIVE-15580
>                 URL: https://issues.apache.org/jira/browse/HIVE-15580
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>
> Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded 
> memory. For orderBy, Hive accumulates key groups using ArrayList (described 
> in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, 
> which cannot spill to disk within a key group. Thus, for a large key group, 
> memory usage is also unbounded.
> Eliminating the unbounded memory usage is likely to impact performance. We 
> will profile and optimize afterwards. We could also make this change 
> configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
