[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541929#comment-16541929
 ] 

Szehon Ho edited comment on HIVE-20153 at 7/12/18 4:49 PM:
-----------------------------------------------------------

[~aihuaxu] do you think there is some way to improve this?  (I haven't yet 
looked at this code closely enough to understand it deeply.)  It seems to 
consume memory whether or not it's used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), 
min(), max() from table group by field;

For reference, I have also attached the heap dump of a mapper that was killed 
with an OOM: there are 3 million GenericUDAFCountEvaluator instances, each 
with a HashSet of uniqueObjects.
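
To make the concern concrete, here is a minimal Java sketch of the pattern I 
suspect. It is not Hive's actual code: the class names EagerCountBuffer and 
LazyCountBuffer and the 'distinct' flag are hypothetical, and only the field 
name uniqueObjects comes from the heap dump. It contrasts a per-group buffer 
that always allocates a set with one that allocates it only when distinct 
tracking is requested.

```java
import java.util.HashSet;
import java.util.Set;

public class CountBufferSketch {

    /** Hypothetical buffer that allocates a set for every group, used or not. */
    static final class EagerCountBuffer {
        long count;
        // One HashSet per group, even when the query is a plain count().
        final Set<Object> uniqueObjects = new HashSet<>();

        void add(Object value) {
            if (value != null) {
                count++;
            }
        }
    }

    /** Hypothetical buffer that allocates the set only for distinct counting. */
    static final class LazyCountBuffer {
        private final boolean distinct;
        long count;
        Set<Object> uniqueObjects; // stays null for a plain count()

        LazyCountBuffer(boolean distinct) {
            this.distinct = distinct;
        }

        void add(Object value) {
            if (value == null) {
                return; // count() ignores NULLs
            }
            if (distinct) {
                if (uniqueObjects == null) {
                    uniqueObjects = new HashSet<>();
                }
                if (!uniqueObjects.add(value)) {
                    return; // already seen, do not count again
                }
            }
            count++;
        }
    }

    public static void main(String[] args) {
        LazyCountBuffer plain = new LazyCountBuffer(false);
        LazyCountBuffer distinct = new LazyCountBuffer(true);
        for (String v : new String[] {"a", "b", "a"}) {
            plain.add(v);
            distinct.add(v);
        }
        System.out.println("plain count = " + plain.count
                + ", set allocated = " + (plain.uniqueObjects != null));
        System.out.println("distinct count = " + distinct.count
                + ", set allocated = " + (distinct.uniqueObjects != null));
    }
}
```

With millions of groups per mapper, the eager variant carries one extra set 
object per buffer, which is the kind of overhead the heap dump seems to show. 
Whether lazy allocation fits the real evaluator is something I have not 
checked.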

 

 

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

 


was (Author: szehon):
[~aihuaxu] do you think there is some way to improve this?  (I haven't yet 
looked at this code closely enough to understand it deeply.)  It seems to 
consume memory whether or not it's used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), 
min(), max() from table group by field;

For reference, I have also attached the heap dump of a mapper that was killed 
with an OOM: there are 3 million GenericUDAFCountEvaluator instances, each 
with a hashmap. I also don't know whether that is weird or not.

 

 

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

 

> Count and Sum UDF consume more memory in Hive 2+
> ------------------------------------------------
>
>                 Key: HIVE-20153
>                 URL: https://issues.apache.org/jira/browse/HIVE-20153
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 2.3.2
>            Reporter: Szehon Ho
>            Priority: Major
>         Attachments: Screen Shot 2018-07-12 at 6.41.28 PM.png
>
>
> While playing with Hive2, we noticed that queries with a lot of count() and 
> sum() aggregations run out of memory on the Hadoop side much faster than in 
> Hive1.  In many queries, we have to double the memory.
>  
> Taking a heap dump, we see that one of the main culprits is the field 
> 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to 
> support Window functions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
