[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Arvind Prabhakar (JIRA) Tue, 08 Jun 2010 23:50:41 -0700

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876979#action_12876979 ]


Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

I did some preliminary analysis for this JIRA and converted the 
{{HashMapWrapper}} to implement the {{java.util.Map}} interface. This required 
some changes all the way down to the underlying JDBM classes. 
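
For illustration only, the shape of a two-tier map exposed through 
{{java.util.Map}} might look like the sketch below. {{SpillingMap}} and its 
{{overflow}} argument are hypothetical names, not the actual 
{{HashMapWrapper}} code; a plain {{Map}} stands in for the JDBM-backed store.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Hive's HashMapWrapper: a bounded in-memory tier
// in front of an overflow store, exposed through java.util.Map.
class SpillingMap<K, V> extends AbstractMap<K, V> {
    private final int maxInMemory;                 // entry limit for the memory tier
    private final Map<K, V> memory = new HashMap<>();
    private final Map<K, V> overflow;              // stand-in for a disk-backed store

    SpillingMap(int maxInMemory, Map<K, V> overflow) {
        this.maxInMemory = maxInMemory;
        this.overflow = overflow;
    }

    @Override
    public V get(Object key) {
        return memory.containsKey(key) ? memory.get(key) : overflow.get(key);
    }

    @Override
    public boolean containsKey(Object key) {
        return memory.containsKey(key) || overflow.containsKey(key);
    }

    @Override
    public V put(K key, V value) {
        // Update a key where it already lives so the two tiers stay disjoint.
        if (memory.containsKey(key)) return memory.put(key, value);
        if (overflow.containsKey(key)) return overflow.put(key, value);
        // New key: use memory while there is room, otherwise spill.
        if (memory.size() < maxInMemory) return memory.put(key, value);
        return overflow.put(key, value);
    }

    @Override
    public V remove(Object key) {
        return memory.containsKey(key) ? memory.remove(key) : overflow.remove(key);
    }

    @Override
    public int size() {
        return memory.size() + overflow.size();
    }

    @Override
    public void clear() {
        memory.clear();
        overflow.clear();
    }

    @Override
    public Set<Map.Entry<K, V>> entrySet() {
        // Read-only snapshot of both tiers; a real implementation would
        // stream the overflow tier instead of copying it into memory.
        Set<Map.Entry<K, V>> all = new HashSet<>(memory.entrySet());
        all.addAll(overflow.entrySet());
        return Collections.unmodifiableSet(all);
    }
}
```

Callers see an ordinary {{Map}}; only the constructor knows about the limit 
and the overflow store.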

However, this alone is not sufficient to plug it into the {{GroupByOperator}} 
implementation, because the data stored in the {{HashMap}} is a mix of 
serializable Java objects and {{Writable}}s. Since {{Writable}}s cannot be 
serialized directly with Java serialization, it follows that in order to use 
this to fix the memory problem we need _an external serialization_ mechanism 
that can handle arbitrary mixed-type object graphs.
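
As a rough sketch of what such a mechanism has to do, the code below tags each 
value with its kind and dispatches on that tag. {{WritableLike}}, 
{{LongValue}} and {{MixedCodec}} are hypothetical names; {{WritableLike}} only 
mirrors the two methods of Hadoop's {{Writable}} so that the example compiles 
without Hadoop. It handles top-level values only. A {{Writable}} nested inside 
a serializable object graph would still fail, which is the harder part of the 
problem.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in that mirrors the two methods of org.apache.hadoop.io.Writable.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical Writable-style value, used only to exercise the codec.
class LongValue implements WritableLike {
    long value;
    LongValue() {}
    LongValue(long value) { this.value = value; }
    public void write(DataOutput out) throws IOException { out.writeLong(value); }
    public void readFields(DataInput in) throws IOException { value = in.readLong(); }
}

// Hypothetical codec: one tag byte selects the serialization path.
final class MixedCodec {
    private static final byte TAG_WRITABLE = 0;
    private static final byte TAG_SERIALIZABLE = 1;

    // Null values are not handled in this sketch.
    static byte[] encode(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (o instanceof WritableLike) {
            out.writeByte(TAG_WRITABLE);
            out.writeUTF(o.getClass().getName());   // needed to re-create the instance
            ((WritableLike) o).write(out);
        } else if (o instanceof Serializable) {
            out.writeByte(TAG_SERIALIZABLE);
            ObjectOutputStream oos = new ObjectOutputStream(out);
            oos.writeObject(o);
            oos.flush();
        } else {
            throw new IOException("Unsupported type: " + o.getClass().getName());
        }
        out.flush();
        return bytes.toByteArray();
    }

    static Object decode(byte[] data) throws IOException, ReflectiveOperationException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte tag = in.readByte();
        if (tag == TAG_WRITABLE) {
            // Requires a no-argument constructor, as Writable types conventionally have.
            WritableLike w = (WritableLike) Class.forName(in.readUTF())
                    .getDeclaredConstructor().newInstance();
            w.readFields(in);
            return w;
        }
        if (tag == TAG_SERIALIZABLE) {
            return new ObjectInputStream(in).readObject();
        }
        throw new IOException("Unknown tag: " + tag);
    }
}
```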

A trivial approach to address this would be to implement custom serialization 
using Java reflection, but that would incur the cost of excessive reflection 
and byte handling/marshaling.
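
To make the cost concrete, here is a minimal sketch of the reflective 
approach. {{ReflectiveWriter}} and {{Sample}} are hypothetical, and only a few 
field types are handled. Every field of every object goes through a 
{{java.lang.reflect.Field}} lookup and a type dispatch before any bytes are 
written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical sketch: writes the declared instance fields of one object.
// Superclass fields, nested objects and null strings are not handled.
final class ReflectiveWriter {
    static byte[] write(Object o) throws IOException, IllegalAccessException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Note: the order returned by getDeclaredFields() is not guaranteed,
        // so a real format would have to sort or name the fields.
        for (Field f : o.getClass().getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods)) {
                continue;                         // skip non-instance state
            }
            f.setAccessible(true);                // per-field access check
            Class<?> type = f.getType();
            if (type == int.class) {
                out.writeInt(f.getInt(o));
            } else if (type == long.class) {
                out.writeLong(f.getLong(o));
            } else if (type == double.class) {
                out.writeDouble(f.getDouble(o));
            } else if (type == String.class) {
                out.writeUTF((String) f.get(o));
            } else {
                throw new IOException("Unsupported field type: " + type.getName());
            }
        }
        out.flush();
        return bytes.toByteArray();
    }
}

// Hypothetical aggregation state used to exercise the writer.
class Sample {
    int count = 3;
    long sum = 10L;
    transient Object cache = new Object();        // skipped by the writer
}
```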

If you have any other ideas regarding this, please add them to the comments on 
this issue for consideration.


> GroupByOperator sometimes throws OutOfMemory error when there are too many 
> distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation is performed on a mapper, a HashMap is created to 
> keep all distinct keys in main memory. This can lead to an OOM exception when 
> there are too many distinct keys for a particular mapper. A workaround is to 
> set the map split size smaller so that each mapper processes fewer rows. 
> A better solution is to use the persistent HashMapWrapper (currently used in 
> CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
