RE: Combine() optimization

Namit Jain Fri, 27 Feb 2009 07:57:40 -0800

Look at the patch for
http://issues.apache.org/jira/browse/HIVE-223

It has not been committed yet.


Thanks,
-namit

________________________________________
From: Qing Yan [[email protected]]
Sent: Friday, February 27, 2009 12:12 AM
To: [email protected]
Subject: Re: Combine() optimization

Ouch, I was getting tons of exceptions after turning on map-side aggregation:

java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.String.getBytes(String.java:947)
at org.apache.hadoop.hive.serde2.thrift.TBinarySortableProtocol.writeString(TBinarySortableProtocol.java:299)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeTypeString.serialize(DynamicSerDeTypeString.java:65)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.serialize(DynamicSerDeFieldList.java:249)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.serialize(DynamicSerDeStructBase.java:81)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.serialize(DynamicSerDe.java:174)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:153)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:306)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:564)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:552)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.io.IOException: Task process exit with nonzero status of 1.
...
Just to confirm: is this a bug, or is it by design?

On Fri, Feb 27, 2009 at 10:02 AM, Namit Jain <[email protected]> wrote:

Yes, it flushes the data when the hash table is occupying too much memory.
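
The flushing behaviour described above can be illustrated with a minimal Python sketch. This is not Hive's GroupByOperator code: the function names and the max_entries threshold are hypothetical, and counting hash-table entries is only a stand-in for the memory check mentioned in the reply. It shows a COUNT(*)-style aggregation where the mapper emits its partial counts and empties the table whenever the table grows too large, so the reducer may see the same key more than once.

```python
from collections import defaultdict


def map_side_aggregate(keys, max_entries):
    """Yield (key, partial_count) pairs for a COUNT(*) ... GROUP BY key.

    max_entries is a hypothetical stand-in for a memory threshold:
    when the hash table holds more distinct keys than this, all
    partial counts are emitted and the table is emptied.
    """
    table = defaultdict(int)
    for key in keys:
        table[key] += 1
        if len(table) > max_entries:
            # Flush: hand the partial aggregates downstream and start over.
            yield from table.items()
            table.clear()
    # Emit whatever is left when the input is exhausted.
    yield from table.items()


def reduce_side(partials):
    """Merge partial counts; a key may appear in several flushes."""
    totals = defaultdict(int)
    for key, count in partials:
        totals[key] += count
    return dict(totals)
```

Because the reduce side sums whatever partial counts arrive, the final result is the same whether or not a flush happened; the flush only bounds how many keys the mapper holds at once.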





From: Qing Yan [mailto:[email protected]]
Sent: Thursday, February 26, 2009 5:58 PM
To: [email protected]
Subject: Re: Combine() optimization



Got it.



Does map-side aggregation have any special requirements for the dataset? E.g., 
the number of unique GROUP BY keys could be too large to hold in memory. Will 
it still work?

On Fri, Feb 27, 2009 at 5:50 AM, Zheng Shao <[email protected]> wrote:

Hi Qing,

We did think about the combiner when we started Hive. However, earlier 
discussions led us to believe that hash-based aggregation inside the mapper 
would be competitive with using a combiner in most cases.

To enable map-side aggregation, we just need to run the following before 
running the Hive query:
set hive.map.aggr=true;

Zheng



On Thu, Feb 26, 2009 at 6:03 AM, Raghu Murthy <[email protected]> wrote:

Right now Hive does not exploit the combiner. But hash-based map-side
aggregation in Hive (controlled by hints) provides a similar optimization.
Using the combiner in addition to map-side aggregation should improve the
performance even more if the combiner can further aggregate the partial
aggregates generated from the mapper.
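
As a rough illustration of that last point, the Python sketch below shows what a combiner-style step would do with the partial aggregates a single mapper emits. It is a sketch under stated assumptions, not Hive or Hadoop code: the combine function is hypothetical, and it assumes a COUNT-style aggregate whose partial results can simply be summed.

```python
from collections import defaultdict


def combine(partials):
    """Merge (key, partial_count) pairs emitted by one mapper.

    Hypothetical combiner-style step: if the mapper emitted the same
    key several times, the partial counts are summed so that fewer
    records need to be shuffled to the reducers.
    """
    merged = defaultdict(int)
    for key, count in partials:
        merged[key] += count
    # Sorted only to make the output order deterministic.
    return sorted(merged.items())
```

For example, combine([("a", 2), ("b", 1), ("a", 1)]) collapses the two partial counts for "a" into a single record, so two records leave the mapper instead of three.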


On 2/26/09 5:57 AM, "Qing Yan" <[email protected]> wrote:

> Is there any way/plan for Hive to take advantage of M/R's combine()
> phase? There could be either rules embedded in the query optimizer or hints
> passed by the user...
> GROUP BY should benefit from this a lot.
>
> Any comment?
>
>
>



--
Yours,
Zheng
