Ouch, I was getting tons of exceptions after turning on map-side
aggregation:

java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.String.getBytes(String.java:947)
at org.apache.hadoop.hive.serde2.thrift.TBinarySortableProtocol.writeString(TBinarySortableProtocol.java:299)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeTypeString.serialize(DynamicSerDeTypeString.java:65)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.serialize(DynamicSerDeFieldList.java:249)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.serialize(DynamicSerDeStructBase.java:81)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.serialize(DynamicSerDe.java:174)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:153)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:306)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:564)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:552)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.io.IOException: Task process exit with nonzero status of 1.
...
Just to confirm: is this a bug, or is it by design?
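In case it helps anyone hitting the same OOM, here is what I am planning to try next. The parameter names are my assumption from the Hive configuration docs (hive.map.aggr.hash.percentmemory and hive.groupby.mapaggr.checkinterval) and may not exist in older releases, so please correct me if they are wrong:

```sql
-- enable map-side aggregation as before
set hive.map.aggr=true;
-- flush the in-mapper hash table at a smaller fraction of the heap
-- (assumed knob: hive.map.aggr.hash.percentmemory, default 0.5)
set hive.map.aggr.hash.percentmemory=0.25;
-- check memory usage more frequently (assumed knob, default 100000 rows)
set hive.groupby.mapaggr.checkinterval=10000;
-- and/or give each task JVM more heap on the Hadoop side
set mapred.child.java.opts=-Xmx1024m;
```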
On Fri, Feb 27, 2009 at 10:02 AM, Namit Jain <[email protected]> wrote:

>  Yes, it flushes the data when the hash table is occupying too much memory
>
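To make sure I understand the flushing behavior you describe, here is a rough sketch of what I think the mapper is doing. This is my own pseudocode in Python, not Hive's actual implementation, and the entry-count threshold is a stand-in for whatever memory check Hive really performs:

```python
# Sketch of hash-based map-side aggregation with early flushing:
# partial aggregates live in an in-mapper hash table, and when the
# table grows past a threshold the partial results are emitted
# downstream and the table is cleared.
MAX_ENTRIES = 100_000  # stand-in for Hive's memory check

def map_side_aggregate(records, emit):
    table = {}
    for key, value in records:
        table[key] = table.get(key, 0) + value
        if len(table) >= MAX_ENTRIES:      # hash table "too big"
            for k, v in table.items():     # flush partial aggregates
                emit(k, v)
            table.clear()
    for k, v in table.items():             # final flush at close()
        emit(k, v)
```

If that is right, the reducer still has to combine the partial aggregates, since the same key can be flushed more than once per mapper.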
> *From:* Qing Yan [mailto:[email protected]]
> *Sent:* Thursday, February 26, 2009 5:58 PM
> *To:* [email protected]
> *Subject:* Re: Combine() optimization
>
> Got it.
>
> Does map-side aggregation have any special requirements about the dataset?
> E.g. the number of unique group-by keys could be too big to hold
> in memory. Will it still work?
>
> On Fri, Feb 27, 2009 at 5:50 AM, Zheng Shao <[email protected]> wrote:
>
> Hi Qing,
>
> We did think about the Combiner when we started Hive. However, earlier
> discussions led us to believe that hash-based aggregation inside the mapper
> would be as competitive as using a combiner in most cases.
>
> In order to enable map-side aggregation, we just need to do the following
> before running the hive query:
> set hive.map.aggr=true;
>
> Zheng
>
> On Thu, Feb 26, 2009 at 6:03 AM, Raghu Murthy <[email protected]> wrote:
>
> Right now Hive does not exploit the combiner. But hash-based map-side
> aggregation in hive (controlled by hints) provides a similar optimization.
> Using the combiner in addition to map-side aggregation should improve the
> performance even more if the combiner can further aggregate the partial
> aggregates generated from the mapper.
>
> On 2/26/09 5:57 AM, "Qing Yan" <[email protected]> wrote:
>
> > Is there any way/plan for Hive to take advantage of M/R's combine()
> > phase? There could be either rules embedded in the query optimizer or
> > hints passed by the user...
> > GROUP BY should benefit from this a lot.
> >
> > Any comment?
>
> --
> Yours,
> Zheng
